Re: Loading data containing newlines

2016-01-15 Thread Alexander Pivovarov
Hive requires you to provide the table schema even if you create a table based on a
folder of ORC files (an ORC file already has the schema internally). It's a
shame, because ORC was originally a Hive-internal project.

Spark can create a table based on ORC or Parquet files automatically without
asking you to provide a schema.

val myTableDf =
getSqlContext.read.format("orc").load("s3n://mycompany/hive/warehouse/my_table_orc")

myTableDf.printSchema()

myTableDf.show()
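
For contrast, this is roughly what Hive currently requires for the same ORC
directory (the column list below is made up for illustration):

CREATE EXTERNAL TABLE my_table_orc (
  id BIGINT,
  name STRING
)
STORED AS ORC
LOCATION 's3n://mycompany/hive/warehouse/my_table_orc';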


You do not even need to register a table to do SQL-based ETL against
files (added in Spark 1.6):

SPARK-11197 SQL Queries on Files - Concise syntax for running SQL
queries over files of any supported format without registering a
table.

https://issues.apache.org/jira/browse/SPARK-11197
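
A minimal example of that syntax (Spark 1.6+), reusing the ORC path from above:

val cnt = getSqlContext.sql(
  "SELECT count(*) FROM orc.`s3n://mycompany/hive/warehouse/my_table_orc`")
cnt.show()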


I think it's now clearer why so many companies move to Spark to do ETL.


On Fri, Jan 15, 2016 at 3:06 PM, Alexander Pivovarov <apivova...@gmail.com>
wrote:

> Probably Bryan can try both Hive and Spark and decide which one works
> better for him.
>
> The fact is, lots of companies are migrating from Hadoop/Hive to Spark.
>
> If you like writing ETL using the Spark API then you can use the map, reduceByKey,
> groupByKey, join, distinct, etc. APIs.
> If you like using SQL then you can do it by running sqlContext.sql("select
> ...").
> In addition to SQL you can also use the DataFrame API.
>
> Hive only allows you to use SQL.
>
> BTW, most Hive UDFs are available in Spark.
> Plus, Spark allows you to create a UDF on the fly right in your script, e.g.
>
> sqlContext.udf.register("cube", (in: java.lang.Long) => in * in * in)
>
> sqlContext.sql("select cube(4) c").show()
>
> +---+
> | c |
> +---+
> | 64|
> +---+
>
> On Fri, Jan 15, 2016 at 3:03 PM, Mich Talebzadeh <m...@peridale.co.uk>
> wrote:
>
>> Ok but I believe there are other similar approaches.
>>
>>
>>
>> I can take a raw csv file and customize it using existing shell commands
>> like sed, awk, cut, grep, etc., among other things getting rid of blank lines or
>> replacing silly characters.
>>
>>
>>
>> Bottom line, I want to “eventually” store that csv file in a hive table in
>> a format that I can run sql queries on.
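>>
>> A minimal sketch of that kind of clean-up before loading (file names and the
>> staging path below are hypothetical):
>>
>> sed -e '/^$/d' -e 's/\r$//' raw.csv > clean.csv       # drop blank lines, strip CR
>> hdfs dfs -copyFromLocal clean.csv /data/stg/my_table/ # stage it for an EXTERNAL table
>> # then INSERT ... SELECT from the external table into an ORC table for querying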
>>
>>
>>
>> Is that a viable alternative?
>>
>>
>>
>> Thanks
>>
>>
>>
>>
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> *Sybase ASE 15 Gold Medal Award 2008*
>>
>> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>>
>>
>> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>>
>> Author of the books* "A Practitioner’s Guide to Upgrading to Sybase ASE
>> 15", ISBN 978-0-9563693-0-7*.
>>
>> co-author *"Sybase Transact SQL Guidelines Best Practices", ISBN
>> 978-0-9759693-0-4*
>>
>> *Publications due shortly:*
>>
>> *Complex Event Processing in Heterogeneous Environments*, ISBN:
>> 978-0-9563693-3-8
>>
>> *Oracle and Sybase, Concepts and Contrasts*, ISBN: 978-0-9563693-1-4, volume
>> one out shortly
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> NOTE: The information in this email is proprietary and confidential. This
>> message is for the designated recipient only, if you are not the intended
>> recipient, you should destroy it immediately. Any information in this
>> message shall not be understood as given or endorsed by Peridale Technology
>> Ltd, its subsidiaries or their employees, unless expressly so stated. It is
>> the responsibility of the recipient to ensure that this email is virus
>> free, therefore neither Peridale Technology Ltd, its subsidiaries nor their
>> employees accept any responsibility.
>>
>>
>>
>> *From:* Marcin Tustin [mailto:mtus...@handybook.com]
>> *Sent:* 15 January 2016 21:51
>>
>> *To:* user@hive.apache.org
>> *Subject:* Re: Loading data containing newlines
>>
>>
>>
>> You can open a file as an RDD of lines, and map whatever custom
>> tokenisation function you want over it; alternatively you can partition
>> down to a reasonable size and use map_partitions to map the standard python
>> csv parser over the partitions.
>>
>>
>>
>> In general, the advantage of spark is that you can do anything you like
>> rather than being limited to a specific set of prim

Re: Loading data containing newlines

2016-01-15 Thread Alexander Pivovarov
>
>
>
> *From:* Marcin Tustin [mailto:mtus...@handybook.com]
> *Sent:* 15 January 2016 21:39
> *To:* user@hive.apache.org
> *Subject:* Re: Loading data containing newlines
>
>
>
> I second this. I've generally found anything else to be disappointing when
> working with data which is at all funky.
>
>
>
> On Wed, Jan 13, 2016 at 8:13 PM, Alexander Pivovarov <apivova...@gmail.com>
> wrote:
>
> Time to use Spark and Spark-Sql in addition to Hive?
>
> It's probably going to happen sooner or later anyway.
>
>
>
> I sent you Spark solution yesterday.  (you just need to write 
> unbzip2AndCsvToListOfArrays(file:
> String): List[Array[String]]  function using BZip2CompressorInputStream
> and Super CSV API)
>
> you can download spark,  open spark-shell and run/debug the program on a
> single computer
>
>
>
> and then run it on cluster if needed   (e.g. Amazon EMR can spin up Spark
> cluster in 7 min)
>
>
>
> On Wed, Jan 13, 2016 at 4:13 PM, Gerber, Bryan W <bryan.ger...@pnnl.gov>
> wrote:
>
> 1.   hdfs dfs -copyFromLocal /incoming/files/*.bz2  hdfs://
> host.name/data/stg/table/
>
> 2.   CREATE EXTERNAL TABLE stg_ (cols…) ROW FORMAT serde
> 'org.apache.hadoop.hive.serde2.OpenCSVSerde' STORED AS TEXTFILE LOCATION
> ‘/data/stg/table/’
>
> 3.   CREATE TABLE  (cols…) STORED AS ORC  tblproperties
> ("orc.compress"="ZLIB");
>
> 4.   INSERT INTO TABLE  SELECT cols, udf1(cola),
> udf2(colb),functions(),etc. FROM ext_
>
> 5.   Delete files from hdfs://host.name/data/stg/table/
>
>
>
> This has been working quite well, until our newest data contains fields
> with embedded newlines.
>
>
>
> We are now looking into options further up the pipeline to see if we can
> condition the data earlier in the process.
>
>
>
> *From:* Mich Talebzadeh [mailto:m...@peridale.co.uk]
> *Sent:* Wednesday, January 13, 2016 10:34 AM
>
>
> *To:* user@hive.apache.org
> *Subject:* RE: Loading data containing newlines
>
>
>
> Thanks Brian.
>
>
>
>> Just to clarify, do you use something like the below?
>
>
>
> 1.  hdfs dfs -copyFromLocal /var/tmp/t.bcp hdfs://
> rhes564.hedat.net:9000/misc/t.bcp
>
> 2.  CREATE EXTERNAL TABLE  name (col1 INT, col2 string, …) COMMENT
>> 'load from bcp file' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED
> AS ORC
>
>
>
> Cheers,
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
>
>
>
> *From:* Gerber, Bryan W [mailto:bryan.ger...@pnnl.gov]
> *Sent:* 13 January 2016 18:12
> *To:* user@hive.apache.org
> *Subject:* RE: Loading data containing newlines
>
>
>
> We are pushing the compressed text files into HDFS directory for Hive
> EXTERNAL table, then using an INSERT on

Re: Loading data containing newlines

2016-01-13 Thread Alexander Pivovarov
Time to use Spark and Spark-Sql in addition to Hive?
It's probably going to happen sooner or later anyway.

I sent you Spark solution yesterday.  (you just need to write
unbzip2AndCsvToListOfArrays(file: String): List[Array[String]]  function
using BZip2CompressorInputStream and Super CSV API)
you can download spark,  open spark-shell and run/debug the program on a
single computer

and then run it on cluster if needed   (e.g. Amazon EMR can spin up Spark
cluster in 7 min)

On Wed, Jan 13, 2016 at 4:13 PM, Gerber, Bryan W 
wrote:

> 1.   hdfs dfs -copyFromLocal /incoming/files/*.bz2  hdfs://
> host.name/data/stg/table/
>
> 2.   CREATE EXTERNAL TABLE stg_ (cols…) ROW FORMAT serde
> 'org.apache.hadoop.hive.serde2.OpenCSVSerde' STORED AS TEXTFILE LOCATION
> ‘/data/stg/table/’
>
> 3.   CREATE TABLE  (cols…) STORED AS ORC  tblproperties
> ("orc.compress"="ZLIB");
>
> 4.   INSERT INTO TABLE  SELECT cols, udf1(cola),
> udf2(colb),functions(),etc. FROM ext_
>
> 5.   Delete files from hdfs://host.name/data/stg/table/
>
>
>
> This has been working quite well, until our newest data contains fields
> with embedded newlines.
>
>
>
> We are now looking into options further up the pipeline to see if we can
> condition the data earlier in the process.
>
>
>
> *From:* Mich Talebzadeh [mailto:m...@peridale.co.uk]
> *Sent:* Wednesday, January 13, 2016 10:34 AM
>
> *To:* user@hive.apache.org
> *Subject:* RE: Loading data containing newlines
>
>
>
> Thanks Brian.
>
>
>
> Just to clarify, do you use something like the below?
>
>
>
> 1.  hdfs dfs -copyFromLocal /var/tmp/t.bcp hdfs://
> rhes564.hedat.net:9000/misc/t.bcp
>
> 2.  CREATE EXTERNAL TABLE  name (col1 INT, col2 string, …) COMMENT
> 'load from bcp file' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED
> AS ORC
>
>
>
> Cheers,
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
>
>
>
> *From:* Gerber, Bryan W [mailto:bryan.ger...@pnnl.gov]
> *Sent:* 13 January 2016 18:12
> *To:* user@hive.apache.org
> *Subject:* RE: Loading data containing newlines
>
>
>
> We are pushing the compressed text files into HDFS directory for Hive
> EXTERNAL table, then using an INSERT on the table using ORC storage. We are
> letting Hive handle the ORC file creation process.
>
>
>
> *From:* Mich Talebzadeh [mailto:m...@peridale.co.uk ]
>
> *Sent:* Tuesday, January 12, 2016 4:41 PM
> *To:* user@hive.apache.org
> *Subject:* RE: Loading data containing newlines
>
>
>
> Hi Bryan,
>
>
>
> As a matter of interest are you loading text files into local directories
> in encrypted format at all and then push it into HDFS/Hive as ORC?
>
>
>
> Thanks
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>

Re: Loading data containing newlines

2016-01-12 Thread Alexander Pivovarov
Try the CSV serde. It should correctly parse a quoted field value having a newline
inside:
https://cwiki.apache.org/confluence/display/Hive/CSV+Serde

Hadoop should automatically read bz2 files


On Tue, Jan 12, 2016 at 9:40 AM, Gerber, Bryan W 
wrote:

> We are attempting to load CSV text files (compressed to bz2) containing
> newlines in fields using EXTERNAL tables and INSERT/SELECT into ORC format
> tables.  Data volume is ~1TB/day, we are really trying to avoid unpacking
> them to condition the data.
>
>
>
> A few days of research has us ready to implement custom  input/output
> formats to handle the ingest.  Any other suggestions that may be less
> effort with low impact to load times?
>
>
>
> Thanks,
>
> Bryan G.
>


RE: Loading data containing newlines

2016-01-12 Thread Alexander Pivovarov
I tried spark-csv with a file having a newline inside a field value - it does not
work either.

$ cat /tmp/cars.csv
1,"Hello1
world"
2,"Hello2"
3,"Hello3"

scala> val df = sqlContext.read.
 |   format("com.databricks.spark.csv").
 |   load("/tmp/cars.csv")
java.io.IOException: (startline 1) EOF reached before encapsulated token
finished
at org.apache.commons.csv.Lexer.parseEncapsulatedToken(Lexer.java:282)
at org.apache.commons.csv.Lexer.nextToken(Lexer.java:152)
at org.apache.commons.csv.CSVParser.nextRecord(CSVParser.java:498)
at org.apache.commons.csv.CSVParser.getRecords(CSVParser.java:365)
at com.databricks.spark.csv.CsvRelation.inferSchema(CsvRelation.scala:223)
at com.databricks.spark.csv.CsvRelation.<init>(CsvRelation.scala:72)
at
com.databricks.spark.csv.DefaultSource.createRelation(DefaultSource.scala:157)
at
com.databricks.spark.csv.DefaultSource.createRelation(DefaultSource.scala:44)

What you can do is:
1. read the list of files into an Array[String]
val files = getListOfFiles(dir)

2. create an RDD out of it and repartition by files.length (so each task gets
one file)
val filesRdd = sc.parallelize(files, files.length)

3. unbzip2 and parse. 1 file = 1 task

val lines: RDD[Array[String]] = filesRdd.flatMap(file =>
unbzip2AndCsvToListOfArrays(file))

unbzip2AndCsvToListOfArrays(file: String): List[Array[String]] can use a CSV
parser which understands newlines inside field values, e.g. Super CSV (a sketch
follows after step 6 below)

4. create RDD of Rows
val rows = lines.map(line => Row.fromSeq(line.toSeq))

5. create dataframe
val df = getSqlContext.createDataFrame(rows, schema)

schema describes the column names and types.

6. save df as ORC
df.repartition(outputFilesCount).write.format("orc").save(outputPath)
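
For step 3, a rough sketch of that helper (error handling kept minimal; it assumes
the files are reachable through the Hadoop FileSystem API and uses Super CSV's
CsvListReader, which understands quoted fields that contain newlines):

import java.io.InputStreamReader
import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.supercsv.io.CsvListReader
import org.supercsv.prefs.CsvPreference
import scala.collection.JavaConverters._
import scala.collection.mutable.ListBuffer

def unbzip2AndCsvToListOfArrays(file: String): List[Array[String]] = {
  // open the (possibly remote) file and wrap it in a bzip2 decompressor
  val fs = FileSystem.get(new java.net.URI(file), new Configuration())
  val in = new BZip2CompressorInputStream(fs.open(new Path(file)))
  val reader = new CsvListReader(new InputStreamReader(in, "UTF-8"),
                                 CsvPreference.STANDARD_PREFERENCE)
  val rows = ListBuffer[Array[String]]()
  try {
    var row = reader.read()            // returns null at end of file
    while (row != null) {
      // empty CSV fields come back as null; replace them with ""
      rows += row.asScala.map(Option(_).getOrElse("")).toArray
      row = reader.read()
    }
  } finally {
    reader.close()                     // also closes the wrapped streams
  }
  rows.toList
}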


On Jan 12, 2016 9:58 AM, "Gerber, Bryan W" <bryan.ger...@pnnl.gov> wrote:

> From that wiki:
>
> "This SerDe works for most CSV data, but does not handle embedded
> newlines."
>
>
>
> The Hive SerDe interface is all downstream of the TextInputFormat, which
> has already split records by newlines.  In theory you can give it a
> different line delimiter, but Hive 1.2.1 does not support it: "FAILED:
> SemanticException 3:20 LINES TERMINATED BY only supports newline '\n' right
> now."
>
>
>
> *From:* Alexander Pivovarov [mailto:apivova...@gmail.com]
> *Sent:* Tuesday, January 12, 2016 9:52 AM
> *To:* user@hive.apache.org
> *Subject:* Re: Loading data containing newlines
>
>
>
> Try CSV serde. It should correctly parse quoted field value having newline
> inside
>
> https://cwiki.apache.org/confluence/display/Hive/CSV+Serde
>
>
>
> Hadoop should automatically read bz2 files
>
>
>
>
>
> On Tue, Jan 12, 2016 at 9:40 AM, Gerber, Bryan W <bryan.ger...@pnnl.gov>
> wrote:
>
> We are attempting to load CSV text files (compressed to bz2) containing
> newlines in fields using EXTERNAL tables and INSERT/SELECT into ORC format
> tables.  Data volume is ~1TB/day, we are really trying to avoid unpacking
> them to condition the data.
>
>
>
> A few days of research has us ready to implement custom  input/output
> formats to handle the ingest.  Any other suggestions that may be less
> effort with low impact to load times?
>
>
>
> Thanks,
>
> Bryan G.
>
>
>


Re: Create table from ORC or Parquet file?

2015-12-09 Thread Alexander Pivovarov
E.g. in Spark SQL I can create a temporary table from ORC, Parquet or JSON
files without specifying column names and types:

val myDf = sqlContext.read.format("orc").load("s3n://alex/test/mytable_orc")

myDf.printSchema
root
 |-- id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- rc_state: string (nullable = true)
 |-- rc_county_name: string (nullable = true)

myDf.registerTempTable("mytable")
val res = sqlContext.sql("""
  select rc_state, count(*) cnt
  from mytable
  group by rc_state
  order by rc_state""")

res.show(10)
++---+
|rc_state|cnt|
++---+
|  AK| 37|
|  AL|224|
|  AR|109|
|  AZ| 81|
|  CA|417|
|  CO|145|
|  CT| 71|
|  DC| 15|
|  DE| 27|
|  FL|452|
++---+
only showing top 10 rows

Lots of companies are switching to Spark for ETL, but Hive is still used by many
people, reporting tools and legacy solutions to select data from files
(tables) prepared by Spark.
It would be nice if Hive could create a table based on ORC or Parquet file(s)
without specifying table columns and types. Integration with Spark output
would be easier.


On Wed, Dec 9, 2015 at 9:50 AM, Owen O'Malley <omal...@apache.org> wrote:

> So your use case is that you already have the ORC files and you want a
> table that can read those files without specifying the columns in the
> table? Obviously without the columns being specified Hive wouldn't be able
> to write to that table, so I assume you only care about reading it. Is that
> right?
>
> .. Owen
>
> On Wed, Dec 2, 2015 at 9:53 PM, Alexander Pivovarov <apivova...@gmail.com>
> wrote:
>
>> Hi Everyone
>>
>> Is it possible to create a Hive table from an ORC or Parquet file without
>> specifying field names and their types? ORC or Parquet files contain the field
>> name and type information inside.
>>
>> Alex
>>
>
>


Create table from ORC or Parquet file?

2015-12-02 Thread Alexander Pivovarov
Hi Everyone

Is it possible to create a Hive table from an ORC or Parquet file without
specifying field names and their types? ORC or Parquet files contain the field
name and type information inside.

Alex


Re: join 2 tables located on different clusters

2015-06-25 Thread Alexander Pivovarov
I tried to reproduce Wrong FS issue in several hive branches

branch-0.14 - works
branch-1.0 - works
branch-1.1 - throws exception

Looks like the error was introduced in 1.1.0 by the following
https://issues.apache.org/jira/browse/HIVE-9264

I opened new JIRA for the issue
https://issues.apache.org/jira/browse/HIVE-6


On Wed, Jun 24, 2015 at 4:08 PM, Alexander Pivovarov apivova...@gmail.com
wrote:

 I tried on local hadoop/hive instance  (hive is the latest from master
 branch)

 mydev is an HA alias to a remote HA name node.

 $ hadoop fs -ls hdfs://mydev/tmp/et1
 Found 1 items
 -rw-r--r--   3 myapp hadoop 16 2015-06-24 16:05
 hdfs://mydev/tmp/et1/et1file

 $ hive

 hive CREATE TABLE et1 (
   a string
 ) stored as textfile
 LOCATION 'hdfs://mydev/tmp/et1';

 hive select * from et1;

 15/06/24 16:01:08 [main]: ERROR parse.CalcitePlanner:
 org.apache.hadoop.hive.ql.metadata.HiveException: Unable to determine if
 hdfs://mydev/tmp/et1 is encrypted: java.lang.IllegalArgumentException:
 Wrong FS: hdfs://mydev/tmp/et1, expected: hdfs://localhost:8020
 at
 org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.isPathEncrypted(SemanticAnalyzer.java:1870)
 at
 org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getStrongestEncryptedTablePath(SemanticAnalyzer.java:1947)
 at
 org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getStagingDirectoryPathname(SemanticAnalyzer.java:1979)
 at
 org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getMetaData(SemanticAnalyzer.java:1792)
 at
 org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getMetaData(SemanticAnalyzer.java:1527)
 at
 org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genResolvedParseTree(SemanticAnalyzer.java:10057)
 at
 org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:10108)
 at
 org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:207)
 at
 org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:227)
 at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:424)
 at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:308)
 at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1124)
 at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1172)
 at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1061)
 at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1051)
 at
 org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:213)
 at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:165)
 at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
 at
 org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:736)
 at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:681)
 at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:621)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
 at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
 Caused by: java.lang.IllegalArgumentException: Wrong FS:
 hdfs://mydev/tmp/et1, expected: hdfs://localhost:8020
 at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:645)
 at
 org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:193)
 at
 org.apache.hadoop.hdfs.DistributedFileSystem.getEZForPath(DistributedFileSystem.java:1906)
 at
 org.apache.hadoop.hdfs.client.HdfsAdmin.getEncryptionZoneForPath(HdfsAdmin.java:262)
 at
 org.apache.hadoop.hive.shims.Hadoop23Shims$HdfsEncryptionShim.isPathEncrypted(Hadoop23Shims.java:1210)
 at
 org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.isPathEncrypted(SemanticAnalyzer.java:1866)
 ... 26 more

 FAILED: SemanticException Unable to determine if hdfs://mydev/tmp/et1 is
 encrypted: java.lang.IllegalArgumentException: Wrong FS:
 hdfs://mydev/tmp/et1, expected: hdfs://localhost:8020
 15/06/24 16:01:08 [main]: ERROR ql.Driver: FAILED: SemanticException
 Unable to determine if hdfs://mydev/tmp/et1 is encrypted:
 java.lang.IllegalArgumentException: Wrong FS: hdfs://mydev/tmp/et1,
 expected: hdfs://localhost:8020
 org.apache.hadoop.hive.ql.parse.SemanticException: Unable to determine if
 hdfs://mydev/tmp/et1 is encrypted: java.lang.IllegalArgumentException:
 Wrong FS: hdfs://mydev/tmp/et1, expected: hdfs://localhost:8020
 at
 org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getMetaData(SemanticAnalyzer.java:1850)
 at
 org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getMetaData(SemanticAnalyzer.java:1527)
 at
 org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genResolvedParseTree(SemanticAnalyzer.java:10057

join 2 tables located on different clusters

2015-06-24 Thread Alexander Pivovarov
Hello Everyone

Can I define an external table on cluster_1 pointing to an hdfs location on
cluster_2?
I tried and got some strange exception in hive
FAILED: Execution Error, return code 1 from
org.apache.hadoop.hive.ql.exec.DDLTask.
MetaException(message:java.lang.reflect.InvocationTargetException)

I want to do a full outer join between table A which exists on cluster_1 and table
A on cluster_2.

My idea was to create an external table A_2 (on cluster_1) which points to
cluster_2 and run the hive query on cluster_1:

select a.*, a_2.*
from a
full outer join a_2 on (a.id = a_2.id)
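
A sketch of the DDL I had in mind for A_2 (the namenode address and the columns
are made up for illustration):

CREATE EXTERNAL TABLE a_2 (
  id BIGINT,
  name STRING
)
STORED AS TEXTFILE
LOCATION 'hdfs://cluster2-namenode:8020/apps/hive/warehouse/a';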


Re: join 2 tables located on different clusters

2015-06-24 Thread Alexander Pivovarov
(Driver.java:1172)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1061)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1051)
at
org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:213)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:165)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
at
org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:736)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:681)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:621)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to
determine if hdfs://mydev/tmp/et1 is encrypted:
java.lang.IllegalArgumentException: Wrong FS: hdfs://mydev/tmp/et1,
expected: hdfs://localhost:8020
at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.isPathEncrypted(SemanticAnalyzer.java:1870)
at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getStrongestEncryptedTablePath(SemanticAnalyzer.java:1947)
at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getStagingDirectoryPathname(SemanticAnalyzer.java:1979)
at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getMetaData(SemanticAnalyzer.java:1792)
... 23 more
Caused by: java.lang.IllegalArgumentException: Wrong FS:
hdfs://mydev/tmp/et1, expected: hdfs://localhost:8020
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:645)
at
org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:193)
at
org.apache.hadoop.hdfs.DistributedFileSystem.getEZForPath(DistributedFileSystem.java:1906)
at
org.apache.hadoop.hdfs.client.HdfsAdmin.getEncryptionZoneForPath(HdfsAdmin.java:262)
at
org.apache.hadoop.hive.shims.Hadoop23Shims$HdfsEncryptionShim.isPathEncrypted(Hadoop23Shims.java:1210)
at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.isPathEncrypted(SemanticAnalyzer.java:1866)
... 26 more


On Wed, Jun 24, 2015 at 1:37 PM, Edward Capriolo edlinuxg...@gmail.com
wrote:

 I do not know what your exact problem is. Set you debug logging on. This
 can be done however assuming both clusters have network access to each other

 On Wed, Jun 24, 2015 at 4:33 PM, Alexander Pivovarov apivova...@gmail.com
  wrote:

 Hello Everyone

 Can I define external table on cluster_1 pointing to hdfs location on
 cluster_2?
 I tried and got some strange exception in hive
 FAILED: Execution Error, return code 1 from
 org.apache.hadoop.hive.ql.exec.DDLTask.
 MetaException(message:java.lang.reflect.InvocationTargetException)

 I want to do full outer join btw table A which exist on cluster_1 and
 table A on cluster_2.

 My idea was to create external table A_2 (on cluster_1) which points to
 cluster_2 and run hive query on cluster_1

 select a.*, a_2.*
 from a
 full outer join a_2 on (a.id = a_2.id)





Re: Hive on Spark VS Spark SQL

2015-05-20 Thread Alexander Pivovarov
Thank you Xuefu!

Excellent explanation and comparison!
We should put it on the Hive on Spark wiki.
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark


On Wed, May 20, 2015 at 10:45 AM, Xuefu Zhang xzh...@cloudera.com wrote:

 I have been working on Hive on Spark, and know a little about SparkSQL.
 Here are a few factors to be considered:

 1. SparkSQL is similar to Shark (discontinued) in that it clones Hive's
 front end (parser and semantic analyzer) and metastore, and injects in
 between a layer where Hive's operator tree is reinterpreted in Spark's
 constructs (transformations and actions). Thus, it's tied to a specific
 version of Hive, which is always behind official Hive releases.
 2. Because of the reinterpretation, many features (window functions,
 lateral views, etc.) from Hive need to be reimplemented in the Spark world. If
 an implementation hasn't been done, you see a gap. That's why you would
 expect functional disparity, not to mention future Hive features.
 3. SparkSQL is far from production ready.
 4. On the other hand, Hive on Spark is native in Hive, embracing all Hive
 features and growing with Hive. Hive's operators are honored without
 re-interpretation. The integration is done at the execution layer, where
 Spark is nothing but an advanced MapReduce engine.
 5. Hive is aiming at enterprise use cases, where there are more important
 concerns such as security than purely if it works or if it runs fast. Hive
 on Spark certainly makes the query run faster, but still keeps the same
 enterprise-readiness.
 6. SparkSQL is a good fit if you're a heavy Spark user who occasionally
 needs to run some SQL. Or you're a casual SQL user and like to try
 something new.
 7. If you haven't touched either Spark or Hive, I'd suggest you start with
 Hive, especially for an enterprise.
 8. If you're an existing Hive user and consider taking advantage of Spark,
 consider Hive on Spark.
 9. It's strongly discouraged to mix Hive and SparkSQL in your deployment.
 SparkSQL includes a version of Hive, which is very likely at a different
 version of the Hive that you have (even if you don't use Hive on Spark).
 Library conflicts can put you in a nightmare.
 10. I haven't benchmarked SparkSQL myself, but I have heard several reports
 that SparkSQL, when tried at scale, is either fast or fails your
 queries.

 Hope this helps.

 Thanks,


 On Tue, May 19, 2015 at 10:38 PM, guoqing0...@yahoo.com.hk 
 guoqing0...@yahoo.com.hk wrote:

 Between Hive on Spark and SparkSQL, which is better, and what are the key
 characteristics, advantages and disadvantages of each?

 --
 guoqing0...@yahoo.com.hk





How to compare data in two tables?

2015-04-27 Thread Alexander Pivovarov
Hi Everyone

Let's say I have a hive table in 2 datacenters. The table format can be textfile
or ORC.
There is a sqoop job running every day which adds data to the table.

Each datacenter has its own instance of the sqoop job.
In the ideal case the data in these two tables should be the same.

"The same" means that the row count is the same and the tables contain the same rows.
However, row order can be different. The number of files and their size can also
be different.

Is there a way to scan the table and get some hashcode which can be used to
compare tables?

Thank you
Alex


RE: How to compare data in two tables?

2015-04-27 Thread Alexander Pivovarov
The golden source is an Oracle DB.

I have two cases:

1. Tables are overwritten completely every day.

2. Tables are incrementally loaded. The PK is an auto-incremented number in Oracle.

What do you think if I concat all cells of a row to a string, get an int hashcode
from the string,
and then sum the hashcodes to get a final number for the table?
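
Something like this, where the table and column names are just placeholders:

SELECT COUNT(*) AS row_cnt,
       SUM(CAST(hash(concat_ws('|', CAST(id AS STRING), col1, col2)) AS BIGINT)) AS table_hash
FROM my_table;

Running it in both datacenters and comparing (row_cnt, table_hash) would detect
differences regardless of row order, although hash collisions could in theory hide
a mismatch.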


On Apr 27, 2015 1:45 PM, Mich Talebzadeh m...@peridale.co.uk wrote:

 Hi Alex,



 Am I correct that the source of data resides in a relational table and
that table has all the data already (the golden source) sent to both
instances of Hive? Is the data in Hive added incrementally daily with
“operation timestamp”  for each record? Also do you have a unique
identifier for each row in each table?



 HTH



 Mich Talebzadeh



 http://talebzadehmich.wordpress.com






 From: Alexander Pivovarov [mailto:apivova...@gmail.com]
 Sent: 27 April 2015 21:27
 To: user@hive.apache.org
 Subject: How to compare data in two tables?



 Hi Everyone

 Lets say I have hive table in 2 datacenters. Table format can be textfile
or Orc.

 There is scoop job running every day which adds data to the table.

 Each datacenter has its own instance of scoop job.

 In Ideal case scenario the data in these two table should be the same.


 The same means that row count is the same and tables contain the same
rows.

 However row order can be different. number of files and their size also
can be different.



 Is there a way to scan the table and get some hashcode which can be used
to compare tables?

 Thank you

 Alex


Re: create table fails with exception unable to rename tmp file

2015-04-11 Thread Alexander Pivovarov
Maybe the user which runs the hive cli does not have write permissions on
hdfs://zhangj05-a:8020/user/hive/warehouse/reporting.db.

Who is the owner of hdfs://zhangj05-a:8020/user/hive/warehouse/reporting.db?

What user runs the hive cli?
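
For example, something like this could confirm and fix it (the user/group below
are guesses based on the paths in the error):

hdfs dfs -ls -d /user/hive/warehouse/reporting.db
sudo -u hdfs hdfs dfs -chown -R zhangj:hadoop /user/hive/warehouse/reporting.db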



On Sat, Apr 11, 2015 at 11:07 AM, Jie Zhang jiezh2...@gmail.com wrote:

 Hi,

 I hit the following error when running a CTAS statement. It looks like an hdfs
 permission issue since the temp file cannot be renamed. Maybe I missed
 setting some property? My hive version is 0.14.0. Any help is appreciated!

 hive create table reporting.test2 as select count(uuid) from store.imr;

 Moving data to:
 hdfs://zhangj05-a:8020/user/hive/warehouse/reporting.db/test2

 Failed with exception Unable to rename:
 hdfs://zhangj05-a:8020/tmp/hive/zhangj/d79a6a8f-5e04-4bba-945f-eb6e010806b6/hive_2015-04-11_17-32-16_509_1852350843116431139-1/-ext-10001
 to: hdfs://zhangj05-a:8020/user/hive/warehouse/reporting.db/test2

 FAILED: Execution Error, return code 1 from
 org.apache.hadoop.hive.ql.exec.MoveTask


 Jessica



Re: Unsubscribe Me

2015-04-07 Thread Alexander Pivovarov
Ashish, Read The Friendly Manual below
https://hive.apache.org/mailing_lists.html

On Tue, Apr 7, 2015 at 2:15 PM, Ashish Garg gargcreation1...@gmail.com
wrote:

 Hello Admin,

 Please unsubscribe me.

 Regards,
 Ashish Garg



Re: adding a local jar for UDF test

2015-04-01 Thread Alexander Pivovarov
I can suggest 3 options

1. you can use JUnit test to test your UDF (e.g. TestGenericUDFLastDay)

2. you can create q file and test your UDF via mvn (look at udf_last_day.q)
mvn clean install -DskipTests -Phadoop-2
cd itest/qtest
mvn test -Dtest=TestCliDriver -Dqfile=udf_last_day.q
-Dtest.output.overwrite=true -Phadoop-2

3. you can build hive with dist profile and run hive cli locally
mvn clean install -DskipTests -Phadoop-2,dist
cd
packaging/target/apache-hive-1.2.0-SNAPSHOT-bin/apache-hive-1.2.0-SNAPSHOT-bin/
bin/hive

Maybe you can run hive using local engine but I decided to have hdfs and
yarn running locally

After you have hdfs and yarn running locally you can create hive-site.xml
similar to the following

sample hive config
$ echo $HIVE_CONF_DIR
/etc/hive/apivovarov-conf

$ cat /etc/hive/apivovarov-conf/hive-site.xml

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>hdfs://localhost/apps/apivovarov/warehouse</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:;databaseName=/tmp/apivovarov_metastore_db;create=true</value>
  </property>
</configuration>

On Wed, Apr 1, 2015 at 10:14 AM, Alex Bohr a...@gradientx.com wrote:

 Hi,
 I'm developing a new UDF.
 I want to be able to test the new jar as I develop without having to copy
 the Jar up to HDFS every time I change code and recompile the Jar.

 I'm using Hive CLI for testing, and adding this command:
 add jar 'file:///home/abohr/test/hive-udf-1.0-SNAPSHOT.jar';

 I've also used variations on the local file system protocol:
 add jar 'file://home/abohr/test/hive-udf-1.0-SNAPSHOT.jar';
 add jar 'file:/home/abohr/test/hive-udf-1.0-SNAPSHOT.jar';
 add jar '/home/abohr/test/hive-udf-1.0-SNAPSHOT.jar';

 but keep getting this error:
 Query returned non-zero code: 1,
 cause: '/home/abohr/test/hive-udf-1.0-SNAPSHOT.jar' does not exist.

 Is it not possible to test UDFs with a jar on the local filesystem?
 I don't want to add the jar to our configured aux-jars path that
 HiveServer2 reads from until I'm done testing.


 Thanks



Re: CamelCase using InitCap Function in Hive 0.13

2015-04-01 Thread Alexander Pivovarov
Vivek,

You can see the version in two places
1.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-StringFunctions

string | initcap(string A) | Returns string, with the first letter of
each word in uppercase, all other letters in lowercase. Words are delimited
by whitespace. (As of Hive 1.1.0.)  If you click on 1.1.0 it will open the
HIVE-3405 JIRA.

2.  if you open https://issues.apache.org/jira/browse/HIVE-3405
it also says Fix Version/s: 1.1.0


HIVE-3405 code commit is here
https://github.com/apache/hive/commit/effa734360129524ac7628759cb4a8e8b5de05de

On Wed, Apr 1, 2015 at 10:38 AM, vivek veeramani 
vivek.veeraman...@gmail.com wrote:

 Thank you Sanjiv. I've reached out to our infrastructure vendor to check
 what build we're on currently and if we can get the 1.1.0 version. But I was
 just curious to know: is there a way we can see the build version?

 Thanks,
 Vivek

 On Wed, Apr 1, 2015 at 7:14 PM, @Sanjiv Singh sanjiv.is...@gmail.com
 wrote:


 Available in build   1.1.0
 https://issues.apache.org/jira/browse/HIVE/fixforversion/12329363

 JIRA :  https://issues.apache.org/jira/browse/HIVE-3405



 Regards
 Sanjiv Singh
 Mob :  +091 9990-447-339


 On Wed, Apr 1, 2015 at 6:24 PM, vivek veeramani 
 vivek.veeraman...@gmail.com wrote:

 Hi,

 I'm a relatively new user of Hive and was trying to format a column of
 String datatype from uppercase to camel-case. I could see the INITCAP()
 function in the language manual, and also could find related notes on JIRA
 stating it is available. But for some reason when I run my query it shows
 an invalid-function error, and show functions does not list the
 function. I'm running Hive 0.13. Is there any setting I need to enable
 for this?

 Please advise what needs to be done in order to use this function.

 --
 Thanks in advance,
 Vivek Veeramani


 cell : +91-9632 975 975
 +91-9895 277 101





 --
 Thanks ,
 Vivek Veeramani


 cell : +91-9632 975 975
 +91-9895 277 101



Re: [ANNOUNCE] New Hive Committers - Jimmy Xiang, Matt McCline, and Sergio Pena

2015-03-23 Thread Alexander Pivovarov
Congrats to Matt, Jimmy and Sergio!

On Mon, Mar 23, 2015 at 11:30 AM, Chaoyu Tang ct...@cloudera.com wrote:

 Congratulations to Jimmy and Sergio!

 On Mon, Mar 23, 2015 at 2:08 PM, Carl Steinbach c...@apache.org wrote:

 The Apache Hive PMC has voted to make Jimmy Xiang, Matt McCline, and
 Sergio Pena committers on the Apache Hive Project.

 Please join me in congratulating Jimmy, Matt, and Sergio.

 Thanks.

 - Carl





Re: sorting in hive -- general

2015-03-07 Thread Alexander Pivovarov
A sort by query produces multiple independent files.

order by - just one file.

Usually sort by is used with distribute by.
In older hive versions (0.7) they might be used to implement a local sort
within a partition,
similar to RANK() OVER (PARTITION BY A ORDER BY B).
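
For example (the table and columns here are hypothetical):

-- one globally sorted result
select * from sales order by amount;

-- rows routed to reducers by region; each reducer writes its own sorted file
select * from sales distribute by region sort by region, amount;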


On Sat, Mar 7, 2015 at 3:02 PM, max scalf oracle.bl...@gmail.com wrote:

 Hello all,

 I am a new to hadoop and hive in general and i am reading hadoop the
 definitive guide by Tom White and on page 504 for the hive chapter, Tom
 says below with regards to soritng

 *Sorting and Aggregating*
 *Sorting data in Hive can be achieved by using a standard ORDER BY clause.
 ORDER BY performs a parallel total sort of the input (like that described
 in “Total Sort” on page 261). When a globally sorted result is not
 required—and in many cases it isn’t—you can use Hive’s nonstandard
 extension, SORT BY, instead. SORT BY produces a sorted file per reducer.*


 My question is, what exactly does he mean by "globally sorted result"? If
 the sort by operation produces a sorted file per reducer, does that mean
 at the end of the sort all the reducer outputs are put back together to give the
 correct result?






Re: Create custom UDF

2015-03-05 Thread Alexander Pivovarov
Several useful common udf methods we added to GenericUDF recently
https://issues.apache.org/jira/browse/HIVE-9744


you can look at the following UDFs as an example:
https://github.com/apache/hive/blob/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFLevenshtein.java

https://github.com/apache/hive/blob/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFAddMonths.java

https://github.com/apache/hive/blob/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFInitCap.java


On Tue, Mar 3, 2015 at 2:43 PM, Buntu Dev buntu...@gmail.com wrote:

 I couldn't find any official documentation on how to create a UDF and mvn
 dependencies for building the project except for this page:

  https://cwiki.apache.org/confluence/display/Hive/HivePlugins

 Can anyone help me with what's needed to construct the pom?


 Thanks!



Re: HS2 standalone JDBC jar not standalone

2015-03-02 Thread Alexander Pivovarov
yes, we even have a ticket for that
https://issues.apache.org/jira/browse/HIVE-9600

btw can anyone test jdbc driver with kerberos enabled?
https://issues.apache.org/jira/browse/HIVE-9599


On Mon, Mar 2, 2015 at 10:01 AM, Nick Dimiduk ndimi...@gmail.com wrote:

 Heya,

 I'd like to use jmeter against HS2/JDBC and I'm finding the standalone
 jar isn't actually standalone. It appears to include a number of
 dependencies but not the Hadoop Common stuff. Is there a packaging of this jar
 that is actually standalone? Are there instructions for using this
 standalone jar as it is?

 Thanks,
 Nick

 jmeter.JMeter: Uncaught exception:  java.lang.NoClassDefFoundError:
 org/apache/hadoop/conf/Configuration
 at
 org.apache.hive.jdbc.HiveConnection.createBinaryTransport(HiveConnection.java:394)
 at
 org.apache.hive.jdbc.HiveConnection.openTransport(HiveConnection.java:188)
 at
 org.apache.hive.jdbc.HiveConnection.init(HiveConnection.java:164)
 at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:105)
 at java.sql.DriverManager.getConnection(DriverManager.java:571)
 at java.sql.DriverManager.getConnection(DriverManager.java:233)
 at
 org.apache.avalon.excalibur.datasource.JdbcConnectionFactory.init(JdbcConnectionFactory.java:138)
 at
 org.apache.avalon.excalibur.datasource.ResourceLimitingJdbcDataSource.configure(ResourceLimitingJdbcDataSource.java:311)
 at
 org.apache.jmeter.protocol.jdbc.config.DataSourceElement.initPool(DataSourceElement.java:235)
 at
 org.apache.jmeter.protocol.jdbc.config.DataSourceElement.testStarted(DataSourceElement.java:108)
 at
 org.apache.jmeter.engine.StandardJMeterEngine.notifyTestListenersOfStart(StandardJMeterEngine.java:214)
 at
 org.apache.jmeter.engine.StandardJMeterEngine.run(StandardJMeterEngine.java:336)
 at java.lang.Thread.run(Thread.java:745)
 Caused by: java.lang.ClassNotFoundException:
 org.apache.hadoop.conf.Configuration
 at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
 ... 13 more



Re: how to access array type?

2015-03-02 Thread Alexander Pivovarov
hive create table test1 (c1 array<int>) row format delimited collection
items terminated by ',';
OK

hive insert into test1 select array(1,2,3) from dual;
OK

hive select * from test1;
OK
[1,2,3]

hive select c1[0] from test1;
OK
1

$ hadoop fs -cat /apps/hive/warehouse/test1/00_0
1,2,3



On Sun, Mar 1, 2015 at 11:53 PM, Jie Zhang jiezh2...@gmail.com wrote:

 Hi,

 I am trying to use a hive complex data type on hive 0.14.0. However, I could
 not access the array type as the manual indicated. I have an array column, but
 hit a SemanticException when accessing an individual item in the array. Any
 clue? Did I use the wrong syntax or miss some property setting? Thanks!

 hive create table test1 (c1 array<int>) row format delimited collection
 items terminated by ',';

 OK

 Time taken: 0.092 seconds

 hive select * from test1;

 OK

 [1,2,3]

 Time taken: 0.065 seconds, Fetched: 1 row(s)

 hive select c1[0] from test1;

 FAILED: SemanticException [Error 10004]: Line 1:7 Invalid table alias or
 column reference 'c1[0]': (possible column names are: c1)

 Jessica



get max partition column value

2015-02-25 Thread Alexander Pivovarov
Hi Everyone

Let's say I have a table partitioned by period (string).

How do I select the max period?

If I run
select max(period) from invoice;

hive 0.13.1 runs an MR job, which is slow.

OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
Tez
  Edges:
Reducer 2 - Map 1 (SIMPLE_EDGE)
  DagName: lcapp_20150225120606_0242d9fa-9ad5-40d4-b63c-87a1a8330482:1
  Vertices:
Map 1
Map Operator Tree:
TableScan
  alias: invoice
  Statistics: Num rows: 1261837038 Data size: 628971470848
Basic stats: COMPLETE Column stats: COMPLETE
  Select Operator
expressions: period (type: string)
outputColumnNames: period
Statistics: Num rows: 1261837038 Data size:
628971470848 Basic stats: COMPLETE Column stats: COMPLETE
Group By Operator
  aggregations: max(period)
  mode: hash
  outputColumnNames: _col0
  Statistics: Num rows: 1 Data size: 84 Basic stats:
COMPLETE Column stats: COMPLETE
  Reduce Output Operator
sort order:
Statistics: Num rows: 1 Data size: 84 Basic stats:
COMPLETE Column stats: COMPLETE
value expressions: _col0 (type: string)
Reducer 2
Reduce Operator Tree:
  Group By Operator
aggregations: max(VALUE._col0)
mode: mergepartial
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 84 Basic stats: COMPLETE
Column stats: COMPLETE
Select Operator
  expressions: _col0 (type: string)
  outputColumnNames: _col0
  Statistics: Num rows: 1 Data size: 84 Basic stats:
COMPLETE Column stats: COMPLETE
  File Output Operator
compressed: false
Statistics: Num rows: 1 Data size: 84 Basic stats:
COMPLETE Column stats: COMPLETE
table:
input format:
org.apache.hadoop.mapred.SequenceFileInputFormat
output format:
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde:
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
Fetch Operator
  limit: -1

Time taken: 1.492 seconds, Fetched: 52 row(s)
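
One alternative I'm considering, since period is a partition column, is to read the
partition list from the metastore and take the last value on the client side
(assuming the period values sort correctly as strings):

show partitions invoice;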


Re: [ANNOUNCE] New Hive PMC Member - Sergey Shelukhin

2015-02-25 Thread Alexander Pivovarov
Congrats!

On Wed, Feb 25, 2015 at 12:33 PM, Vaibhav Gumashta 
vgumas...@hortonworks.com wrote:

 Congrats Sergey!

 On 2/25/15, 9:06 AM, Vikram Dixit vik...@hortonworks.com wrote:

 Congrats Sergey!
 
 On 2/25/15, 8:43 AM, Carl Steinbach c...@apache.org wrote:
 
 I am pleased to announce that Sergey Shelukhin has been elected to the
 Hive
 Project Management Committee. Please join me in congratulating Sergey!
 
 Thanks.
 
 - Carl
 




Re: CSV file reading in hive

2015-02-13 Thread Alexander Pivovarov
hive csv serde is available for all hive versions

https://github.com/ogrodnek/csv-serde


DEFAULT_ESCAPE_CHARACTER  \
DEFAULT_QUOTE_CHARACTER   "
DEFAULT_SEPARATOR         ,


add jar path/to/csv-serde.jar;   (or put it to hive/hadoop/mr
classpath on all boxes on cluster)

-- you can use custom separator/quote/escape

create table my_table(a string, b string, ...)
 row format serde 'com.bizo.hive.serde.csv.CSVSerde'
 with serdeproperties (
   "separatorChar" = "\t",
   "quoteChar"     = "'",
   "escapeChar"    = "\\"
  )
 stored as textfile
;



On Thu, Feb 12, 2015 at 8:19 PM, Sreeman sreebalin...@gmail.com wrote:

  Hi All,

  How are all of you creating a hive/Impala table when the CSV file has some
 values with a COMMA in between? It is like:

 sree,12345,payment made,but it is not successful





 I know the opencsv serde is there but it is not available in versions lower than
 Hive 0.14.0.





Re: [ANNOUNCE] New Hive Committers -- Chao Sun, Chengxiang Li, and Rui Li

2015-02-09 Thread Alexander Pivovarov
Congrats!

On Mon, Feb 9, 2015 at 12:31 PM, Carl Steinbach c...@apache.org wrote:

 The Apache Hive PMC has voted to make Chao Sun, Chengxiang Li, and Rui Li
 committers on the Apache Hive Project.

 Please join me in congratulating Chao, Chengxiang, and Rui!

 Thanks.

 - Carl




Re: Hiveserver2 memory / thread leak v 0.13.1 (hdp-2.1.5)

2015-02-06 Thread Alexander Pivovarov
Does HIVE-7353 relate to the embedded MetaStore only?
We use MySQL for metadata.

HS2 hit an OOM yesterday.

The current HS2 instance uses just 77MB for now (39 live threads).

I found another issue that fixes HS2 leaks:
https://issues.apache.org/jira/browse/HIVE-9234

JVM Arguments:
-Xmx1024m
-Djava.net.preferIPv4Stack=true
-Dcom.sun.management.jmxremote
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.management.jmxremote.port=9010
-Dhadoop.log.dir=/var/log/hadoop/svc-hive
-Dhadoop.log.file=hadoop.log
-Dhadoop.home.dir=/usr/lib/hadoop
-Dhadoop.id.str=svc-hive
-Dhadoop.root.logger=INFO,console
-Djava.library.path=:/usr/lib/hadoop/lib/native/Linux-amd64-64:/usr/lib/hadoop/lib/native
-Dhadoop.policy.file=hadoop-policy.xml
-Djava.net.preferIPv4Stack=true
-Xmx1024m
-Xmx4096m
-Dhadoop.security.logger=INFO,NullAppender

On Fri, Feb 6, 2015 at 1:48 AM, Vaibhav Gumashta vgumas...@hortonworks.com
wrote:

  Alexander,

  Can you share the jmap histo (or even better, a heapdump)? What are the
 top consumers? What are the  heap+permgen sizes that the JVM is configured
 to use?

  FYI, we fixed this memory leak in Hive-14:
 https://issues.apache.org/jira/browse/HIVE-7353.

  Thanks,
 —Vaibhav

   From: Alexander Pivovarov apivova...@gmail.com
 Reply-To: user@hive.apache.org user@hive.apache.org
 Date: Wednesday, February 4, 2015 at 6:03 PM
 To: user@hive.apache.org user@hive.apache.org
 Subject: Hiveserver2 memory / thread leak v 0.13.1 (hdp-2.1.5)

 I'm using hive-0.13.1 (hdp-2.1.5)

 Hiveserver2 is running for 6 days

  nobody is connected to the server now.
 last connection was 30 min ago.

 Used memory: 688MB   (after GC)
  Live threads: 297
  perm: 99.6% used

 Two days ago numbers were
 Used memory: 259MB   (after GC)
 Live threads: 108
  perm: 98% used

  Looks like a memory/thread leak.

  Today and 2 days ago VisualVM screenshots are attached




Re: Re: How to query data by page in Hive?

2015-02-05 Thread Alexander Pivovarov
ROW_NUMBER doc
http://docs.oracle.com/cd/B28359_01/server.111/b28286/functions144.htm#SQLRF06100
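
A sketch of ROW_NUMBER-based paging against the u_data table from this thread
(page 2 with a page size of 10; ordering by userid is just an assumption):

select userid, movieid, rating, unixtime
from (
  select u.*, row_number() over (order by userid) as rn
  from u_data u
) t
where rn between 11 and 20;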

On Thu, Feb 5, 2015 at 4:48 PM, r7raul1...@163.com r7raul1...@163.com
wrote:

 *Table structure :*
  CREATE TABLE `u_data`(
 `userid` int,
 `movieid` int,
 `rating` int,
 `unixtime` string)
 ROW FORMAT DELIMITED
 FIELDS TERMINATED BY '\t'
 STORED AS INPUTFORMAT
 'org.apache.hadoop.mapred.TextInputFormat'
 OUTPUTFORMAT
 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
 LOCATION
 'hdfs://localhost:8020/user/hive/warehouse/u_data'
 TBLPROPERTIES (
 'COLUMN_STATS_ACCURATE'='true',
 'numFiles'='1',
 'numRows'='0',
 'rawDataSize'='0',
 'totalSize'='1979173',
 'transient_lastDdlTime'='1421076916')

 *columns :*
movieid

 --
 r7raul1...@163.com


 *From:* Devopam Mittra devo...@gmail.com
 *Date:* 2015-02-05 18:48
 *To:* user@hive.apache.org
 *Subject:* Re: Re: How to query data by page in Hive?
 Please provide a valid table structure and the columns you wish to pick
 and I shall email you the query directly


 regards
 Devopam

 On Thu, Feb 5, 2015 at 3:20 PM, r7raul1...@163.com r7raul1...@163.com
 wrote:

 Thank you Devopam! Could you show me a  example?

 --
 r7raul1...@163.com


 *From:* Devopam Mittra devo...@gmail.com
 *Date:* 2015-02-05 18:05
 *To:* user@hive.apache.org
 *Subject:* Re: How to query data by page in Hive?
 You may want to use a ROW_NUMBER OR RANK / DENSE RANK in the inner query
 and then select only a subset of it in the outer query to control
 pagination. Based on your need, you may want to order the records as well ..

 Alternatively you may want to use CTE(
 https://cwiki.apache.org/confluence/display/Hive/Common+Table+Expression)
 for selecting the data in one go and then use row number to select as in
 previous case.

 regards
 Devopam

 On Thu, Feb 5, 2015 at 1:31 PM, r7raul1...@163.com r7raul1...@163.com
 wrote:

 Hello,
  How to query data by page in Hive?

 hive select * from u_data a limit 1,2;
 FAILED: ParseException line 1:31 missing EOF at ',' near '1'

 --
 r7raul1...@163.com




 --
 Devopam Mittra
 Life and Relations are not binary




 --
 Devopam Mittra
 Life and Relations are not binary




Hive wiki write access

2015-02-02 Thread Alexander Pivovarov
Hi Everyone

Can I get write access to hive wiki?
I need to put descriptions for several UDFs added recently (init_cap,
add_months, last_day, greatest, least)

Confluence username: apivovarov

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF


Re: Hive wiki write access

2015-02-02 Thread Alexander Pivovarov
Thank you, Lefty!

On Mon, Feb 2, 2015 at 3:59 PM, Lefty Leverenz leftylever...@gmail.com
wrote:

 Done.  Welcome to the Hive wiki team, Alexander!

 -- Lefty

 On Mon, Feb 2, 2015 at 2:14 PM, Alexander Pivovarov apivova...@gmail.com
 wrote:

 Hi Everyone

 Can I get write access to hive wiki?
 I need to put descriptions for several UDFs added recently (init_cap,
 add_months, last_day, greatest, least)

 Confluence username: apivovarov

 https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF





Re: Which [open-souce] SQL engine atop Hadoop?

2015-02-02 Thread Alexander Pivovarov
I like the Tez engine for hive (aka the Stinger initiative):

- faster than the MR engine, especially for complex queries with lots of nested
sub-queries
- stable
- min latency is 5-7 sec (0 sec for select count(*) ...)
- capable of processing huge datasets (not limited by RAM like Spark)


On Mon, Feb 2, 2015 at 6:00 PM, Samuel Marks samuelma...@gmail.com wrote:

 Maybe you're right, and what I should be doing is throwing in connectors
 so that data from regular databases is pushed into HDFS at regular
 intervals, wherein my fancier analytics can be run across larger
 data-sets.

 However, I don't want to decide straightaway, for example, Phoenix + Spark
 may be just the combination I am looking for.

 Best,


 Samuel Marks
 http://linkedin.com/in/samuelmarks

 On Mon, Feb 2, 2015 at 5:14 PM, Jörn Franke jornfra...@gmail.com wrote:

 Hello,

 I think you have to think first about your functional and non-functional
 requirements. You can scale normal SQL databases as well (cf CERN or
 Facebook). There are different types of databases for different purposes -
 there is no one-size-fits-all. At the moment, we are a few years away from a
 one-size-fits-all database that leverages AI etc. to automatically scale and
 optimize processing, storage and network. Until then you will have to
 do the math depending on your requirements.
 Once you make them more precise, we will be able to help you more.

 Cheers
 On 2 Feb 2015 06:08, Samuel Marks samuelma...@gmail.com wrote:

 Well, what I am seeking is a Big Data database that can also work with Small
 Data, i.e. scalable from one node to vast clusters, whilst
 maintaining relatively low latency throughout.

 Which fit into this category?

 Samuel Marks
 http://linkedin.com/in/samuelmarks





Re: Which [open-source] SQL engine atop Hadoop?

2015-02-02 Thread Alexander Pivovarov
Apache Phoenix is super fast for queries which filter data by table key
(see the sketch after this list):
- sub-second latency
- has a good JDBC driver

but it has limitations:
- no full outer join support
- inner and left outer joins are performed in a single server's memory, so it
cannot join a huge table to another huge table
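
A minimal sketch of the kind of query Phoenix answers with sub-second
latency, assuming a hypothetical events table whose row key leads with
customer_id:

-- fast: the leading part of the row key is fully specified
SELECT event_time, event_type
FROM events
WHERE customer_id = 12345
ORDER BY event_time DESC
LIMIT 100;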


On Mon, Feb 2, 2015 at 6:59 PM, Alexander Pivovarov apivova...@gmail.com
wrote:

 I like the Tez engine for Hive (aka the Stinger initiative):

 - faster than the MR engine, especially for complex queries with lots of
 nested sub-queries
 - stable
 - minimum latency is 5-7 sec (0 sec for select count(*) ...)
 - capable of processing huge datasets (not limited by RAM the way Spark is)


 On Mon, Feb 2, 2015 at 6:00 PM, Samuel Marks samuelma...@gmail.com
 wrote:

 Maybe you're right, and what I should be doing is throwing in connectors
 so that data from regular databases is pushed into HDFS at regular
 intervals, wherein my fancier analytics can be run across larger
 data-sets.

 However, I don't want to decide straightaway, for example, Phoenix +
 Spark may be just the combination I am looking for.

 Best,


 Samuel Marks
 http://linkedin.com/in/samuelmarks

 On Mon, Feb 2, 2015 at 5:14 PM, Jörn Franke jornfra...@gmail.com wrote:

 Hello,

 I think you have to think first about your functional and non-functional
 requirements. You can scale normal SQL databases as well (cf. CERN or
 Facebook). There are different types of databases for different purposes -
 there is no one-size-fits-all. At the moment, we are a few years away from a
 one-size-fits-all database that leverages AI etc. to automatically scale and
 optimize processing, storage and network. Until then you will have to
 do the math depending on your requirements.
 Once you make them more precise, we will be able to help you more.

 Cheers
 On 2 Feb 2015 06:08, Samuel Marks samuelma...@gmail.com wrote:

 Well, what I am seeking is a Big Data database that can also work with Small
 Data, i.e. scalable from one node to vast clusters, whilst
 maintaining relatively low latency throughout.

 Which fit into this category?

 Samuel Marks
 http://linkedin.com/in/samuelmarks






Re: Sequence file compression in Hive

2013-06-10 Thread Alexander Pivovarov
Sachin, it works

SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
SET mapred.output.compression.type=BLOCK;
create table data1_seq STORED AS SEQUENCEFILE as select * from date1;


hadoop fs -cat /user/hive/warehouse/data1_seq/00_0 | less
SEQ^Forg.apache.hadoop.io.BytesWritable^Yorg.apache.hadoop.io.Text^A^A'org.apache.hadoop.io.compress.GzipCodec^


 On Mon, Jun 10, 2013 at 11:15 AM, Stephen Sprague sprag...@gmail.com wrote:


 On Mon, Jun 10, 2013 at 12:48 AM, Sachin Sudarshana 
 sachin.had...@gmail.com wrote:

 What does the header of the first sequence file look like?

 $ dfs -cat
 /user/hive/warehouse/facts_520.db/test3facts520_gzip_seq/00_0 | head





Re: Need rank(), can't build m6d's version

2013-04-01 Thread Alexander Pivovarov
http://ragrawal.wordpress.com/2011/11/18/extract-top-n-records-in-each-group-in-hadoophive/
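
Roughly, the pattern described there looks something like the sketch below;
the jar path, the p_rank function and its class, and the scores table are all
hypothetical placeholders:

ADD JAR /path/to/rank-udf.jar;
CREATE TEMPORARY FUNCTION p_rank AS 'com.example.hive.udf.Rank';

SELECT user_id, item, score
FROM (
  SELECT user_id, item, score, p_rank(user_id) AS r
  FROM (
    SELECT user_id, item, score
    FROM scores
    DISTRIBUTE BY user_id
    SORT BY user_id, score DESC
  ) sorted
) ranked
WHERE r <= 3;   -- keep the top 3 rows per user_id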


On Mon, Apr 1, 2013 at 3:45 PM, Keith Wiley kwi...@keithwiley.com wrote:

 I need rank() in Hive.  I haven't had much luck with Edward Capriolo's version
 on git, and it comes with no documentation.  It depends on hive-test (also by
 Edward) and I can't get maven to build the tests for hive-test, which
 prevents me from building the hive-test package (the jar)...which is all
 pretty tangential to the real goal, which is rank of course.

 Is there any place to download the hive-rank jar pre-built instead of
 having to build it out of git?  If not, has anyone built it recently, or
 namely has anyone built the hive-test project it depends on?  The following
 is just one example of the sort of error I see.  Remember this error is on
 hive-test, which frankly I don't even care about.  :-)

 Feeling a little desperate...thanks for any help.


 ---
 Test set: com.jointhegrid.hive_test.EmbeddedHiveExampleTest

 ---
 Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 8.878 sec
  <<< FAILURE!
 testA(com.jointhegrid.hive_test.EmbeddedHiveExampleTest)  Time elapsed:
 7.823 sec  <<< FAILURE!
 java.lang.AssertionError: expected:<0> but was:<9>
 at org.junit.Assert.fail(Assert.java:91)
 at org.junit.Assert.failNotEquals(Assert.java:645)
 at org.junit.Assert.assertEquals(Assert.java:126)
 at org.junit.Assert.assertEquals(Assert.java:470)
 at org.junit.Assert.assertEquals(Assert.java:454)
 at
 com.jointhegrid.hive_test.EmbeddedHiveExampleTest.testA(EmbeddedHiveExampleTest.java:51)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at junit.framework.TestCase.runTest(TestCase.java:168)
 at junit.framework.TestCase.runBare(TestCase.java:134)
 at junit.framework.TestResult$1.protect(TestResult.java:110)
 at junit.framework.TestResult.runProtected(TestResult.java:128)
 at junit.framework.TestResult.run(TestResult.java:113)
 at junit.framework.TestCase.run(TestCase.java:124)
 at junit.framework.TestSuite.runTest(TestSuite.java:243)
 at junit.framework.TestSuite.run(TestSuite.java:238)
 at
 org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:83)
 at
 org.apache.maven.surefire.junit4.JUnit4TestSet.execute(JUnit4TestSet.java:35)
 at
 org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:115)
 at
 org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:97)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at
 org.apache.maven.surefire.booter.ProviderFactory$ClassLoaderProxy.invoke(ProviderFactory.java:103)
 at $Proxy0.invoke(Unknown Source)
 at
 org.apache.maven.surefire.booter.SurefireStarter.invokeProvider(SurefireStarter.java:150)
 at
 org.apache.maven.surefire.booter.SurefireStarter.runSuitesInProcess(SurefireStarter.java:91)
 at
 org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:69)


 
 Keith Wiley kwi...@keithwiley.com keithwiley.com
 music.keithwiley.com

 I do not feel obliged to believe that the same God who has endowed us with
 sense, reason, and intellect has intended us to forgo their use.
--  Galileo Galilei

 




Re: Function definition in hive

2013-02-22 Thread Alexander Pivovarov
https://cwiki.apache.org/Hive/hiveplugins.html

Creating Custom UDFs

First, you need to create a new class that extends UDF, with one or more
methods named evaluate.

package com.example.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public final class Lower extends UDF {
  public Text evaluate(final Text s) {
    if (s == null) { return null; }
    return new Text(s.toString().toLowerCase());
  }
}

(Note that there's already a built-in function for this, it's just an easy
example).

After compiling your code to a jar, you need to add this to the hive
classpath. See the section below on deploying jars.

Once hive is started up with your jars in the classpath, the final step is
to register your function:

create temporary function my_lower as 'com.example.hive.udf.Lower';

Now you can start using it:

hive> select my_lower(title), sum(freq) from titles group by my_lower(title);

...

Ended Job = job_200906231019_0006
OK
cmo 13.0
vp  7.0

For a more involved example, see this page:
https://cwiki.apache.org/Hive/genericudafcasestudy.html
Deploying jars for User Defined Functions and User Defined SerDes

In order to start using your UDF, you first need to add the code to the
classpath:


hive> add jar my_jar.jar;
Added my_jar.jar to class path

By default, it will look in the current directory. You can also specify a
full path:

hive> add jar /tmp/my_jar.jar;
Added /tmp/my_jar.jar to class path

Your jar will then be on the classpath for all jobs initiated from that
session. To see which jars have been added to the classpath you can use:

hive> list jars;
my_jar.jar



On Fri, Feb 22, 2013 at 5:09 PM, Yu PENG loneknigh...@gmail.com wrote:

 Hi,

 I am a newbie for Hive. I was assigned a job to do a lot of queries with
 similar structure. I just want to ask if there is a way to define functions
 like other programming languages so that I can call the function with
 different parameters. This will be really helpful. Does anyone know that?

 Thanks,
 Yu Peng





Re: Join not working in HIVE

2012-12-17 Thread Alexander Pivovarov
Hive supports only equi-joins.

I recommend reading the Hive manual before using it (e.g.
http://hive.apache.org/docs/r0.9.0/language_manual/joins.html or
https://cwiki.apache.org/Hive/languagemanual-joins.html).
Its very first sentence says: Only equality joins, outer joins, and left
semi joins are supported in Hive.
Hive has certain limitations; it also supports map-side joins and left semi
joins, and it is worth knowing what those are.
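
For instance, a minimal sketch with hypothetical orders and customers tables:

-- equality join: supported
SELECT o.id, c.name
FROM orders o
JOIN customers c ON (o.customer_id = c.customer_id);

-- a range (theta) condition in the ON clause is rejected by Hive:
-- SELECT ... FROM a JOIN b ON (a.lft <= b.lft AND a.rgt >= b.rgt);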





On Mon, Dec 17, 2012 at 6:18 AM, Philip Tromans
 philip.j.trom...@gmail.com wrote:

 Hive doesn't support theta joins. Your best bet is to do a full cross join
 between the tables, and put your range conditions into the WHERE clause.
 This may or may not work, depending on the respective sizes of your tables.

 The fundamental problem is that parallelising a theta (or range) join via
 Map-Reduce is not trivial, and Hive has no support for it.
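
 A minimal sketch of that rewrite against the product table quoted below
 (the CROSS JOIN keyword needs Hive 0.10 or later):

 SELECT count(a1.id) AS lvl,
        a2.id, a2.code, a2.short_name, a2.lft, a2.rgt
 FROM product a1
 CROSS JOIN product a2
 WHERE a1.lft <= a2.lft
   AND a1.rgt >= a2.rgt
 GROUP BY a2.id, a2.code, a2.short_name, a2.lft, a2.rgt;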

 Cheers,

 Phil.


 On 17 December 2012 13:55, Nitin Pawar nitinpawar...@gmail.com wrote:

 can you explain your needs? may be there is another alternate way
 a query is not of much help




 On Mon, Dec 17, 2012 at 7:17 PM, Ramasubramanian Narayanan 
 ramasubramanian.naraya...@gmail.com wrote:

 Hi,

 We are trying to build a tree structure in a table, hence we have the
 left and right limits...
 We can't use a WHERE clause for that..

 regards,
 Rams

 On Mon, Dec 17, 2012 at 6:53 PM, Nitin Pawar nitinpawar...@gmail.com wrote:

 hive is not mysql  :)


 On Mon, Dec 17, 2012 at 6:50 PM, Ramasubramanian Narayanan 
 ramasubramanian.naraya...@gmail.com wrote:

 Hi,

 But it is working fine in MySql...

 mysql> select count(A1.id) as LVL, A2.id, A2.code, A2.short_name,
 A2.lft, A2.rgt from product A1 join product A2 on (A1.lft <= A2.lft and
 A1.rgt >= A2.rgt) group by A2.id, A2.code, A2.short_name, A2.lft, A2.rgt;
 +-----+------+------+----------------------+------+------+
 | LVL | id   | code | short_name           | lft  | rgt  |
 +-----+------+------+----------------------+------+------+
 |   1 |    1 | 4    | Treasury Service     |    1 |  100 |
 |   2 |    2 | 2    | Root                 |    2 | 1000 |
 |   2 |    3 | Z    | CKC                  | 1001 | 2000 |
 |   2 |    4 | A    | Treasury Service     | 2001 | 3000 |
 |   3 |    5 | OOAQ | CODE CASH MANAGEMENT |    3 |  100 |
 |   3 |    6 | YP00 | JPMC Treasury        |  101 |  200 |
 |   3 |    7 | 432  | Treasury Service     | 1002 | 1100 |
 +-----+------+------+----------------------+------+------+


 regards,
 Rams

 On Mon, Dec 17, 2012 at 6:33 PM, Nitin Pawar 
 nitinpawar...@gmail.com wrote:

 select count(A1.id) as LVL, A2.id, A2.code, A2.short_name, A2.lft, A2.rgt
 from vprd A1 join vprd A2 on (A1.lft = A2.lft and A1.rgt = A2.rgt)
 where A2.rgt=1 and A2.lft=1
 group by A2.id, A2.code, A2.short_name, A2.lft, A2.rgt





 --
 Nitin Pawar





 --
 Nitin Pawar





Re: best way to load millions of gzip files in hdfs to one table in hive?

2012-10-02 Thread Alexander Pivovarov
Options:
1. create a table and put the files under the table dir

2. create an external table and point it at the files' dir (sketched below)

3. if the files are small then I recommend creating a new set of files with a
simple MR program, specifying the number of reduce tasks. The goal is to make
each file's size >= the HDFS block size (it saves NameNode memory and reads
will be faster)
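
A minimal sketch of option 2, assuming the gzip files are tab-delimited text
under /data/mylogs and have the columns below (all names are hypothetical):

CREATE EXTERNAL TABLE mylogs (
  ts      STRING,
  user_id BIGINT,
  msg     STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/mylogs';
-- Hive reads .gz text files transparently, but each gzip file is processed
-- by a single mapper because gzip is not splittable.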


On Tue, Oct 2, 2012 at 3:53 PM, zuohua zhang zuo...@gmail.com wrote:

 I have millions of gzip files in hdfs (with the same fields), would like
 to load them into one table in hive with a specified schema.
 What is the most efficient way to do that?
 Given that my data is only in hdfs, and also gzipped, does that mean I
 could just simply set up the table somehow bypassing some unnecessary
 overhead of the typical approach?

 Thanks!