Re: Which [open-source] SQL engine atop Hadoop?

2015-02-01 Thread Jörn Franke
Hello, I think you have to think first about your functional and non-functional requirements. You can scale normal SQL databases as well (cf. CERN or Facebook). There are different types of databases for different purposes - there is no one-size-fits-all. At the moment, we are a few years away from

Re: HIVE:1.2, Query taking huge time

2015-08-20 Thread Jörn Franke
Additionally, although it is a PoC, you should have a realistic data model and follow good data-modeling practices. Joining on a double is not one of them; it should be an int. Furthermore, double is a type that is rarely used in most scenarios. In the

Re: Hive on Tez much slower than MR

2015-08-06 Thread Jörn Franke
Always use the newest version of Hive. You should use ORC or Parquet wherever possible. If you use ORC then you should explicitly enable storage indexes and insert your table sorted (e.g. for the query below you would sort on x). Additionally you should enable statistics. Compression may bring
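A minimal sketch of that advice, assuming a hypothetical staging table and a query filtering on a column x (all names are illustrative, not from the thread):

  CREATE TABLE t_orc (x INT, y STRING)
  STORED AS ORC
  TBLPROPERTIES ('orc.create.index'='true', 'orc.bloom.filter.columns'='x');
  -- insert sorted on the filter column so the ORC min/max indexes stay selective
  INSERT OVERWRITE TABLE t_orc SELECT x, y FROM t_staging SORT BY x;
  -- gather statistics for the optimizer
  ANALYZE TABLE t_orc COMPUTE STATISTICS;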

Re: Persistent (and possibly asynchronous) Hive access from within Scala

2015-08-07 Thread Jörn Franke
I have no problems using JDBC for HiveServer2. I think you need the hive*jdbc*standalone.jar and I think hadoop-commons*.jar On Fri, 7 Aug 2015 at 5:23, Stephen Bly stephene...@gmail.com wrote: What library should I use if I want to make persistent connections from within Scala/Java? I’m

Re: Error starting the Hive Shell

2015-08-13 Thread Jörn Franke
Maybe there is another, older log4j library in the classpath? On Fri, 14 Aug 2015 at 5:34, Praveen Sripati praveensrip...@gmail.com wrote: Hi, I installed Java 1.8.0_51, Hadoop 1.2.1 and Hive 1.2.1 on Ubuntu 14.04 64 bit, I do get the below exception when I start the hive shell or the

Re: clarification please

2015-10-29 Thread Jörn Franke
> On 29 Oct 2015, at 06:43, Ashok Kumar wrote: > > hi gurus, > > kindly clarify the following please > > Hive currently does not support indexes or indexes are not used in the query Not correct. See https://snippetessay.wordpress.com > The lowest granularity for

Re: Hive and HBase

2015-11-10 Thread Jörn Franke
Probably it is outdated. Hive can access HBase tables via external tables. The execution engine in Hive can be MR, Tez or Spark. HiveQL is nowadays very similar to SQL. In fact, Hortonworks plans to make it SQL:2011 analytics compatible. HBase can be accessed independently of Hive via SQL using

Re: Hive Insert taking a lot of time

2015-11-02 Thread Jörn Franke
What is the create table statement? You may want to insert everything into the orc table (sorted on x and/or y) and then apply the where statement in your queries on the orc table. > On 02 Nov 2015, at 13:36, Kashif Hussain wrote: > > Hi, > I am trying to insert data
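A hedged sketch of that two-step pattern, with made-up table and column names:

  -- load once, sorted on the columns later used for filtering
  INSERT OVERWRITE TABLE events_orc SELECT * FROM events_raw SORT BY x, y;
  -- then filter at query time on the ORC table
  SELECT * FROM events_orc WHERE x = 42;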

Re: Min-Max Index vs Bloom filter

2015-11-02 Thread Jörn Franke
Bloom filters only work for =, and min/max for <, > and =; however the latter only works for numeric values while the bloom filter works on nearly all types. Additionally the bloom filter is a probabilistic data structure. For both it makes sense that the data is sorted on the column which is most

Re: Hive alternatives?

2015-11-05 Thread Jörn Franke
First it depends on what you want to do exactly. Second, Hive > 1.2, Tez as an Execution Engine (I recommend >= 0.8) and Orc as storage format can be pretty quick depending on your use case. Additionally you may want to employ compression which is a performance boost once you understand how

Re: hive metastore update from 0.12 to 1.0

2015-11-03 Thread Jörn Franke
Probably you started the new Hive version before upgrading the schema. This means manual fixing. > On 03 Nov 2015, at 11:56, Sanjeev Verma wrote: > > Hi > > I am trying to update the metastore using schematool but getting error > > schematool -dbType derby

Re: Best way to load CSV file into Hive

2015-10-31 Thread Jörn Franke
You clearly need to escape those characters, as for any other tool. You may want to use Avro instead of CSV, XML or JSON etc. > On 30 Oct 2015, at 19:16, Vijaya Narayana Reddy Bhoomi Reddy > wrote: > > Hi, > > I have a CSV file which contains hundred
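If the data stays in CSV, the escaping can be delegated to the OpenCSVSerde; a rough sketch with assumed columns and quote/escape characters:

  CREATE EXTERNAL TABLE csv_raw (id STRING, payload STRING)
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
  WITH SERDEPROPERTIES ('separatorChar'=',', 'quoteChar'='"', 'escapeChar'='\\')
  LOCATION '/data/csv_raw';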

Re: Hive 1.2.1 installation troubleshooting - No known driver to handle "jdbc://hive2://:10000"

2015-10-08 Thread Jörn Franke
You could edit the beeline script and add the driver there to the classpath. On Thu, 8 Oct 2015 at 16:02, Timothy Garza wrote: > I’ve installed Hive 1.2.1 on Amazon Linux AMI release 2015.03, master-node > of Hadoop cluster. > > > > I can successfully access

Re: HiveMetaStoreClient

2015-08-26 Thread Jörn Franke
Why not use the HCatalog web service API? On Wed, 26 Aug 2015 at 18:44, Jerrick Hoang jerrickho...@gmail.com wrote: Ok, I'm super confused now. The hive metastore is a RDBMS database. I totally agree that I shouldn't access it directly via jdbc. So what about using this class

Re: HiveMetaStoreClient

2015-08-26 Thread Jörn Franke
What about using the HCatalog APIs? On Wed, 26 Aug 2015 at 8:27, Jerrick Hoang jerrickho...@gmail.com wrote: Hi all, I want to interact with HiveMetaStore table from code and was looking at

Re: Getting dot files for DAGs

2015-09-30 Thread Jörn Franke
Why not use the Tez UI? On Thu, 1 Oct 2015 at 2:29, James Pirz wrote: > I am using Tez 0.7.0 on Hadoop 2.6 to run Hive queries. > I am interested in checking DAGs for my queries visually, and I realized > that I can do that by graphviz once I can get "dot" files of my DAGs.

Re: HiveServer with LDAP

2015-09-19 Thread Jörn Franke
What do you mean by it is not working? You may also check the logs of your LDAP server... Maybe there is also a limit on the number of logins in your LDAP server... Maybe the account is temporarily blocked because you entered the password wrongly too many times... On Fri, 18 Sep 2015 at 10:34,

Re: Using spark in tandem with Hive

2015-12-01 Thread Jörn Franke

Re: how to get counts as a byproduct of a query

2015-12-02 Thread Jörn Franke
I am not sure if I understand, but why should this not be possible using SQL in Hive? > On 02 Dec 2015, at 21:26, Frank Luo wrote: > > Didn’t get any response, so trying one more time. I cannot believe I am the > only one facing the problem. > > From: Frank Luo >

Re: Handling LZO files

2015-12-04 Thread Jörn Franke
for analytics the ORC or parquet format. > On 03 Dec 2015, at 15:28, Jörn Franke <jornfra...@gmail.com> wrote: > > Your Hive version is too old. You may want to use also another execution > engine. I think your problem might then be related to external tables for > which

Re: Hive Support for Unicode languages

2015-12-04 Thread Jörn Franke
What operating system are you using? > On 04 Dec 2015, at 01:25, mahender bigdata > wrote: > > Hi Team, > > Does hive supports Hive Unicode like UTF-8,UTF-16 and UTF-32. I would like to > see different language supported in hive table. Is there any serde which

Re: Handling LZO files

2015-12-03 Thread Jörn Franke
r PARQUET, requires us to load 5 years of LZO data in ORC or > PARQUET format. Though it might be performance efficient, it increases data > redundancy. > But we will explore that option. > > Currently I want to understand when I am unable to scale up mappers. > > Thanks,

Re: Handling LZO files

2015-12-03 Thread Jörn Franke
How many nodes, cores and memory do you have? What Hive version? Do you have the opportunity to use Tez as an execution engine? Usually I use external tables only for reading them and inserting them into a table in ORC or Parquet format for doing analytics. This is much more performant than

Re: Using spark in tandem with Hive

2015-12-01 Thread Jörn Franke
How did you create the tables? Do you have automated statistics activated in Hive? Btw MR is outdated as a Hive execution engine. Use Tez (maybe wait for 0.8 for sub-second queries) or use Spark as an execution engine in Hive. > On 01 Dec 2015, at 17:40, Mich Talebzadeh

Re: Error

2015-12-16 Thread Jörn Franke
Do you have the create table statement? The sqoop command ? > On 17 Dec 2015, at 07:13, Trainee Bingo wrote: > > Hi All, > > I have a sqoop script which brings data from oracle and dumps it to HDFS. > Then that data is exposed to hive external table. But when I do : >

Re: Synchronizing Hive metastores across clusters

2015-12-17 Thread Jörn Franke
Hive has the EXPORT/IMPORT commands; alternatively Falcon + Oozie. > On 17 Dec 2015, at 17:21, Elliot West wrote: > > Hello, > > I'm thinking about the steps required to repeatedly push Hive datasets out > from a traditional Hadoop cluster into a parallel cloud based cluster.
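A minimal sketch of the EXPORT/IMPORT route (table name, partition and paths are placeholders):

  -- on the source cluster
  EXPORT TABLE sales PARTITION (ds='2015-12-01') TO '/staging/sales_export';
  -- after copying /staging/sales_export to the target cluster (e.g. with distcp)
  IMPORT TABLE sales FROM '/staging/sales_export';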

Re: Loading data from HDFS to hive and leading to many NULL value in hive table

2015-12-15 Thread Jörn Franke
You forgot to tell Hive that the file is comma-separated. You may want to use the CSV serde. > On 16 Dec 2015, at 07:15, zml张明磊 wrote: > > I am confusing about the following result. Why the hive table has so many > NULL value ? > > hive> select * from managers; > OK
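For example, the delimiter can be declared in the DDL; a sketch assuming made-up columns for the managers table mentioned in the thread:

  CREATE TABLE managers (id INT, name STRING, team STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  STORED AS TEXTFILE;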

Re: Indexes in Hive

2016-01-06 Thread Jörn Franke
I am not sure how much performance one could gain in comparison to ORC or Parquet. They work pretty well once you know how to use them. However, there are still ways to optimize them. For instance, sorting of data is a key factor for these formats to be efficient. Nevertheless, if you have a lot of

Re: Is Hive Index officially not recommended?

2016-01-05 Thread Jörn Franke
You can still use the MR execution engine for maintaining the index. Indeed, with the ORC or Parquet format there are min/max indexes and bloom filters, but you need to sort your data appropriately to get the performance benefit. Alternatively you can create redundant tables sorted in different orders.
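A hedged sketch of maintaining a classic Hive index on MR (table and column names are assumptions):

  SET hive.execution.engine=mr;  -- per the advice above, keep MR for index maintenance
  CREATE INDEX i_cust ON TABLE orders (customer_id) AS 'COMPACT' WITH DEFERRED REBUILD;
  ALTER INDEX i_cust ON orders REBUILD;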

Re: Is Hive Index officially not recommended?

2016-01-05 Thread Jörn Franke
Btw this is not Hive-specific, but also applies to other relational database systems, such as Oracle Exadata. > On 05 Jan 2016, at 20:57, Jörn Franke <jornfra...@gmail.com> wrote: > > You can still use execution Engine mr for maintaining the index. Indeed with > the ORC

Re: Indexes in Hive

2016-01-05 Thread Jörn Franke
If I understand you correctly this could be just another Hive storage format. > On 06 Jan 2016, at 07:24, Mich Talebzadeh wrote: > > Hi, > > Thinking loudly. > > Ideally we should consider a totally columnar storage offering in which each > column of table is stored as

Re: Impact of partitioning on certain queries

2016-01-08 Thread Jörn Franke

Re: Impact of partitioning on certain queries

2016-01-07 Thread Jörn Franke
This observation is correct and it is the same behavior as you see in other databases supporting partitions. Usually you should avoid many small partitions. > On 07 Jan 2016, at 23:53, Mich Talebzadeh wrote: > > Ok we hope that partitioning improves performance where
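As an illustration (not from the thread), a coarser partition key keeps the partition count manageable:

  -- one partition per month rather than per day or hour
  CREATE TABLE sales (id BIGINT, amount DECIMAL(10,2))
  PARTITIONED BY (year INT, month INT)
  STORED AS ORC;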

Re: Running the same query on 1 billion rows fact table in Hive on Spark compared to Sybase IQ columnar database

2015-12-31 Thread Jörn Franke
You are using an old version of Spark and it cannot leverage all optimizations of Hive, so I think that your conclusion cannot be drawn as easily as you might think. > On 31 Dec 2015, at 19:34, Mich Talebzadeh wrote: > > Ok guys. > > I have not succeeded in installing TEZ. Yet

Re: Running the same query on 1 billion rows fact table in Hive on Spark compared to Sybase IQ columnar database

2015-12-30 Thread Jörn Franke

Re: Running the same query on 1 billion rows fact table in Hive on Spark compared to Sybase IQ columnar database

2015-12-30 Thread Jörn Franke

Re: Running the same query on 1 billion rows fact table in Hive on Spark compared to Sybase IQ columnar database

2015-12-30 Thread Jörn Franke

Re: Impact of partitioning on certain queries

2016-01-08 Thread Jörn Franke

Re: The advantages of Hive/Hadoop comnpared to Data Warehouse

2015-12-18 Thread Jörn Franke
I think you should draw more attention to the fact that Hive is just one component in the ecosystem. You can have many more components, such as ELT, integrating unstructured data, machine learning, streaming data etc. However, usually analysts are not aware of the technologies and IT staff is not

Re: Executor getting killed when running Hive on Spark

2015-12-24 Thread Jörn Franke
Have you checked what the issue is with the log file causing trouble? Enough space available? Access rights (what is the user of the Spark worker?)? Does the directory exist? Can you provide more details on how the table is created? Does the query work with MR or Tez as an execution engine? Does a

Re: Running the same query on 1 billion rows fact table in Hive on Spark compared to Sybase IQ columnar database

2015-12-30 Thread Jörn Franke
Have you tried it with Hive on Tez? It contains (currently) more optimizations than Hive on Spark. I assume you use the latest Hive version. Additionally you may want to think about calculating statistics (depending on your configuration you need to trigger it) - I am not sure if Spark can use
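The statistics he mentions can be triggered manually, e.g. (a sketch assuming a table fact_sales with a ds partition):

  ANALYZE TABLE fact_sales PARTITION (ds='2015-12-30') COMPUTE STATISTICS;
  ANALYZE TABLE fact_sales PARTITION (ds='2015-12-30') COMPUTE STATISTICS FOR COLUMNS;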

Re: Building Rule Engine/ Rule Transformation

2015-11-29 Thread Jörn Franke
Why not implement Hive UDF in Java? > On 28 Nov 2015, at 21:26, Mahender Sarangam > wrote: > > Hi team, > > We need expert input to discuss how to implement Rule engine in hive. Do you > have any references available to implement rule in hive/pig. > > > We

Re: Hive on Spark - Hadoop 2 - Installation - Ubuntu

2015-11-20 Thread Jörn Franke
I recommend using a Hadoop distribution containing these technologies. I think you also get other useful tools for your scenario, such as auditing using Sentry or Ranger. > On 20 Nov 2015, at 10:48, Mich Talebzadeh wrote: > > Well > > “I'm planning to deploy Hive on

Re: Hive on Spark - Hadoop 2 - Installation - Ubuntu

2015-11-20 Thread Jörn Franke
>> On Fri, Nov 20, 2015 at 5:22 PM, Jörn Franke <jornfra...@gmail.com> wrote: >> I recommend to use a Hadoop distribution containing these technologies. I >> think you get also other useful tools for your scenario, such as Auditing >> using sentry or ranger. >

Re: insert query in hive

2016-06-08 Thread Jörn Franke
This is not the recommended way to load large data volumes into Hive. Check the external table feature, Sqoop, and the ORC/Parquet formats. > On 08 Jun 2016, at 14:03, raj hive wrote: > > Hi Friends, > > I have to insert the data into hive table from Java program. Insert
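A rough sketch of that route, with placeholder paths and columns: land the files in HDFS, expose them via an external table, then write them once into ORC:

  CREATE EXTERNAL TABLE raw_events (id BIGINT, payload STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  LOCATION '/landing/events';
  CREATE TABLE events STORED AS ORC AS SELECT * FROM raw_events;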

Re: Internode Encryption with HiveServer2

2016-06-03 Thread Jörn Franke
This can be configured on the Hadoop level. > On 03 Jun 2016, at 10:59, Nick Corbett wrote: > > Hi > > > I am deploying Hive in a regulated environment - all data needs to be > encrypted when transferred and at rest. > > > If I run a 'select' statement, using

Re: Convert date in string format to timestamp in table definition

2016-06-05 Thread Jörn Franke
Never use string when you can use int - the performance will be much better - especially for tables in Orc / parquet format > On 04 Jun 2016, at 22:31, Igor Kravzov wrote: > > Thanks Dudu. > So if I need actual date I will use view. > Regarding partition column: I
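For example, a day kept as an int (or as a partition column) compares faster than a string; a sketch with assumed names:

  -- store the day as an int such as 20160604 rather than as a string
  CREATE TABLE clicks (user_id BIGINT, click_day INT) STORED AS ORC;
  SELECT count(*) FROM clicks WHERE click_day BETWEEN 20160601 AND 20160630;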

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-31 Thread Jörn Franke
Thanks very interesting explanation. Looking forward to test it. > On 31 May 2016, at 07:51, Gopal Vijayaraghavan wrote: > > >> That being said all systems are evolving. Hive supports tez+llap which >> is basically the in-memory support. > > There is a big difference

Re: How to run large Hive queries in PySpark 1.2.1

2016-05-26 Thread Jörn Franke
Both have outdated versions; usually one can support you better if you upgrade to the newest. A firewall could be an issue here. > On 26 May 2016, at 10:11, Nikolay Voronchikhin > wrote: > > Hi PySpark users, > > We need to be able to run large Hive queries in PySpark

Re: Copying all Hive tables from Prod to UAT

2016-05-26 Thread Jörn Franke
Or use Falcon ... I would try to avoid the Spark JDBC route. JDBC is not designed for these big-data bulk operations, e.g. data has to be transferred uncompressed and there is the serialization/deserialization overhead: query result -> protocol -> Java objects -> writing to a specific storage format etc

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-30 Thread Jörn Franke
blem is that the TEZ user group is exceptionally >>> quiet. Just sent an email to Hive user group to see anyone has managed to >>> built a vendor independent version. >>> >>> >>> Dr Mich Talebzadeh >>> >>> LinkedIn >>> https

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-29 Thread Jörn Franke
divided on this (use Hive with TEZ) or use Impala instead of Hive > etc as I am sure you already know. > > Cheers, > > > > > Dr Mich Talebzadeh > > LinkedIn > https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw >

Re: Network throughput from HiveServer2 to JDBC client too low

2016-06-21 Thread Jörn Franke
> interface to Hive. Are you saying that the reference command line interface > is not efficiently implemented? :) > > -David Nies >> On 20 Jun 2016, at 17:46, Jörn Franke <jornfra...@gmail.com> wrote: >> >> Aside from this the low network performanc

Re: Hive indexes without improvement of performance

2016-06-16 Thread Jörn Franke
The indexes are based on the HDFS blocksize, which is usually around 128 MB. This means for hitting a single row you must always load the full block. In traditional databases the block size is much smaller, so this is much faster. If the optimizer does not pick up the index then you can query the index directly (it is

Re: Network throughput from HiveServer2 to JDBC client too low

2016-06-20 Thread Jörn Franke
Aside from this the low network performance could also stem from the Java application receiving the JDBC stream (not threaded / not efficiently implemented etc). However that being said, do not use jdbc for this. > On 20 Jun 2016, at 17:28, Jörn Franke <jornfra...@gmail.com> wrote: &

Re: Network throughput from HiveServer2 to JDBC client too low

2016-06-20 Thread Jörn Franke
Hello, For no database (including traditional ones) is it advisable to fetch this amount through JDBC. JDBC is not designed for this (neither for import nor for export of large data volumes). It is a highly questionable approach from a reliability point of view. Export it as a file to HDFS and
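The file-export alternative can look roughly like this (directory, table and filter are placeholders):

  INSERT OVERWRITE DIRECTORY '/tmp/export/result'
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  SELECT * FROM big_table WHERE ds = '2016-06-20';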

Re: if else condition in hive

2016-06-21 Thread Jörn Franke
I recommend rethinking it as part of a bulk transfer, potentially even using separate partitions. It will be much faster. > On 21 Jun 2016, at 13:22, raj hive wrote: > > Hi friends, > > INSERT,UPDATE,DELETE commands are working fine in my Hive environment after >

Re: Optimize Hive Query

2016-06-23 Thread Jörn Franke
The query looks a little bit too complex for what it is supposed to do. Can you reformulate it and restrict the data in a where clause (most restrictive condition first)? Another hint would be to use the ORC format (with indexes and optionally bloom filters) with Snappy compression, as well as sorting

Re: optimize joins in hive 1.2.1

2016-01-18 Thread Jörn Franke
Do you have some data model? Basically modern technologies, such as Hive, but also relational databases, suggest pre-joining tables and working on big flat tables. The reason is that they are distributed systems and you should avoid transferring a lot of data between nodes for each query.

Re: Optimizing external table structure

2016-02-13 Thread Jörn Franke
How many disk drives do you have per node? Generally one node should have 12 drives, not configured as RAID and not configured as LVM. Files could be a little bit larger (4 or better 40 GB - your NameNode will thank you) or use Hadoop Archive (HAR). I am not sure about the latest status of

Re: Hive on Spark Engine versus Spark using Hive metastore

2016-02-02 Thread Jörn Franke
Check HiveMall > On 03 Feb 2016, at 05:49, Koert Kuipers wrote: > > yeah but have you ever seen somewhat write a real analytical program in hive? > how? where are the basic abstractions to wrap up a large amount of operations > (joins, groupby's) into a single function

Re: Importing Oracle data into Hive

2016-01-31 Thread Jörn Franke
Well, you can create an empty Hive table in ORC format and use --hive-override in sqoop. Alternatively you can use --hive-import and set hive.default.format. I recommend defining the schema properly on the command line, because sqoop detection of formats is based on JDBC (Java) types, which is

Re: Hive 2 performance

2016-02-24 Thread Jörn Franke
h.talebza...@cloudtechnologypartners.co.uk> wrote: > > well I meant how fast it returns the results in this case compare to 1.2.1 etc > > thanks > >> On 24/02/2016 17:25, Jörn Franke wrote: >> >> I am not sure what you are looking for. Performance has many influence >> factors..

Re: Hive 2 performance

2016-02-24 Thread Jörn Franke
I am not sure what you are looking for. Performance has many influence factors... > On 24 Feb 2016, at 18:23, Mich Talebzadeh > wrote: > > Hi, > > > > Has anyone got some performance matrix for Hive 2 from user perspective? > > It looks very

Re: ORC files and statistics

2016-01-19 Thread Jörn Franke

Re: ORC files and statistics

2016-01-19 Thread Jörn Franke
Just be aware that you should insert the data sorted at least on the most discriminating column of your where clause > On 19 Jan 2016, at 17:27, Owen O'Malley wrote: > > It has both. Each index has statistics of min, max, count, and sum for each > column in the row group of

Re: Is it ok to build an entire ETL/ELT data flow using HIVE queries?

2016-02-15 Thread Jörn Franke
Why should it not be OK if you do not miss any functionality? You can use Oozie + Hive queries to have more sophisticated logging and scheduling. Do not forget to do proper capacity/queue management. > On 16 Feb 2016, at 07:19, Ramasubramanian > wrote: >

Re: Hive_CSV

2016-03-09 Thread Jörn Franke
The data is already in the CSV, so it does not matter for querying. It is recommended to convert it to ORC or Parquet for querying. > On 09 Mar 2016, at 19:09, Ajay Chander wrote: > > Daniel, thanks for your time. Is it like creating two tables, one is to get > all the data

Re: Hive_CSV

2016-03-09 Thread Jörn Franke
Why don't you load all data and use just two columns for querying? Alternatively use regular expressions. > On 09 Mar 2016, at 18:43, Ajay Chander wrote: > > Hi Everyone, > > I am looking for a way, to ignore the first occurrence of the delimiter while > loading the

Re: ODBC drivers for Hive 2

2016-03-10 Thread Jörn Franke
Just out of curiosity: what is the code base for the ODBC drivers by Hortonworks, Cloudera & co? Did they develop them on their own? If yes, maybe one should think about an open-source one, which is reliable and supports a richer set of ODBC functionality. Especially in the light of

Re: read-only mode for hive

2016-03-08 Thread Jörn Franke
What is the use case? You can try security solutions such as Ranger or Sentry. As already mentioned, another alternative could be a view. > On 08 Mar 2016, at 21:09, PG User wrote: > > Hi All, > I have one question about putting hive in read-only mode. > > What are the
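A small sketch of the view-plus-grant alternative, assuming SQL-standard-based authorization is enabled (table, view and user names are made up):

  CREATE VIEW sales_ro AS SELECT * FROM sales;
  GRANT SELECT ON TABLE sales_ro TO USER analyst;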

Re: Hive Metastore Bottleneck

2016-03-30 Thread Jörn Franke
Is the MySQL database virtualized? Bottlenecks to storage of the MySQL database? Network could be a bottleneck? Firewalls blocking new connections in case of a sudden connection increase? > On 30 Mar 2016, at 23:28, Udit Mehta wrote: > > Hi all, > > We are currently

Re: analyse command not working on decimal(38,0) datatype

2016-04-06 Thread Jörn Franke
Please provide the exact log messages, CREATE TABLE statements and INSERT statements. > On 06 Apr 2016, at 12:05, Ashim Sinha wrote: > > Hi Team > Need help for the issue > Steps followed > table created > Loaded the data of length 38 in decimal type > Analyse table - for columns

Re: De-identification_in Hive

2016-03-19 Thread Jörn Franke
What are your requirements? Do you need to omit a column? Transform it? Make the anonymized version joinable, etc.? There is not simply one function. > On 17 Mar 2016, at 14:58, Ajay Chander wrote: > > Hi Everyone, > > I have a csv.file which has some sensitive data in a

Re: The build-in indexes in ORC file does not work.

2016-03-19 Thread Jörn Franke
How much data are you querying? What is the query? How selective is it supposed to be? What is the block size? > On 16 Mar 2016, at 11:23, Joseph wrote: > > Hi all, > > I have known that ORC provides three level of indexes within each file, file > level, stripe level, and

Re: The build-in indexes in ORC file does not work.

2016-03-19 Thread Jörn Franke
unt(*) from gprs where terminal_type = 25080; > select * from gprs where terminal_type = 25080; > > In the gprs table, the "terminal_type" column's value is in [0, 25066] > > Joseph > > From: Jörn Franke > Date: 2016-03-16 19:26 > To: Joseph > CC: use

Re: Issue joining 21 HUGE Hive tables

2016-03-24 Thread Jörn Franke
Joining so many external tables is always an issue with any component. Your problem is not Hive-specific, but your data model seems to be messed up. First of all you should have them in an appropriate format, such as ORC or Parquet, and the tables should not be external. Then you should use the

Re: Hive on Spark engine

2016-03-26 Thread Jörn Franke
If you check the newest Hortonworks distribution then you will see that it generally works. Maybe you can borrow some of their packages. Alternatively it should also be available in other distributions. > On 26 Mar 2016, at 22:47, Mich Talebzadeh wrote: > > Hi, > > I am

Re: Hive and Impala

2016-03-02 Thread Jörn Franke
It always depends on what you want to do and thus from experience I cannot agree with your comment. Do you have any reasoning for this statement? > On 02 Mar 2016, at 19:14, Dayong wrote: > > Tez is kind of outdated and Orc is so dedicated on hive. In addition, hive >

Re: Hive and Impala

2016-03-02 Thread Jörn Franke
I think you can always make a benchmark that has this or that result. You always have to see what is evaluated, and generally I recommend always trying it yourself for your data and your queries. There is also a lot of change within the projects. Impala may have Kudu, but Hive has ORC, Tez and

Re: Sqoop_Sql_blob_types

2016-04-27 Thread Jörn Franke
You could try storing them as binary. Is it just for storing the blobs or for doing analysis on them? In the first case you may think about storing them as files in HDFS and including just a string containing the file name in Hive (to make analysis on the other data faster). In the latter case you should

Re: Container out of memory: ORC format with many dynamic partitions

2016-04-30 Thread Jörn Franke
I would still need some time to dig deeper in this. Are you using a specific distribution? Would it be possible to upgrade to a more recent Hive version? However, having so many small partitions is a bad practice which seriously affects performance. Each partition should at least contain

Re: Making sqoop import use Spark engine as opposed to MapReduce for Hive

2016-04-30 Thread Jörn Franke
I do not think you make it faster by setting the execution engine to Spark, especially with such an old Spark version. For simple things such as "dump" bulk imports and exports, it matters much less, if at all, which execution engine you use. There was recently a discussion on that on the

Analyzing Bitcoin blockchain data with Hive

2016-04-29 Thread Jörn Franke
Dear all, I prepared a small Serde to analyze Bitcoin blockchain data with Hive: https://snippetessay.wordpress.com/2016/04/28/hive-bitcoin-analytics-on-blockchain-data-with-sql/ There are some example queries, but I will add some in the future. Additionally, more unit tests will be added. Let

Re: Hive and XML

2016-05-22 Thread Jörn Franke
XML is generally slow in any software. It is not recommended for large data volumes. > On 22 May 2016, at 10:15, Maciek wrote: > > Have you had to load XML data into Hive? Did you run into any problems or > experienced any pain points, e.g. complex schemas or performance? >

Re: Performance for hive external to hbase with serval terabyte or more data

2016-05-11 Thread Jörn Franke
Why don't you export the data from HBase to Hive, e.g. in ORC format? You should not use MR with Hive, but Tez. Also use a recent Hive version (at least 1.2). You can then do queries there. For large log file processing in real time, one alternative depending on your needs could be Solr on
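A hedged sketch of that export path via the HBase storage handler (table names and the column mapping are assumptions):

  CREATE EXTERNAL TABLE hbase_logs (rowkey STRING, msg STRING)
  STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
  WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,d:msg')
  TBLPROPERTIES ('hbase.table.name' = 'logs');
  CREATE TABLE logs_orc STORED AS ORC AS SELECT * FROM hbase_logs;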

Re: Query Failing while querying on ORC Format

2016-05-17 Thread Jörn Franke
I do not remember exactly, but I think it worked simply by adding a new partition to the old table with the additional columns. > On 17 May 2016, at 15:00, Mich Talebzadeh wrote: > > Hi Mahendar, > > That version 1.2 is reasonable. > > One alternative is to create

Re: HIVE on Windows

2016-05-18 Thread Jörn Franke
Use a distribution, such as Hortonworks > On 18 May 2016, at 19:09, Me To wrote: > > Hello, > > I want to install hive on my windows machine but I am unable to find any > resource out there. I am trying to set up it from one month but unable to > accomplish that.

Re: Mappers spawning Hive queries

2016-04-16 Thread Jörn Franke
Just out of curiosity, what is the use case behind this? How do you call the shell script? > On 16 Apr 2016, at 00:24, Shirish Tatikonda > wrote: > > Hello, > > I am trying to run multiple hive queries in parallel by submitting them > through a map-reduce job.

Re: Moving Hive metastore to Solid State Disks

2016-04-17 Thread Jörn Franke
You could also explore the in-memory database of 12c. However, I am not sure how beneficial it is for OLTP scenarios. I am excited to see how the performance will be with HBase as a Hive metastore. Nevertheless, your results on Oracle/SSD will be beneficial for the community. > On 17 Apr 2016,

Re: Hive footprint

2016-04-20 Thread Jörn Franke
Hive has working indexes. However many people overlook that a block is usually much larger than in a relational database and thus do not use them right. > On 19 Apr 2016, at 09:31, Mich Talebzadeh wrote: > > The issue is that Hive has indexes (not index store) but

Re: Hive footprint

2016-04-20 Thread Jörn Franke
Depends really on what you want to do. Hive is more for queries involving a lot of data, whereas HBase + Phoenix is more for OLTP scenarios or sensor ingestion. I think the reason is that Hive has been the entry point for many engines and formats. Additionally there are a lot of tuning capabilities

Re: Hive external indexes incorporation into Hive CBO

2016-04-21 Thread Jörn Franke
I am still not sure why you think they are not used. The main issue is that the block size is usually very large (e.g. 256 MB compared to kilobytes or sometimes a few megabytes in traditional databases) and the indexes refer to blocks. This makes it less likely that you can leverage them for small

Re: Trouble trying to get started with hive

2016-07-11 Thread Jörn Franke
Please use a Hadoop distribution to avoid these configuration issues (in the beginning). > On 05 Jul 2016, at 12:06, Kari Pahula wrote: > > Hi. I'm trying to familiarize myself with Hadoop and various projects related > to it. > > I've been following >

Re: Doubt on Hive Partitioning.

2016-08-01 Thread Jörn Franke
It happens in old Hive versions if the filter is only in the where clause and NOT in the join clause. This should not happen in newer Hive versions. You can check it by executing an EXPLAIN DEPENDENCY query. > On 01 Aug 2016, at 11:07, Abhishek Dubey wrote: > > Hi All,
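For example (table names and filter are placeholders), the input_partitions listed in the output show whether pruning was applied:

  EXPLAIN DEPENDENCY
  SELECT count(*) FROM sales s JOIN dim_date d ON s.ds = d.ds
  WHERE d.ds = '2016-08-01';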

Re: Doubt on Hive Partitioning.

2016-08-02 Thread Jörn Franke
Aug 2, 2016 at 3:45 AM, Qiuzhuang Lian <qiuzhuang.l...@gmail.com> >> wrote: >> Is this partition pruning fixed in MR too except for TEZ in newer hive >> version? >> >> Regards, >> Q >> >>> On Mon, Aug 1, 2016 at 8:48 PM, Jörn Frank

Re: hive concurrency not working

2016-08-03 Thread Jörn Franke
You need to configure the YARN scheduler (fair or capacity, depending on your needs). > On 03 Aug 2016, at 15:14, Raj hadoop wrote: > > Dear All, > > In need or your help, > > we have horton works 4 node cluster,and the problem is hive is allowing only > one user at a

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-12 Thread Jörn Franke
I think the comparison with the Oracle RDBMS and Oracle TimesTen is not so good. There are times when the in-memory database of Oracle is slower than the RDBMS (especially in the case of Exadata) due to the issue that in-memory - as in Spark - means everything is in memory and everything is always

Re: Any way in hive to have functionality like SQL Server collation on Case sensitivity

2016-07-13 Thread Jörn Franke
You can use any Java function in Hive without (!) the need to wrap it in a UDF, via the reflect command; however, I am not sure if this meets your use case. > On 13 Jul 2016, at 19:50, Markovitz, Dudu wrote: > > Hi > > I’m personally not aware of
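A small sketch of both routes for the case-sensitivity question (table and column names are assumptions, and the commons-lang call is illustrative only):

  -- built-in functions already cover the simple case
  SELECT * FROM customers WHERE lower(name) = lower('Smith');
  -- reflect() can call a static Java method directly without writing a UDF
  SELECT name, reflect('org.apache.commons.lang.StringUtils', 'equalsIgnoreCase', name, 'Smith')
  FROM customers;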

Re: Any way in hive to have functionality like SQL Server collation on Case sensitivity

2016-07-14 Thread Jörn Franke
> e.g. > > hive> select java_method ('java.lang.Math','min',45,9) ; > 9 > > I’m not sure how it serves out purpose. > > Dudu > > From: Jörn Franke [mailto:jornfra...@gmail.com] > Sent: Thursday, July 14, 2016 8:55 AM > To: user@hive.apache.org > Subjec
