Re: Hive UDF accessing https request

2016-01-10 Thread Gopal Vijayaraghavan
> javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested There's a Linux package named ca-certificates(-java) which might be miss

Re: simple usage of stack UDTF causes a cast exception

2016-01-10 Thread Gopal Vijayaraghavan
> java.io.IOException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassCastException: org.apache.hadoop.hive.serde2.lazy.LazyString cannot be cast to org.apache.hadoop.io.Text ... at org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableStringObjectInspe

Re: Loading data containing newlines

2016-01-13 Thread Gopal Vijayaraghavan
> We are pushing the compressed text files into HDFS directory for Hive >EXTERNAL table, then using an INSERT on the table using ORC storage. We >are letting Hive handle the ORC file creation process. Are the compressed text files small enough to process one by one? I did write something similar

Re: Using python for hive table quering gives error

2016-01-15 Thread Gopal Vijayaraghavan
> import pyhs2 ... > thrift.Thrift.TApplicationException: Required field 'sessionHandle' is >unset! Struct:TExecuteStatementReq(sessionHandle:null, statement:USE >default, confOverlay:{}) That's a version mismatch in the thrift protocol layer (JDBC to be precise). PyHS2 is deprecated and unmain

Re: Using python for hive table quering gives error

2016-01-15 Thread Gopal Vijayaraghavan
> i could find all examples using pyhs2 only. The pyhs2 site points to dropbox/pyhive. But even that doesn't work for me unless I replace the TCLIService dir with the one generated by thrift-0.9.2 (after HIVE-8829). Maybe you might have better luck, depending on the exact version of HiveServer2

Re: Loading data containing newlines

2016-01-15 Thread Gopal Vijayaraghavan
> You can open a file as an RDD of lines, and map whatever custom >tokenisation function you want over it; That's what a SerDe does in Hive (like OpenCSVSerDe). Once your record gets split into multiple lines, then the problem becomes more complex since Spark's functional nature demands side-eff

Re: what is the difference between "hive.compute.splits.in.am=true" and "hive.compute.splits.in.am=false"

2016-01-18 Thread Gopal Vijayaraghavan
>what is the difference between "hive.compute.splits.in.am=true" and >"hive.compute.splits.in.am=false"? >which value is better? First up, those options are specific to Tez. The old MapReduce model was to always compute splits before asking for resources to run. And this uses the gateway host (whe

Re: what is the difference between "hive.compute.splits.in.am=true" and "hive.compute.splits.in.am=false"

2016-01-18 Thread Gopal Vijayaraghavan
>Thank-you so much for your quick response. Yea, the option is use only >for hive-on-tez. I want to know its source, its principle. in.am=true is the better option as it computes the splits after a job has been submitted. Imagine you have 3 tables in your query - with in.am=false, all the split

Re: HIVE CLI does not escape \t ?

2016-01-20 Thread Gopal Vijayaraghavan
> I'm exporting a table with Hive CLI using hive -f query.hql > file.tsv > Use ^A as a separator ... > Maybe using an alternative SerDe could solve that? Have you tried using the actual SerDe instead of the stdout formatter? INSERT OVERWRITE LOCAL DIRECTORY '...' ; The only issue I've notice

Re: HIVE CLI does not escape \t ?

2016-01-21 Thread Gopal Vijayaraghavan
>I use the workaround cat * >> output.tsv but that's not ideal. > >Any way to constrain the number of files to 1 automatically? I generally use an "ORDER BY 0" to insert a single reducer, which produces exactly 1 file. This is generally not a problem if you have say, <= 1 million rows. HDFS all

Re: RE: Hive Bucketing

2016-01-25 Thread Gopal Vijayaraghavan
> Hi,how to efficient insert into an orc bucket table,I found it too >slow.thanks you Assuming you have partitions & slow inserts, you need to enable the flag from HIVE-6455 set hive.optimize.sort.dynamic.partition=true; Cheers, Gopal

Re: Hive Bucketing

2016-01-26 Thread Gopal Vijayaraghavan
> Ok so what is the resolution here? My understanding is that bucketing >does not improve the performance. Is that correct? There are no right answers here - I spend a lot of time fixing over-zealous optimization attempts

Re: "Create external table" nulling data from source table

2016-01-28 Thread Gopal Vijayaraghavan
> And again: the same row is correct if I export a small set of data, and >incorrect if I export a large set - so I think that file/data size has >something to do with this. My Phoenix vs LLAP benchmark hit size related issues in ETL. In my case, the tipping point was >1 hdfs block per CSV file.

Re: bloom filter used in 0.14?

2016-01-28 Thread Gopal Vijayaraghavan
> So I am questioning whether it is enabled on the version I am on, which >is 0.14. Does anyone know? https://issues.apache.org/jira/browse/HIVE-9188 - fix-version (1.2.0) The version you are using does not have bloom filter support. It should be ignoring the parameter and not generating any bl

Re: NPE when reading Parquet using Hive on Tez

2016-02-02 Thread Gopal Vijayaraghavan
> I dug a little deeper and it appears that the configuration property >"columns.types", which is used >org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.init(), > is not being set. When I manually set that property in hive, your >example works fine. Good to know more about the NPE

Re: Apache hive Thrift PHP

2016-02-05 Thread Gopal Vijayaraghavan
>I have configured hadoop2.7 and apache hive 1.2.1. I want to connect >Apache hive to php using thrift. That's likely to have a lot of pain due to the way different PHP SAPIs handle timeouts. The only place where the thrift API might work correctly is the CLI mode. >/usr/local/hive/lib/php/pac

Re: Is Hive Index officially not recommended?

2016-02-08 Thread Gopal Vijayaraghavan
> Is anybody storing their index in a non-native table such as HBase? ... > Can you please point to implementations of HiveIndexHandler or >AbstractIndexHandler > that have usesIndexTable=false I don't think there are any publicly available implementations yet. The Hive HBase-metastore project

Re: reading ORC format on Spark-SQL

2016-02-10 Thread Gopal Vijayaraghavan
> The reason why I am asking this kind of question is reading csv file on >Spark is linearly increasing as the data size increase a bit, but reading >ORC format on Spark-SQL is still same as the data size increses in >. ... > This cause is from (just property of reading ORC format) or (creating >t

Re: Record too large for Tez in-memory buffer...

2016-02-10 Thread Gopal Vijayaraghavan
Hey, > Trying to benchmark with Hive on Tez causes the following error. >Admittedly these are some very large looking records .. the same job runs >fine on MR2. ... > I've attached the query explain tree. It fails in the very last reducer >phase .. Can you attach the explain plan with hive.exec

Re: hive on tez hadoop-common version problem.

2016-02-16 Thread Gopal Vijayaraghavan
> I have some problem with hive-on-tez. > email thread below is forwarding originally wrote to tez users. AFAIK, this problem only happens with CDH and never with pure Apache bigtop builds. Neither minimal JAR nor the cluster libs work as the problem is with the cluster jar ABIs - the recommende

Re: Tez issues with beeline via HS2

2016-02-17 Thread Gopal Vijayaraghavan
> * i used Gopal's fragment for tez-site.xml >(https://github.com/t3rmin4t0r/tez-autobuild/blob/llap/tez-site.xml.frag) Please check that the tez.lib.uris is filled out properly. I suspect it's all already setup since the CLI mode works anyway, but cross-check that the HS2 classpath does have

Re: Hive 2 release what is new in this release please

2016-02-18 Thread Gopal Vijayaraghavan
> Is there such notes as what is new in Hive 2 say new features etc? Sergey had an in-depth presentation at the last meetup Notable omission - Jason's custom edge for Tez, which vecto

Re: What is the real meaning of negative value in Vertex.

2016-02-18 Thread Gopal Vijayaraghavan
Hi, If you use the newer in.place.progress UI, it will look much better as we have legends [1] which also shows killed tasks (due to pre-emption or to prevent DAG dead-locks). > Map 1: 0(+77,-185)/122 Map 2: 1/1 0(+77, -185)/122 = 0 tasks completed, 77 running, 185 failed attempts (will retry),
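The counters in that progress line can be pulled apart mechanically. The following is a minimal sketch, not anything shipped with Hive or Tez: the regular expression and the `VertexProgress` / `parse_progress` names are made up for illustration, and reading the number after the slash as the vertex's total task count is an assumption, since the explanation above is cut off.

```python
import re
from typing import Dict, NamedTuple


class VertexProgress(NamedTuple):
    completed: int        # tasks finished successfully
    running: int          # the "+N" counter
    failed_attempts: int  # the "-N" counter (attempts that will be retried)
    total: int            # assumed: total tasks in the vertex


# Matches e.g. "Map 1: 0(+77,-185)/122" and the short form "Map 2: 1/1".
_PROGRESS = re.compile(
    r"(?P<name>(?:Map|Reducer) \d+): "
    r"(?P<done>\d+)"
    r"(?:\(\+(?P<running>\d+)(?:,\s*-(?P<failed>\d+))?\))?"
    r"/(?P<total>\d+)"
)


def parse_progress(line: str) -> Dict[str, VertexProgress]:
    """Split one console progress line into per-vertex counters."""
    result = {}
    for m in _PROGRESS.finditer(line):
        result[m.group("name")] = VertexProgress(
            completed=int(m.group("done")),
            running=int(m.group("running") or 0),
            failed_attempts=int(m.group("failed") or 0),
            total=int(m.group("total")),
        )
    return result


progress = parse_progress("Map 1: 0(+77,-185)/122 Map 2: 1/1")
```

A vertex printed without the parenthesised part, like "Map 2: 1/1", is treated here as having nothing running and no failed attempts.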

Re: Hive query on Tez slower than on MR (fails in some cases) ..

2016-02-18 Thread Gopal Vijayaraghavan
> On Tez, this is run as a single DAG of M-R+ ... Can't tell which vertex is the slow one in this. More tooling for isolating which vertex is taking up time (and which task) https://github.com/apache/tez/tree/master/tez-tools/swimlanes or alternatively run https://github.com/t3rmin4t0r/tez-s

Re: Hive query on Tez slower than on MR (fails in some cases) ..

2016-02-19 Thread Gopal Vijayaraghavan
Hi, > Here's the Tez DAG swimlane. Haven't gotten vertex.py to work.. will >send that too soon. Pretty clear that the map-side is fine - splitting sort buffers isn't bothering this at all. We want to over-partition Reducer 7 and possibly have all of them pick the total # of reducers dynamically

Re: /tmp/hive/hive is exceeded: limit=1048576 items=1048576

2016-02-22 Thread Gopal Vijayaraghavan
> Is there a setting somewhere to automatically remove old temp files from >/tmp/hive/hive? Yes, in Falcon's data retention (FALCON-870 ) That's been backported as a hot-fix for those who need to enforce Calendar based retention policies from a

Re: Anyway to avoid creating subdirectories by "Insert with union"

2016-02-23 Thread Gopal Vijayaraghavan
>Is there anyway to avoid creating sub-directories while running in tez? >Or this is by design and can not be changed? Yes, this is by design. The Tez execution of UNION is entirely parallel & the task-ids overlap - so the files created have to have unique names. But the total counts for "Map 1

Re: Anyway to avoid creating subdirectories by "Insert with union"

2016-02-24 Thread Gopal Vijayaraghavan
> SET mapred.input.dir.recursive=TRUE; ... > Can we set above setting as tblProperties or Hive Table properties. Not directly, those are MapReduce properties - they are not settable via Hive tables. That said, you can write your own SemanticAnalyzerHooks to do pretty much anything you want like t

Re: Hive 2 performance

2016-02-25 Thread Gopal Vijayaraghavan
> Correct hence the question as I have done some preliminary tests on Hive >2. > I want to share insights with other people who have performed the same If you have feedback on Hive-2.0, I'm all ears. I'm building up 2.1 features & fixes, so now would be a good time to bring stuff up. Speed most

Re: Hive Cli ORC table read error with limit option

2016-02-29 Thread Gopal Vijayaraghavan
> Failed with exception java.io.IOException:java.lang.RuntimeException: >serious problem > Time taken: 0.32 seconds ... > Any one faced this issue. No, but that sounds like one of the codepaths I put in - is this a Kerberos secure cluster? Try disabling the optimization and see if it works. set

Re: Hive Cli ORC table read error with limit option

2016-02-29 Thread Gopal Vijayaraghavan
> Yes it is kerberos cluster. ... > After disabling the optimization in hive cli, it works with limit >option. Alright, then it is fixed in - https://issues.apache.org/jira/browse/HIVE-13120 Cheers, Gopal

Re: Wrong column is picked in HIVE 2.0.0 + TEZ 0.8.2 left join

2016-03-01 Thread Gopal Vijayaraghavan
(Bcc: Tez, Cross-post to hive) > I added "set hive.execution.engine=mr;" at top of the script, seems the >result is correct... Pretty sure it's due to the same table aliases for both dummy tables (they're both called _dummy_table) auto join conversion. hive> set hive.auto.convert.join=false; Sho

Re: Wrong column is picked in HIVE 2.0.0 + TEZ 0.8.2 left join

2016-03-01 Thread Gopal Vijayaraghavan
On 3/1/16, 10:41 AM, "Sergey Shelukhin" wrote: >Can you please open a Hive JIRA? It is a bug. https://issues.apache.org/jira/browse/HIVE-13191 https://issues.apache.org/jira/browse/HIVE-13190 Cheers, Gopal

Re: queues, beeline/hs2 and tez

2016-03-01 Thread Gopal Vijayaraghavan
> tez.queue.name via the --hiveconf switch on beeline and it doesn't look >to me it works. the question is... should it? Nope, it shouldn't, because of Tez sessions the conf param is not job. The tez.queue.name can be changed while a JDBC connection is up, so it is not picked up from the conf &

Re: Hive Cli ORC table read error with limit option

2016-03-04 Thread Gopal Vijayaraghavan
> Any one has any idea about this.. Really stuck with this. ... > hive> select h from testdb.table_orc where year = 2016 and month =1 and >day >29 limit 10; Depends on whether any of those columns are partition columns or not & whether the table is marked transactional. > Caused by: java.lang.I

Re: Hive Cli ORC table read error with limit option

2016-03-07 Thread Gopal Vijayaraghavan
> cvarchar(2) ... > Num Buckets: 7 I suspect this might be related to having 0 row files in the buckets not having any recorded schema. You can also experiment with hive.optimize.index.filter=false, to see if the zero row case is artificially produced via predi

Re: Hive 2 insert error

2016-03-07 Thread Gopal Vijayaraghavan
> Is this something new in Hive 2 as I don't recall having this issue >before? No. > | CREATE TABLE `sales3`( | > | `prod_id` bigint, | > | STORED AS INPUTFORMAT

Re: Simple UDFS and IN Operator

2016-03-08 Thread Gopal Vijayaraghavan
> In Hive 0.11, I've written a UDF that returns a list of Integers. I'd >like to use this in a WHERE clause of a query, something like SELECT * >FROM WHERE in ( getList()). ... > Joins would be ideal, but we haven't upgraded yet. IN() is actually rewritten into a JOIN (distinct ...) intern

Re: ODBC drivers for Hive 2

2016-03-10 Thread Gopal Vijayaraghavan
> If yes, maybe one should think about an open source one, which is >reliable and supports a richer set of Odbc functionality. I had a similar thought last week, which ended up with me discovering that the hive/odbc folder is full of dead code. I'm going to rm -rvf odbc/ with https://issues.apac

Re: Tez job submissions failing when cluster is under provisioned..

2016-03-10 Thread Gopal Vijayaraghavan
> This seems to be something YARN fair-scheduler reporting it this way.. >although Tez doesn't seem to handle. Pepperdata? > I did come across HIVE-12957, in which the fix patch seems to only >report the error better instead of doing anything about it. ... > Now comes my question, is this in a

Re: Spark SQL is not returning records for HIVE transactional tables on HDP

2016-03-13 Thread Gopal Vijayaraghavan
> We are using for Spark SQL : Does SparkSQL support Transactional tables? I thought Transactional tables needed a dead-lock proof LockManager, which was a hive-specific feature? Cheers, Gopal

Re: Issue with Star schema

2016-03-15 Thread Gopal Vijayaraghavan
>I have a query where I am joining with 10 other entities Are you using Tez? This looks like an obvious candidate for a broadcast join. Cheers, Gopal

Re: Tez reducer parallelism ..

2016-03-15 Thread Gopal Vijayaraghavan
> A lot of our queries do the following style of simultaneous windowing .. The windowing is not simultaneous unless they are all over the same window - the following query has 3 different windows applied over the same rows sequentially. > SELECT >row_number() OVER( PARTITION BY app, user, > t

Re: The build-in indexes in ORC file does not work.

2016-03-18 Thread Gopal Vijayaraghavan
> I have tried bloom filter, but it makes no improvement. I know about > tez, but never use, I will try it later. ... >select count(*) from gprs where terminal_type=25080; > will not scan data > Time taken: 353.345 seconds CombineInputFormat does not do any split-elimination, so MapRed

Re: Tez reducer parallelism ..

2016-03-19 Thread Gopal Vijayaraghavan
> So you're saying, since these windows are part of a single SELECT >projection they need to be serial? Yes, with a full shuffle of the result so far for each new OVER(). > row_number() OVER( PARTITION BY app, user, type ORDER BY ts >) as a_number, > row_number() OVER( P

Re: Hiding staging directory data on HDFS

2016-03-19 Thread Gopal Vijayaraghavan
> --1 Move .CSV data into HDFS staging area Per-user staging areas with Kerberos auth is standard practice. As long as you're not running a vanilla Apache install (i.e Ranger KMS + SSL certificates for KMS needed), you can encrypt users away from each other[1] or from threat of physical hardware

Re: The build-in indexes in ORC file does not work.

2016-03-19 Thread Gopal Vijayaraghavan
> I love to see these ORC table optimization help but it is not obvious to >me under what circumstances they bare fruit. Are you using Tez or LLAP? Your explain plans are clearly missing the optimizations I've added as part of Stinger.next. https://github.com/apache/hive/blob/master/ql/src/test/

Re: Mechanism when doing a select *

2016-03-21 Thread Gopal Vijayaraghavan
> Or does all the data go directly from the datanodes to my client ? Not yet. https://issues.apache.org/jira/browse/HIVE-11527 Cheers, Gopal

Re: Mechanism when doing a select *

2016-03-21 Thread Gopal Vijayaraghavan
>> Or does all the data go directly from the datanodes to my client ? Not yet. https://issues.apache.org/jira/browse/HIVE-11527 Cheers, Gopal

Re: Automatic Update statistics on ORC tables in Hive

2016-03-27 Thread Gopal Vijayaraghavan
> This might be a bit far fetched but is there any plan for background >ANALYZE STATISTICS to be performed on ORC tables https://issues.apache.org/jira/browse/HIVE-12669 Cheers, Gopal

Re: What's the advised way to do groupby 2 attributes from a table with 1000 columns?

2016-03-27 Thread Gopal Vijayaraghavan
> I only need to query 3 columns, ... > The source table is about 1PB. Format of this table is extremely critical. A columnar data format like ORC is recommended to avoid reading any other columns when reading 3 out of 1000. > Will it be advised to do a subquery first, and then send it to the >

Re: Container out of memory: ORC format with many dynamic partitions

2016-04-30 Thread Gopal Vijayaraghavan
> SET hive.exec.orc.memory.pool=1.0; Might be a bad idea in general, this causes more OOMs than less. > SET mapred.map.child.java.opts=-Xmx2048M; > SET mapred.child.java.opts=-Xmx2048M; ... > Container >[pid=6278,containerID=container_e26_1460661845156_49295_01_000244] is >running beyond physic

Re: [VOTE] Bylaws change to allow some commits without review

2016-05-20 Thread Gopal Vijayaraghavan
> I've contacted all PMCs by now, hoping for some more votes. So if any of >you PMCs have a few minutes to read the proposal and vote that'd be great! +1. >> The exact sentence I propose to add is: "Minor issues (e.g. typos, code >> style issues, JavaDoc changes. At committer's discretion) can b

Re: Copying all Hive tables from Prod to UAT

2016-05-25 Thread Gopal Vijayaraghavan
> We are using HDP. Is there any feature in ambari Apache Falcon handles data lifecycle management, not Ambari. https://falcon.apache.org/0.8/HiveDR.html Cheers, Gopal

Re: Anyone successfully deployed Hive on TEZ engine?

2016-05-30 Thread Gopal Vijayaraghavan
> In short at the simplest set up what Resource Manager it works with? Tez+Hive needs HDFS and YARN 2.6.0+ (preferably as close to an Apache build as possible - CDH clusters need more work). Hive2 needs Apache Slider 0.91 right now, to start the cache daemons on YARN (see SLIDER-82). > If so ki

Re: Anyone successfully deployed Hive on TEZ engine?

2016-05-30 Thread Gopal Vijayaraghavan
> I do not use any vendor's product., All my own set up, build and >configure. My autobuild scripts should serve as readable documentation for this, since nearly everything's in a single Makefile with an install: target. Or take the easy route with $ make dist install In case you use the llap b

Re: My first TEZ job fails

2016-05-30 Thread Gopal Vijayaraghavan
> hduser@rhes564: /usr/lib/apache-tez-0.7.1-bin> hadoop jar >./tez-examples-0.7.1.jar orderedwordcount /tmp/input/test.txt >/tmp/out/test.log Sure, you're missing file:/// - the defaultFS is most likely hdfs://:/ The inputs and outputs without a scheme prefix will go to the defaultFS configured in cor
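The resolution rule described in that reply can be illustrated with a small sketch. This is not Hadoop code: `qualify` is a hypothetical helper, the `hdfs://namenode.example:8020` URI is a placeholder for whatever the cluster's defaultFS is, and only absolute paths are handled (relative paths are resolved differently and are left out here).

```python
from urllib.parse import urlparse


def qualify(path: str, default_fs: str) -> str:
    """Return path unchanged if it carries a scheme, else prefix default_fs."""
    if urlparse(path).scheme:
        # An explicit scheme such as file:/// or hdfs:// wins.
        return path
    # No scheme: the path is interpreted on the default filesystem.
    return default_fs.rstrip("/") + "/" + path.lstrip("/")


DEFAULT_FS = "hdfs://namenode.example:8020"  # placeholder, not from the thread

on_default_fs = qualify("/tmp/input/test.txt", DEFAULT_FS)
on_local_disk = qualify("file:///tmp/input/test.txt", DEFAULT_FS)
```

Under these assumptions the unprefixed `/tmp/input/test.txt` is looked up on HDFS, while the `file:///` form stays on the local disk, which is the difference the reply is pointing at.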

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-30 Thread Gopal Vijayaraghavan
> That being said all systems are evolving. Hive supports tez+llap which >is basically the in-memory support. There is a big difference between LLAP & SparkSQL, which has to do with access pattern needs. The first one is related to the lifetime of the cache - the Spark RDD cache is per-use

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-31 Thread Gopal Vijayaraghavan
> but this sounds to me (without testing myself) adding caching capability >to TEZ to bring it on par with SPARK. Nope, that was the crux of the earlier email. "Caching" seems to be a catch-all term misused in that comparison. >> There is a big difference between LLAP & SparkSQL, which has

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-31 Thread Gopal Vijayaraghavan
> Can LLAP be used as a caching tool for data from Oracle DB or any RDBMS. No, LLAP intermediates HDFS. It holds column & index data streams as-is (i.e dictionary encoding, RLE, bloom filters etc are preserved). Because it does not cache row-tuples, it cannot exist as a caching tool for another

Re: Error while running Hive on Microsoft Azure HDInsight Hadoop Linux Cluster

2016-06-02 Thread Gopal Vijayaraghavan
> it gives me the following error "FAILED: ParseException line 24:0 >missing EOF at 'format' near')'". Sounds like a file line ending problem - are you sure the file isn't in a different line ending mode? > -- The following lines describe the format and location of the file You can try removing

Re: Using Hive table for twitter data

2016-06-09 Thread Gopal Vijayaraghavan
> Has anyone done recent load of twitter data into Hive table. Not anytime recently, but the twitter corpus was heavily used to demo Hive. Here's the original post on auto-learning schemas from an arbitrary collection of JSON docs (like a MongoDB dump). http://hortonworks.com/blog/discovering-h

Re: Using Hive table for twitter data

2016-06-09 Thread Gopal Vijayaraghavan
> Any reason why that table in Hive cannot read data in? No idea how you're loading data with flume, but it isn't doing it right. >> PARTITIONED BY (datehour INT) ... >> -rw-r--r-- 2 hduser supergroup 433868 2016-06-09 09:52 >>/twitter_data/FlumeData.1465462333430 No ideas on how to get

Re: Optimized Hive query

2016-06-14 Thread Gopal Vijayaraghavan
> You can see that you get identical execution plans for the nested query >and the flatten one. It wasn't always that way, though. Back when I started with Hive, before Stinger, it didn't have the identity project remover. To know if your version has this fix, try looking at hive> set hive.optimize.rem

Re: Optimized Hive query

2016-06-14 Thread Gopal Vijayaraghavan
> So I was hoping of using internal Hive CBO to somehow change the AST >generated for the query somehow. Hive does have an "explain rewrite" but that prints out the query before CBO runs. For CBO, you need to dig all the way down to the ASTBuilder class and work upwards from there. Perhaps add

Re: Network throughput from HiveServer2 to JDBC client too low

2016-06-20 Thread Gopal Vijayaraghavan
> is hosting the HiveServer2 is merely sending data with around 3 MB/sec. >Our network is capable of much more. Playing around with `fetchSize` did >not increase throughput. ... > --hiveconf >mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec >\ The current implementation

Re: Optimize Hive Query

2016-06-22 Thread Gopal Vijayaraghavan
> Long running query : Are you running this on MapReduce or Tez? Please post the output of explain - if you are seeing > 1 shuffle edge in your query while having only one window for OVER(), that might be the reason. OVER ( PARTITION BY m_d_key , sb_gu_key ORDER BY t_ev_st_dt) The multipl

Re: Why does ORC use Deflater instead of native ZlibCompressor?

2016-06-23 Thread Gopal Vijayaraghavan
> Though, I'm also wondering about performance difference between >the two. Since they both use native implementations, theoretically they >can be close in performance. ZlibCompressor block compression was extremely slow due to the non-JNI bits in Hadoop -

Re: Optimize Hive Query

2016-06-24 Thread Gopal Vijayaraghavan
> Please help me on this. Let me know if you need other info. Are the ORC tables fully compacted? Looks like you're running a version of Hive-ACID, which does not perform well without compacting delta files. dfs -ls ; should tell you whether there are any delta_* files in the list. > |

Re: Optimize Hive Query

2016-06-24 Thread Gopal Vijayaraghavan
> Yes for this tables, ACID enabled. it has only 256 files for each >buckets. these are create only when data initially loaded in this table. Yes, the initial load goes in as an insert DELTA too - that requires another compaction to move into base files. The fact that they haven't been automati

Re: How does tez calculate the number of Mappers/Reducers?

2016-06-24 Thread Gopal Vijayaraghavan
> While our StorageHandler does utilize a SERDE that correctly returns >SerDeStats, it seems like the optimizer is ignoring these values. AFAIK, the stats impl is assumed to be approximate & aggregate and is never used for setting up execution. > Would anyone know how to correctly set these valu

Re: How does tez calculate the number of Mappers/Reducers?

2016-06-24 Thread Gopal Vijayaraghavan
>Do you know how the number of splits is calculated? To do that properly needs a whiteboard and a couple of hours - with the primary complex variable being the YARN headroom calculation. The simplest way to put it would be that it computes splits, tries to find out the available capacity and trie

Re: Hash table in map join - Hive

2016-06-27 Thread Gopal Vijayaraghavan
> 1. Is there a way to check the size of the hash table created during map >side join in Hive/Tez? Only from the log files. However, if you enable hive.tez.exec.print.summary=true; then the hive CLI will print out the total # of items shuffled from the broadcast edges feeding the hashtable. Not sure

Re: Querying Hive tables from Spark

2016-06-27 Thread Gopal Vijayaraghavan
> It appears to me that Spark does not rely on statistics that are >collected by Hive on say ORC tables. > It seems that Spark uses its own optimization to query the Hive tables >irrespective of Hive has collected by way of statistics etc? Spark does not have a cost based optimizer yet - please fo

Re: How does tez calculate the number of Mappers/Reducers?

2016-06-27 Thread Gopal Vijayaraghavan
>Correct me if I'm wrong but at this point isn't the number of splits >calculated? Yes you are correct, but the grouping kicks in after that. The real reason for grouping is because Shuffle operations are internally MxN and explode out of control if grouping hasn't been done. Running through 50
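The MxN point is easy to see with a little arithmetic. The sketch below is only a back-of-the-envelope model: `shuffle_connections` is a hypothetical helper, and the split, reducer, and grouping figures are invented for illustration rather than taken from the thread.

```python
import math


def shuffle_connections(num_splits: int, num_reducers: int,
                        splits_per_task: int = 1) -> int:
    """Count mapper-to-reducer fetches when every mapper feeds every reducer."""
    # Grouping packs several splits into one map task, shrinking M.
    mappers = math.ceil(num_splits / splits_per_task)
    return mappers * num_reducers


# Illustrative numbers only (assumptions, not from the original message).
ungrouped = shuffle_connections(num_splits=50_000, num_reducers=1_000)
grouped = shuffle_connections(num_splits=50_000, num_reducers=1_000,
                              splits_per_task=25)
```

With these made-up inputs, one task per split means 50,000,000 fetches, while grouping 25 splits per task brings that down to 2,000,000 for the same data.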

Re: Querying Hive tables from Spark

2016-06-27 Thread Gopal Vijayaraghavan
> I added a compact index to this table as below on 5 columns No, those are not what I recommend in this scenario. You made a statement that the table was sorted and it wasn't. >>Table is sorted in the order of prod_id, cust_id,time_id, channel_id and >> promo_id. It has 22 million rows. >> No

Re: Hash table in map join - Hive

2016-06-27 Thread Gopal Vijayaraghavan
> 1. OOM condition -- I get the following error when I force a map join in >hive/tez with low container size and heap size:" >java.lang.OutOfMemoryError: Java heap space". I was wondering what is the >condition which leads to this error. You are not modifying the noconditionaltasksize to match th

Re: Hive error : Can not convert struct<> to

2016-06-28 Thread Gopal Vijayaraghavan
> PARTITION(state='CA') > SELECT * WHERE se.adr.st='CA' > FAILED: SemanticException [Error 10044]: Line 2:23 Cannot insert into >target table because column number/types are different ''CA'': The error is bogus, but the issue has to do with the "SELECT *". Inserts where a partition is specified

Re: Hash table in map join - Hive

2016-06-30 Thread Gopal Vijayaraghavan
> 1. In the query plan, it still says Map Join Operator (Would have >expected it to be named as Reduce side operator). The "Map" in that case refers really to Map rather than the hadoop version. An unambiguous name would be HashJoinOperator. This is one of the optimizations of Tez righ

Re: Possible Bug: to_date("2015-01-15") returns a string

2016-06-30 Thread Gopal Vijayaraghavan
> I ran into this unusual behavior while converting a date string into a >date. If you're on Hive-1.x, this couldn't be fixed due to backwards-compatibility requirements. If I remember correctly, to_date() pre-dates the Date type in Hive. Marked incompatible for backport -

Re: Hash table in map join - Hive

2016-06-30 Thread Gopal Vijayaraghavan
> But, I got a comment from the author that, the patch wouldn't affect -- >hive.tez.auto.reducer.parallelism=true. > Am I missing something? No, I've linked to the wrong JIRA :( Cheers, Gopal

Re: Tez jobs on YARN failing sporadically..

2016-07-05 Thread Gopal Vijayaraghavan
> when the executor is overwhelmed with tasks or execute() is called while >shutting down. I'm confounded as to why this would be an issue suddenly. > Container container_e23_1466828114374_53316_01_18 finished with >diagnostics set to Container failed, exitCode=-1000. Task >java.util.concurr

Re: Hash table in map join - Hive

2016-07-06 Thread Gopal Vijayaraghavan
> I tried running the shuffle hash join with auto reducer parallelism >again. But, it didn't seem to take effect. With merge join and auto >reduce parallelism on, number of > reducers drops from 1009 to 337, but didn't see that change in case of >shuffle hash join .Should I be doing something more

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-11 Thread Gopal Vijayaraghavan
> Status: Finished successfully in 14.12 seconds > OK > 1 > Time taken: 14.38 seconds, Fetched: 1 row(s) That might be an improvement over MR, but that still feels far too slow. Parquet numbers are in general bad in Hive, but that's because the Parquet reader gets no actual love from th

Re: Hash table in map join - Hive

2016-07-14 Thread Gopal Vijayaraghavan
e-req for the last problem would be to go fix <https://issues.apache.org/jira/browse/TEZ-2962>, which tracks pre-compression sizes. I'll have to think a bit more about the other part of the problem. Cheers, Gopal On 7/6/16, 12:52 PM, "Gopal Vijayaraghavan" wrote: > >

Re: Hive on TEZ + LLAP

2016-07-15 Thread Gopal Vijayaraghavan
> I have also heard about Hortonworks with Tez + LLAP but that is a distro? Yes. AFAIK, during Hadoop Summit there was a HDP 2.5 techpreview sandbox instance which shipped Hive2 (scroll down all the way to end in the downloads page). Enable the "interactive mode" in Ambari for a HiveServer2 conf

Re: Hash table in map join - Hive

2016-07-15 Thread Gopal Vijayaraghavan
> When is OOM error actually thrown? With >hive.mapjoin.hybridgrace.hashtable set to true, spilling should be >possible, so OOM error should not come. ... > Is it the case when the hash table of not even one of the 16 partitions >fits in memory? It will OOM if any one of them overflows. The gra
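A toy model of the "any one of them overflows" condition, under loud assumptions: this is not Hive's hybrid grace hash join code, the fixed row size and CRC32-based partitioning are stand-ins chosen to keep the sketch deterministic, and the only number taken from the thread is the 16 partitions.

```python
import zlib
from collections import Counter

NUM_PARTITIONS = 16  # the partition count mentioned in the question


def partition_sizes(keys, row_bytes: int) -> Counter:
    """Estimate bytes landing in each partition of the build-side table."""
    sizes = Counter()
    for key in keys:
        # Stable hash so the result does not depend on Python's hash seed.
        part = zlib.crc32(str(key).encode("utf-8")) % NUM_PARTITIONS
        sizes[part] += row_bytes
    return sizes


def would_overflow(keys, row_bytes: int, memory_budget_bytes: int) -> bool:
    """True if the single largest partition exceeds the memory budget."""
    sizes = partition_sizes(keys, row_bytes)
    return max(sizes.values(), default=0) > memory_budget_bytes


# 1000 rows that all share one join key end up in one partition (skew).
skewed_keys = ["hot_key"] * 1000
skew_overflows = would_overflow(skewed_keys, row_bytes=100,
                                memory_budget_bytes=50_000)
```

In this model spilling helps only when the data spreads across partitions; a single heavily skewed key puts all 100,000 bytes into one partition, which exceeds the 50,000-byte budget no matter how many partitions exist.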

Re: Yarn Application ID for Hive query

2016-07-18 Thread Gopal Vijayaraghavan
> be nice to have access to a command or API call in HiveServer2 similar >to MySQL's "SHOW PROCESSLIST" (and equivalent commands in most other >databases). There is one - if you have the HiveServer2 UI (in 2.0), that can be seen. It would take a 10-15 line JSP script to export that as a JSON API

Re: Hive on TEZ + LLAP

2016-07-18 Thread Gopal Vijayaraghavan
> Also has there been simple benchmarks to compare: > > 1. Hive on MR > 2. Hive on Tez > 3. Hive on Tez with LLAP I ran one today, with a small BI query in my test suite against a 1Tb data-set. TL;DR - MRv2 (203.317 seconds), Tez (13.681s), LLAP (3.809s). *Warning*: This is not a historical vi

Re: Hive on TEZ + LLAP

2016-07-18 Thread Gopal Vijayaraghavan
> These looks pretty impressive. What execution mode were you running >these? Yarn client may be? There is no other mode - everything runs on YARN. > 53 times The factor is actually bigger in actual execution. The MRv2 version takes 2.47s to prep a query, while the LLAP version takes 1.64s.
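A quick check of the arithmetic behind the "53 times" figure, using the wall-clock totals quoted earlier in this thread (MRv2 203.317s, LLAP 3.809s) and the prep times above. This is only a sketch: it assumes the prep time is included in each total and can simply be subtracted, which the messages do not state explicitly.

```python
# Wall-clock totals from the earlier message in this thread (seconds).
mr_total = 203.317
llap_total = 3.809

# Query preparation times quoted in this message (seconds).
mr_prep = 2.47
llap_prep = 1.64

# End-to-end speedup: the "53 times" figure.
end_to_end = mr_total / llap_total

# Execution-only speedup, ASSUMING prep time is part of each total.
execution_only = (mr_total - mr_prep) / (llap_total - llap_prep)

print(f"end-to-end speedup:     {end_to_end:.1f}x")
print(f"execution-only speedup: {execution_only:.1f}x")
```

Under that assumption the execution-only ratio comes out noticeably larger than the end-to-end one, because the fixed prep cost is a much bigger share of the LLAP total than of the MRv2 total.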

Re: Hive on TEZ + LLAP

2016-07-19 Thread Gopal Vijayaraghavan
> What was the type (Parquet, text, ORC etc) and row count for each three >tables above? I always use ORC for flat columnar data. ORC is designed to be ideal if you have measure/dimensions normalized into tables - most SQL workloads don't start with an indefinite depth tree. hive> select count(1

Re: Some dates add/less a day...

2016-07-29 Thread Gopal Vijayaraghavan
> 1946-10-01 ... > Any idea to help me? What timezone is this? The years between 1946 and 1966 are extremely strange for timezones for the US/Canada (the period between "WW II Time" ending and the Uniform Time Act of 1966). Cheers, Gopal

Re: Doubt on Hive Partitioning.

2016-08-01 Thread Gopal Vijayaraghavan
> WHERE p IN (SELECT p FROM t2) > here we could argue that Hive could optimize this by computing the sub >query first, > and then do the partition pruning, but sadly I don't think this >optimisation has been implemented yet It is implemented already -

Re: Hive transactional table with delta files, Spark cannot read and sends error

2016-08-01 Thread Gopal Vijayaraghavan
> Spark fails reading this table. What options do I have here? Would your issue be the same as https://issues.apache.org/jira/browse/SPARK-13129? LLAPContext in Spark can read those tables with ACID semantics (as in delete/updates will work right). var conn = LlapContext.newInstance(sc, hs2_u

Re: Hive transactional table with delta files, Spark cannot read and sends error

2016-08-01 Thread Gopal Vijayaraghavan
> I am on Spark 1.6.1 and getting the following error Ah, I realize that it's yet to be released officially. Here's the demo from HadoopSummit - I doubt this will ever be available for older spark releases but will be a datasource package lik

Re: Hive LIKE predicate. '_' wildcard decrease perfomance

2016-08-04 Thread Gopal Vijayaraghavan
> where res_url like '%mts.ru%' ... > where res_url like '%mts_ru%' ... > Why '_' wildcard decrease perfomance? Because it misses the fast path by just one "_". ORC vectorized reader has a zero-copy check for 3 patterns - prefix, suffix and middle. That means "https://%", "%.html", "%mts.ru%" w
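To illustrate the three fast-path shapes mentioned above, here is a minimal sketch of how a LIKE pattern could be sorted into prefix, suffix, or middle, with everything else falling back to a slower general matcher. The function name `classify_like` and the category labels are hypothetical, escape characters are ignored, and this is not Hive's actual code.

```python
def classify_like(pattern: str) -> str:
    """Classify a SQL LIKE pattern by the cheapest check that could evaluate it.

    Hypothetical helper for illustration only; escape sequences are not handled.
    """
    # Any single-character wildcard forces the general (slow) matcher.
    if "_" in pattern:
        return "complex"

    # A '%' left over after trimming both ends means an interior wildcard.
    body = pattern.strip("%")
    if "%" in body:
        return "complex"

    starts = pattern.startswith("%")
    ends = pattern.endswith("%")
    if starts and ends:
        return "middle"   # '%abc%' -> substring search
    if ends:
        return "prefix"   # 'abc%'  -> starts-with check
    if starts:
        return "suffix"   # '%abc'  -> ends-with check
    return "exact"        # no wildcards at all


for p in ["https://%", "%.html", "%mts.ru%", "%mts_ru%"]:
    print(p, "->", classify_like(p))
```

In this sketch the only difference between '%mts.ru%' and '%mts_ru%' is the single "_", which is enough to move the pattern out of the simple substring case.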

Re: Vectorised Query Execution extension

2016-08-04 Thread Gopal Vijayaraghavan
> Vectorized query execution streamlines operations by processing a block >of 1024 rows at a time. The real win of vectorization + columnar is that you get to take advantage of them at the same time. We get to execute the function once per 1024 rows when things are repeating - particularly true w

Re: hive concurrency not working

2016-08-05 Thread Gopal Vijayaraghavan
> Depends on how you configured scheduling in yarn ... ... >> you won't have this problem if you use Spark as the execution engine? >>That handles concurrency OK If I read this right, it is unlikely to be related to YARN configs. The Hue issue is directly related to how many Tez/Spark sessions

Re: beeline/hiveserver2 + logging

2016-08-09 Thread Gopal Vijayaraghavan
> not get the progress messages back until the query finishes which >somewhat defeats the purpose of interactive usage. That happens entirely on the client side btw. So to avoid a hard sleep() + check loop causing pointless HTTP traffic, HiveServer2 now does a long poll on the server side. hive.

Re: hive throws ConcurrentModificationException when executing insert overwrite table

2016-08-16 Thread Gopal Vijayaraghavan
> This problem has blocked me a whole week, anybodies have any ideas? This might be a race condition here. aclStatus.getEntries(); is being modified without being copied (oddly
