> CREATE TABLE T . AS
The limit clause in the outermost query should prevent the entire query
from executing.
However, the CREATE TABLE expression and the UNION ALL are rather
challenging in this matter.
If you have queries which don't hit the NULL-scan fully, a BUG report
would be
> Yes, Kylin generated the query. I'm using Kylin 1.5.3.
I would report a bug to Kylin about DISTRIBUTE BY RAND().
This is what happens when a node which ran a Map task fails and the whole
task is retried.
Assume that the first attempt of the Map task0 wrote value1 into
reducer-99, because
> val d = HiveContext.read.format("jdbc").options(
...
>> The sqoop job takes 7 hours to load 15 days of data, even while setting
>>the direct load option to 6. Hive is using MR framework.
In general, the JDBC implementations tend to react rather badly to large
extracts like this - the
Hi,
I'm writing this because I realize this stuff needs users to comment on it
before it gets set into an ABI release.
Hive 2.2.0 will check if a user's UDF has a vectorized version before wrapping
it row-by-row.
@VectorizedExpressions(value = { VectorStringRot13.class })
> another case of a query hangin' in v2.1.0.
I'm not sure that's a hang. If you can repro this, can you please do a jstack
while it is "hanging" (like a jstack of hiveserver2 or cli)?
I have a theory that you're hitting a slow path in HDFS remote read because of
the following stacktrace.
> 1) confirm your beeline java process is indeed running with expanded memory
The OOM error is clearly coming from the HiveServer2 CBO codepath post beeline.
at
org.apache.calcite.rel.AbstractRelNode$1.explain_(AbstractRelNode.java:409)
at
> Are there any other ways?
Are you running Tez?
Tez heartbeats counters back to the AppMaster every few seconds, so the
AppMaster has an accurate (but delayed) count of HDFS_BYTES_WRITTEN.
Cheers,
Gopal
> I have a query that hangs (never returns). However, when i turn on
>logging to DEBUG level it works. I'm stumped.
Slowing down things does make a lot of stuff work - logging does something
more than slow things down, it actually introduces a synchronization point
(global lock) for each log.
> It will be ok if the file has more than two characters, that is a little
> interesting. I cannot understand why the result of function checkInputFormat
> is OrcInputFormat, maybe that is just right.
My guess is that it is trying to read the 3 letter string "ORC" from that file
and failing.
> java.sql.SQLException: Error while processing statement: FAILED: Execution
> Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Unable to
> alter partition. alter is not possible
> Altering works fine in a test table in our dev env. The logs here aren't too
> helpful.
The
> Is there a way to create an external table on a directory, extract 'key' as
> file name and 'value' as file content and write to a sequence file table?
Do you care that it is a sequence file?
The HDFS HAR format was invented for this particular problem, check if the
"hadoop archive" command
> Dimensions change, and I'd rather do update than recreate a snapshot.
Slow changing dimensions are the common use-case for Hive's ACID MERGE.
The feature you need is most likely covered by
https://issues.apache.org/jira/browse/HIVE-10924
2nd comment from that JIRA
"Once an hour, a set of
> NULL::character%20varying)
...
> i want to say this is somehow related to a java version (we're using 8)
>but i'm not sure.
The "character varying" looks a lot like a Postgres issue to me
(though character varying could be the real term for varchar in another
DB).
The hive-metastore.log
> anybody run up against this one? hive 2.1.0 + using a "not in" on a
>list + the column is a partition key participant.
The partition filters are run before the plan is generated.
>AND etl_source_database not in ('foo')
Is there a 'foo' in etl_source_database?
> predicate:
> not array_contains(array('foo'), partition_key)
And this is why that works.
https://issues.apache.org/jira/browse/HIVE-13951 :(
Cheers,
Gopal
> If I run a query with CREATE TABLE AS, it breaks with the error below.
> However, just running the query works if I don't try to create a table from
> the results. It does not happen to all CTAS queries.
Not sure if that's related to Tez at all.
Can try running it with
set
> I'm running into the below error occasionally and I'm not 100% certain what's
> going on. Does anyone have a hunch what might be happening here or where we
> can dig for more ideas? Removed row contents but there are multiple columns.
You can try a repro run by doing
set
> Could someone provide me with a code snippet (preferably Java) that installs
> the schema (through datanucleus) on my empty metastore (postgres)
I wish it was that simple, but do not leave it to the Hive startup to create it
- create it explicitly with schematool
> I'm wondering why Hive tries to scan all partitions when the quotes are
> omitted. Without the quotes, shouldn't 2016-11-28-00 get evaluated as an
> arithmetic expression, then get cast to a string, and then partitioning
> pruning still occur?
The order of evaluation is different - String =
> Thanx for the suggestion. It works with the setting you suggested.
>
> What does this mean? Do I need to special case this query.
You need to report a bug on https://issues.apache.org/jira/browse/HIVE,
because this needs to get fixed.
> Turning off CBO cluster-wide won't be the right thing
> I've run into a GROUP BY that does not work reliably in the newer version:
> the GROUP BY results are not always fully aggregated. Instead, I get lots of
> duplicate + triplicate sets of group values. Seems like a Hive bug to me
That does sound like a bug, but this information is not enough
> Attached (assuming attachments actually work on this list) are three
> explains:
..
> TungstenAggregate(key=[artid#5,artorsec#0,page#11,geo#12,
The explain plan shows you are not using Hive-on-Spark, but SparkSQL.
> If you'd like any other info, or if you'd like me to test with other
> I am not sure what is going on here.
You can check the /tmp/$USER/hive.log and see what's happening in detail.
Cheers,
Gopal
> Thanx for the reply. We don't override the log level. According to the docs,
> looks like the default level is INFO.
> Any other ideas?
That, at first glance, looks like a broken install. A good approach would be to
use a Tez cluster install instead of messing with a local mode runner
> The partition is by year/month/day/hour/minute. I have two directories - over
> two years, and the total number of records is 50Million.
That's a million partitions with 50 rows in each of them?
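The arithmetic behind that guess (a hedged back-of-the-envelope, assuming minute-grain partitions over exactly two years with no gaps):

```python
# one partition per minute, over two years
minutes = 2 * 365 * 24 * 60
rows_per_partition = 50_000_000 // minutes

print(minutes)             # 1051200 partitions
print(rows_per_partition)  # 47 rows in each
```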
> I am seeing it takes more than 1hr to complete. Any thoughts, on what could
> be the issue or
> Actually, we don't have that many partitions - there are lot of gaps both in
> days and time events as well.
Your partition description sounded a lot like one of the FAQs from Mithun's
talks, which is why I asked
> I have spark with only one worker (same for HDFS) so running now a standalone
> server but with 25G and 14 cores on that worker.
Which version of Hive was this?
And was the input text file compressed with something like gzip?
Cheers,
Gopal
> I have also noticed that this execution mode is only applicable to single
> predicate search. It does not work with multiple-predicate searches. Can
> someone confirm this please?
Can you explain what you mean?
Vectorization supports multiple & nested AND+OR predicates - with some extra
> set tez.task.resource.memory.mb to a different value than listed in
> tez-site.xml, the query that's run doesn't seem to pick up the setting and
> instead uses the one in the config file.
Why not use the setting Hive uses in the submitted vertex?
set hive.tez.container.size=?
Cheers,
> even that setting is not being applied after the hive shell is started and a
> query is executed.
Are you increasing it or decreasing it?
Tez will reuse existing larger containers, instead of releasing them - reducing
the parameter has almost no effect without a session restart.
Also
> If I have an orc table bucketed and sorted on a column, where does hive keep
> the mapping from column value to bucket? Specifically, if I know the column
> value, and need to find the specific hdfs file, is there an api to do this?
The closest to an API is
> Thanks Gopal. Yeah I'm using CloudBerry. Storage is Azure.
Makes sense, only an object store would have this.
> Are you saying this _0,1,2,3 are directories ?.
No, only the zero size "files".
This is really for compat with regular filesystems.
If you have /tmp/1/foo in an object
> For any insert operation, there will be one Zero bytes file. I would like to
> know importance of this Zero bytes file.
They are directories.
I'm assuming you're using S3A + screenshots from something like Bucket explorer.
These directory entries will not be shown if you do something like
> I want to know whether Beeline can handle HTTP redirect or not. I was
> wondering if some of Beeline experts can answer my question?
Beeline uses the hive-jdbc driver, which is the one actually handling network
connections.
That driver in turn, uses a standard
> My bad. Looks like the thrift server is cycling through various AMs it
> started when the thrift server was started. I think this is different from
> either Hive 2.0.1 or LLAP.
This has roughly been possible since hive-1.0, if you follow any of the
Tez BI tuning guides over the last 4
> We are using a query with union all and groupby and same table is read
> multiple times in the union all subquery.
…
> When run with Mapreduce, the job is run in one stage consuming n mappers and
> m reducers and all union all scans are done with the same job.
The logical plans are identical
> by setting tez.am.mode.session=false in hive-cli and hive-jdbc via
> hive-server2.
That setting does not work if you do "set tez.am.*" parameters (any tez.am
params).
Can you try doing
hive --hiveconf tez.am.mode.session=false
instead of a set; param and see if that works?
Cheers,
> SELECT COUNT(*), COUNT(DISTINCT id) FROM accounts;
…
> 0:01 [8.59M rows, 113MB] [11M rows/s, 146MB/s]
I'm hoping this is not rewriting to the approx_distinct() in Presto.
> I got similar performance with Hive + LLAP too.
This is a logical plan issue, so I don't know if LLAP helps a lot.
A
> I'd like to remember that Hive supports ACID (in a very early stages yet) but
> most often that is a feature that most people don't use for real production
> systems.
Yes, you need ACID to maintain multiple writers correctly.
ACID does have a global primary key (which is not a single
> Is there anyway one can enable both (Kerberos and LDAP with SSL) on Hive?
I believe what you're looking for is Apache Knox SSO. And for LDAP users,
Apache Ranger user-sync handles auto-configuration.
That is how SSL+LDAP+JDBC works in the HD Cloud gateway [1].
There might be a similar
> But on Hue or JDBC interface to Hive Server 2, the following error occurs
> while SELECT querying the view.
You should be getting identical errors for HS2 and CLI, so that suggests you
might be running different CLI and HS2 versions.
> SELECT COUNT(1) FROM pk_test where ds='2017-04-20';
>
Hi,
> java.lang.Exception: java.util.concurrent.ExecutionException:
> java.lang.NoSuchMethodError:
> org.apache.hadoop.tracing.SpanReceiverHost.getInstance(Lorg/apache/hadoop/conf/Configuration;)Lorg/apache/hadoop/tracing/SpanReceiverHost;
There's a good possibility that you've built
> Running Hive 2.2 w/ LLAP enabled (tried the same thing in Hive 2.3 w/ LLAP),
> queries working but when we submit queries like the following (via our
> automated test framework), they just seem to hang with Parsing
> CommandOther queries seem to work fine Any idea on what's going on
> COUNT(DISTINCT monthly_user_id) AS monthly_active_users,
> COUNT(DISTINCT weekly_user_id) AS weekly_active_users,
…
> GROUPING_ID() AS gid,
> COUNT(1) AS dummy
There are two things that prevent Hive from optimizing multiple count distincts.
Another aggregate like a count(1) or a Grouping sets
> ERROR 2017-05-09 22:04:56,469 NetUtil.py:62 - SSLError: Failed to connect.
> Please check openssl library versions.
…
> I am using hive 2.1.0, slider 0.92.0, tez 0.8.5
AFAIK, this was reportedly fixed in 0.92.
https://issues.apache.org/jira/browse/SLIDER-942
I'm not sure if the fix in that
> for the slider 0.92, the patch is already applied, right?
Yes, except it has been refactored to a different place.
https://github.com/apache/incubator-slider/blob/branches/branch-0.92/slider-agent/src/main/python/agent/NetUtil.py#L44
Cheers,
Gopal
> NetUtil.py:60 - [Errno 8] _ssl.c:492: EOF occurred in violation of protocol
The error is directly related to the SSL verification error - TLSv1.0 vs
TLSv1.2.
JDK8 defaults to v1.2 and Python 2.6 defaults to v1.0.
Python 2.7.9 + the patch in 0.92 might be needed to get this to work.
AFAIK,
Hi,
I think this is worth fixing because this seems to be triggered by the data
quality itself - so let me dig in a bit into a couple more scenarios.
> hive.optimize.distinct.rewrite is True by default
FYI, we're tackling the count(1) + count(distinct col) case in the Optimizer
now (which
> 1) both do the same thing.
The start of this thread is the exact opposite - trying to suggest ORC is
better for storage & wanting to use it.
> As it relates the columnar formats, it is silly arms race.
I'm not sure "silly" is the operative word - we've lost a lot of fragmentation
of the
> 1711647 -1032220119
Ok, so this is the hashCode skew issue, probably the one we already know about.
https://github.com/apache/hive/commit/fcc737f729e60bba5a241cf0f607d44f7eac7ca4
String hashcode distribution is much better in master after that. Hopefully
that fixes the distinct speed issue
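For anyone wanting to eyeball their own keys against this: a toy Python reimplementation of Java's `String.hashCode` (the function behind the skew) - only an illustration, not Hive's code:

```python
def java_string_hashcode(s: str) -> int:
    """Toy reimplementation of java.lang.String.hashCode: h = 31*h + ch."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    # wrap to a signed 32-bit int, as Java does
    return h - (1 << 32) if h >= (1 << 31) else h

print(java_string_hashcode("abc"))  # 96354, matching "abc".hashCode() in Java
```

Feeding your real key distribution through this shows how evenly (or not) keys spread across reducers before the fix.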
> I guess I see different things. Having used all the tech. In particular for
> large hive queries I see OOM simply SCANNING THE INPUT of a data directory,
> after 20 seconds!
If you've got an LLAP deployment you're not happy with - this list is the right
place to air your grievances. I
> I kept hearing about vectorization, but later found out it was going to work
> if i used ORC.
Yes, it's a tautology - if you cared about performance, you'd use ORC, because
ORC is the fastest format.
And doing performance work to support folks who don't quite care about it is
not exactly
> It is not that simple. The average Hadoop user has years 6-7 of data. They do
> not have a "magic" convert everything button. They also have legacy processes
> that don't/can't be converted.
…
> They do not want the "fastest format" they want "the fastest hive for their
> data".
I've yet
> We are looking at migrating files(less than 5 Mb of data in total) with
> variable record lengths from a mainframe system to hive.
https://issues.apache.org/jira/browse/HIVE-10856
+
https://github.com/rbheemana/Cobol-to-Hive/
came up on this list a while back.
> Are there other
> SELECT COUNT(DISTINCT ip) FROM table - 71 seconds
> SELECT COUNT(DISTINCT id) FROM table - 12,399 seconds
Ok, I misunderstood your gist.
> While ip is more unique than id, ip runs many times faster than id.
>
> How can I debug this ?
Nearly the same way - just replace "ip" with "id" in my
Hi,
> Does Hive LLAP work with Parquet format as well?
LLAP does work with the Parquet format, but it does not work very fast, because
the java Parquet reader is slow.
https://issues.apache.org/jira/browse/PARQUET-131
+
https://issues.apache.org/jira/browse/HIVE-14826
In particular to
> cast(NULL as bigint) as malone_id,
> cast(NULL as bigint) as zpid,
I ran this on master (with text vectorization off) and I get
20170626123 NULLNULL10
However, I think the backtracking for the columns is broken, somewhere - where
both the nulls
> java.util.concurrent.ExecutionException: java.io.FileNotFoundException:
> /tmp/staging-slider-HHIwk3/lib/tez.tar.gz (Is a directory)
LLAP expects to find a tarball where tez.lib.uris is - looks like you've got a
directory?
Cheers,
Gopal
Hi,
> org.apache.hadoop.hive.ql.parse.SemanticException: Failed to get a spark
> session: org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create
> spark client.
I get inexplicable errors with Hive-on-Spark unless I do a three-step build.
Build Hive first, use that version to
> Are there any frameworks like TPC-DS to benchmark Hive ACID functionality?
Are you trying to work on and improve Hive ACID?
I have a few ACID micro-benchmarks like this
https://github.com/t3rmin4t0r/acid2x-jmh
so that I can test the inner loops of ACID without having any ORC data at all.
> Caused by:
> org.apache.hadoop.hive.ql.exec.mapjoin.MapJoinMemoryExhaustionError:
> VectorMapJoin Hash table loading exceeded memory limits.
> estimatedMemoryUsage: 1644167752 noconditionalTaskSize: 463667612
> inflationFactor: 2.0 threshold: 927335232 effectiveThreshold: 927335232
Most
> Now we need an explanation of "map" -- can you supply it?
The "map" mode runs all tasks with a TableScan operator inside LLAP instances
and all other tasks in Tez YARN containers. This is the LLAP + Tez hybrid mode,
which introduces some complexity in debugging a single query.
The "only"
> Or, is this an artifact of an incompatibility between ORC files written by
> the Hive 2.x ORC serde not being readable by the Hive 1.x ORC serde?
> 3. Is there a difference in the ORC file format spec. at play here?
Nope, we're still defaulting to hive-0.12 format ORC files in Hive-2.x.
We
TL;DR - A Materialized view is a much more useful construct than trying to get
limited indexes to work.
That is a pretty lively project which has been going on for a while with
Druid+LLAP
https://issues.apache.org/jira/browse/HIVE-14486
> This seems out of the blue but my initial benchmarks
> ) t_result where formable = ’t1'
…
> This sql using 29+ hours in 11 computers cluster within 600G memory.
> In my opinion, the time wasting in the `order by sampledate` and `calculate
> the table B’s record`. Is there a setting to avoid `table B`’s record not to
> get ‘avg_wfoy_b2’ column,
Hi,
If you've got the 1st starvation fixed (with Hadoop 2.8 patch), all these
configs + enable log4j2 async logging, you should definitely see a performance
improvement.
Here's the log patches, which need a corresponding LLAP config (& have to be
disabled in HS2, for the progress bar to work)
Hi,
> In our test, we found the shuffle stage of LLAP is very slow. Whether need to
> configure some related shuffle value or not?
Shuffle is the one hit by the 2nd, 3rd and 4th resource starvation issues
listed earlier (FDs, somaxconn & DNS UDP packet loss).
> And we get the following log
Hi,
> With these configurations, the cpu utilization of llap is very low.
Low CPU usage has been observed with LLAP due to RPC starvation.
I'm going to assume that the build you're testing is a raw Hadoop 2.7.3 with no
additional patches?
Hadoop-RPC is single-threaded & has a single mutex
Hi,
> Please help us find whether we use the wrong configuration. Thanks for your
> help.
Since there are no details, I'm not sure what configuration you are discussing
here.
A first step would be to check if LLAP cache is actually being used (the LLAP
IO in the explain), vectorization is
Hi,
> org.apache.hive.jdbc.HiveResultSetMetaData.getTableName(HiveResultSetMetaData.java:102)
https://github.com/apache/hive/blob/master/jdbc/src/java/org/apache/hive/jdbc/HiveResultSetMetaData.java#L102
I don't think this issue is fixed in any release - this probably needs to go
into a
> Why jdbc read them as control symbols?
Most likely this is already fixed by
https://issues.apache.org/jira/browse/HIVE-1608
That pretty much makes the default as
set hive.query.result.fileformat=SequenceFile;
Cheers,
Gopal
> Then I am wondering if the merge statement is impracticable because
> of bad use of myself or because this feature is just not mature enough.
Since you haven't mentioned a Hive version here, I'm going to assume you're
some variant of Hive 1.x & that has some fundamental physical planning
> It is 2.7.3
+
> Error: java.io.IOException: java.lang.RuntimeException: ORC split generation
> failed with exception: java.lang.NoSuchMethodError:
> org.apache.hadoop.fs.FileStatus.compareTo(Lorg/apache/hadoop/fs/FileStatus;)I
> (state=,code=0)
> So transactional tables only work with hdfs. Thanks for the confirmation
> Elliot.
No, that's not what was said.
Streaming ingest into transactional tables requires strong filesystem
consistency and a flush-to-remote operation (hflush).
S3 supports neither of those things and HDFS is not the
> This is Hadoop 3.0.3
> java.lang.NoSuchMethodError:
> org.apache.hadoop.fs.FileStatus.compareTo(Lorg/apache/hadoop/fs/FileStatus;)I
> (state=08S01,code=1)
> Something is missing here! Is this specific to ORC tables?
No, it is a Hadoop BUG.
https://issues.apache.org/jira/browse/HADOOP-1468
> delta_000_000
...
> I am using Glue data catalog as metastore, so should there be any link up to
> these tables from hive?
That would be why transactions are returning as 0 (there is never a transaction
0), because it is not using a Hive standard metastore.
You might not be able to
> We are copying data from upstream system into our storage S3. As part of
> copy, directories along with Zero bytes files are been copied.
Is this exactly the same issue as the previous thread or a different one?
> . I didn't see data skew for that reducer. It has similar amount of
> REDUCE_INPUT_RECORDS as other reducers.
…
> org.apache.hadoop.hive.ql.exec.CommonJoinOperator: table 0 has 8000 rows for
> join key [4092813312923569]
The ratio of REDUCE_INPUT_RECORDS and REDUCE_INPUT_GROUPS is what is
Hi,
> I wanted to understand why hive has a performance issue with using _
> character in queries.
This is somewhat of a missed optimization issue - the "%" impl uses a fast
BoyerMoore algorithm and avoids converting from utf-8 bytes -> String.
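A hedged illustration (not Hive's actual code) of why the two wildcards differ: `%foo%` reduces to a byte-level substring search, while `_` must match exactly one *character*, which forces a utf-8 decode first:

```python
import re

def like_percent_only(haystack: bytes, needle: bytes) -> bool:
    # '%foo%' can be answered without decoding: substring search on raw bytes
    return needle in haystack

def like_with_underscore(value: str, pattern: str) -> bool:
    # '_' matches one character (which may be several bytes in utf-8),
    # so the value must be decoded; translate LIKE to a regex:
    # '_' -> '.', '%' -> '.*', everything else literal
    regex = "".join(".*" if c == "%" else "." if c == "_" else re.escape(c)
                    for c in pattern)
    return re.fullmatch(regex, value, flags=re.DOTALL) is not None

print(like_percent_only("naïve".encode("utf-8"), b"na"))  # True
print(like_with_underscore("naïve", "na_ve"))             # True
```

Note that a naive byte-wise `_` would fail here: 'ï' is two bytes in utf-8, so character positions and byte positions no longer line up.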
> For example, a Hive job may start Tez containers, which then retrieve data
> from LLAP running concurrently. In the current implementation, this is
> unrealistic
That is how LLAP was built - to push work from Tez to LLAP vertex by vertex,
instead of an all-or-nothing implementation.
Here
> However, ideally we wish to manipulate the original query as delivered by the
> user (or as close to it as possible), and we’re finding that the tree has
> been modified significantly by the time it hits the hook
That's CBO. It takes the Query -> AST -> Calcite Tree -> AST -> hook - the
> I am interested in working on a project that takes a large number of Hive
> queries (as well as their meta data like amount of resources used etc) and
> find out common sub queries and expensive query groups etc.
This was roughly the central research topic of one of the Hive CBO devs,
> By the way, if you want near-real-time tables with Hive, maybe you should
> have a look at this project from Uber: https://uber.github.io/hudi/
> I don't know how mature it is yet, but I think it aims at solving that kind
> of challenge.
Depending on your hive setup, you don't need a
A Hive version would help to preface this, because that matters here (e.g.
TEZ-3709 doesn't apply to hive-1.2).
> I’m trying to simply change the format of a very large partitioned table from
> Json to ORC. I’m finding that it is unexpectedly resource intensive,
> primarily due to a
> Will it be referring to orc metadata or it will be loading the whole file and
> then counting the rows.
Depends on the partial-scan setting or if it is computing full column stats
(the full column stats does an nDV, which reads all rows).
hive> analyze table compute statistics ...
> My conclusion is that a query can update some internal states of HiveServer2,
> affecting DAG generation for subsequent queries.
Other than the automatic reoptimization feature, there are two other potential
suspects.
First one would be to disable the in-memory stats cache's variance param,
> Or a simple insert will be automatically sorted as the table DDL mention ?
A simple insert should do the sorting; older versions of Hive had the ability
to disable that (which is a bad thing & therefore these settings are now
hard-coded to =true in Hive 3.x)
-- set
> "TBLPROPERTIES ("orc.compress"="Snappy"); "
That doesn't use the Hadoop SnappyCodec, but uses a pure-java version (which is
slower, but always works).
The Hadoop SnappyCodec needs libsnappy installed on all hosts.
Cheers,
Gopal
> Search ’Total length’ in log sys_dag_xxx, it is 2147483648.
This is the INT_MAX “placeholder” value for uncompacted ACID tables.
This is because with ACIDv1 there is no way to generate splits against
uncompacted files, so this gets “an empty bucket + unknown number of inserts +
updates”
> msck repair table ;
msck repair does not work on ACID tables.
In Hive 2.x, there is no way to move, replicate or rehydrate ACID tables from a
cold store - the only way it works is if you connect to the old metastore.
Cheers,
Gopal
>query the external table using HiveCLI (e.g. SELECT * FROM
>my_external_table), HiveCLI prints out a table with the correct
If the error is always on a "select *", then the issue might be the SerDe's
handling of included columns.
Check what you get for
colNames =
> Because I believe string should be able to handle integer as well.
No, because it is not a lossless conversion. Comparisons are lost.
"9" > "11", but 9 < 11
Even float -> double is lossy (because of epsilon).
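The lossy-comparison point can be checked directly (a trivial sketch):

```python
# lexicographic string order disagrees with numeric order once
# the digit counts differ - which is why the implicit cast matters
print("9" > "11")  # True  ('9' sorts after '1')
print(9 > 11)      # False
```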
You can always apply the Hive workaround suggested, otherwise you might find
Hi,
> on some days parquet was created by hive 2.1.1 and on some days it was
> created by using glue
…
> After some drill down i saw schema of columns inside both type of parquet
> file using parquet tool and found different data types for some column
...
> optional int32 action_date (DATE);
>
> I'm using Hive 1.2.1 with LLAP on HDP 2.6.5. Tez AM is 3GB, there are 3
> daemons for a total of 34816 MB.
Assuming you're using Hive 2 here (with LLAP) - LLAP kinda sucks for ETL
workloads, but this is a different problem.
> PARTITIONED BY (DATAPASSAGGIO string, ORAPASSAGGIO string)
>
Hi,
> Would this also ensure that all the existing data compressed in snappy format
> and the new data stored in zlib format can work in tandem with no disruptions
> or issues to end users who query the table.
Yes.
Each file encodes its own compressor kind & readers use that. The writers
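A hedged sketch of what that change looks like (table name hypothetical): only files written after the ALTER pick up the new codec, while existing Snappy files keep the codec recorded in their own footers.

```sql
-- new writes use ZLIB; old Snappy files stay readable as-is
ALTER TABLE events SET TBLPROPERTIES ("orc.compress"="ZLIB");
```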
> It also shows that the process is consuming more than 30GB. However, it is
> not clear what is causing the process to consume more than 30GB.
The Xmx only applies to the heap size, there's another factor that is usually
ignored which are the network buffers and compression buffers used by
so they're asking "where is the Hive bucketing spec". Is it just to read the
code for that function? They were looking for something more explicit, I think.
Thanks
- Original Message -
From: "Gopal Vijayaraghavan" <gop...@apache
There's more here than Bucketing or Tez.
> PARTITIONED BY(daydate STRING, epoch BIGINT)
> CLUSTERED BY(r_crs_id) INTO 64 BUCKETS
I hope the epoch partition column is actually a day rollup and not 1 partition
for every timestamp.
CLUSTERED BY does not CLUSTER BY, which it should (but it
>* I'm interested in your statement that CLUSTERED BY does not CLUSTER BY.
> My understanding was that this was related to the number of buckets, but you
> are relating it to ORC stripes. It is odd that no examples that I've seen
> include the SORTED BY statement other than in relation to
Hi,
> Caused by: java.lang.ArrayIndexOutOfBoundsException
> at
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1453)
In general HDP specific issues tend to get more attention on HCC, but this is a
pretty old issue stemming from MapReduce being designed for fairly