Hi all,
Due to an Ivy bug (https://issues.apache.org/jira/browse/IVY-899), SBT
tries to download .orbit files instead of .jar files, causing problems. This
bug was fixed in Ivy 2.3.0, but SBT 0.13.1 still uses Ivy 2.0. Aaron
kindly provided a workaround in PR
profile), and why once
the build fails, all JDBC suites fail together.
Working on a patch to fix this. Thanks to Patrick for helping debug this!
On Jul 28, 2014, at 10:07 AM, Cheng Lian l...@databricks.com wrote:
I'm looking into this and will fix it ASAP. Sorry for the inconvenience.
AFAIK, according to a recent talk, the Hulu team in China has built Spark SQL
against Hive 0.13 (or 0.13.1?) successfully. Basically they also
re-packaged Hive 0.13, as the Spark team did. The slides of the talk
haven't been released yet, though.
On Tue, Jul 29, 2014 at 1:01 AM, Ted Yu
, Jul 28, 2014 at 10:16 AM, Cheng Lian lian.cs@gmail.com
wrote:
AFAIK, according to a recent talk, the Hulu team in China has built Spark SQL
against Hive 0.13 (or 0.13.1?) successfully. Basically they also
re-packaged Hive 0.13, as the Spark team did. The slides of the talk
haven't been
It’s also useful to set hive.exec.mode.local.auto to true to accelerate the
test.
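For reference, a minimal sketch of setting this property (the `hiveContext` variable name is an assumption for illustration):

```scala
// Hedged sketch: enable Hive's local-mode auto-optimization to speed up
// small test queries. "hiveContext" is an assumed HiveContext instance.
hiveContext.setConf("hive.exec.mode.local.auto", "true")
```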
On Sat, Aug 2, 2014 at 1:36 AM, Michael Armbrust mich...@databricks.com
wrote:
It seems that the HiveCompatibilitySuite needs a Hadoop and Hive
environment, am I right?
Relative path in absolute URI:
Just opened a PR based on the branch Patrick mentioned for this issue
https://github.com/apache/spark/pull/1864
On Sat, Aug 9, 2014 at 6:48 AM, Patrick Wendell pwend...@gmail.com wrote:
Cheng Lian also has a fix for this. I've asked him to make a PR - he
is on China time so it probably won't
In the long run, as Michael suggested in his Spark Summit 14 talk, we’d like to
implement SQL-92, maybe with the help of Optiq.
On Aug 15, 2014, at 1:13 PM, Cheng, Hao hao.ch...@intel.com wrote:
Actually the SQL parser (another SQL dialect in Spark SQL) is quite weak, and
only supports some
The exception indicates that the forked process isn't executed as
expected, thus the test case *should* fail.
Instead of replacing the exception with a logWarning, capturing and
printing stdout/stderr of the forked process can be helpful for diagnosis.
Currently the only information we have at
://github.com/apache/spark/pull/1856/files
On Tue, Aug 19, 2014 at 2:55 PM, scwf wangf...@huawei.com wrote:
Hi Cheng Lian,
Thanks, printing stdout/stderr of the forked process is more reasonable.
On 2014/8/19 13:35, Cheng Lian wrote:
The exception indicates that the forked process isn't executed
You may start from here
https://github.com/apache/spark/blob/4fa2fda88fc7beebb579ba808e400113b512533b/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L706-L712
.
On Mon, Aug 25, 2014 at 9:05 PM, rapelly kartheek kartheek.m...@gmail.com
wrote:
Hi,
I've exercised multiple
I believe in your case, the “magic” happens in TableReader.fillObject
https://github.com/apache/spark/blob/4fa2fda88fc7beebb579ba808e400113b512533b/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L706-L712.
Here we unwrap the field value according to the object inspector of that
A colon is not allowed in a Windows file name, and I think Git just
cannot create this file while cloning. Remove the colon from the name string
of this test case
You can just start the work :)
On Thu, Aug 28, 2014 at 3:52 PM, Bill Bejeck bbej...@gmail.com wrote:
Hi,
How do I get a starter task jira ticket assigned to myself? Or do I just do
the work and issue a pull request with the associated jira number?
Thanks,
Bill
+1. Tested Spark SQL Thrift server and CLI against a single node standalone
cluster.
On Thu, Aug 28, 2014 at 9:27 PM, Timothy Chen tnac...@gmail.com wrote:
+1 make-distribution works, and also tested simple Spark jobs with Spark
on Mesos on an 8-node Mesos cluster.
Tim
On Thu, Aug 28, 2014 at
Just noticed one thing: although --with-hive is deprecated by -Phive,
make-distribution.sh still relies on $SPARK_HIVE (which was controlled by
--with-hive) to determine whether to include datanucleus jar files. This
means we have to do something like SPARK_HIVE=true ./make-distribution.sh
... to
Yea, SSD + SPARK_PREPEND_CLASSES totally changed my life :)
Maybe we should add a developer notes page to document all this useful
black magic.
On Tue, Sep 2, 2014 at 10:54 AM, Reynold Xin r...@databricks.com wrote:
Having a SSD help tremendously with assembly time.
Without that, you can
On September 2, 2014 at 11:53:49 AM, Cheng Lian (lian.cs@gmail.com)
wrote:
Yea, SSD + SPARK_PREPEND_CLASSES totally changed my life :)
Maybe we should add a developer notes page to document all this useful
black magic.
On Tue, Sep 2, 2014 at 10:54 AM, Reynold Xin r...@databricks.com wrote
Welcome Shane! Glad to see a hero finally jumping in to tame Jenkins
:)
On Tue, Sep 2, 2014 at 12:44 PM, Henry Saputra henry.sapu...@gmail.com
wrote:
Welcome Shane =)
- Henry
On Tue, Sep 2, 2014 at 10:35 AM, shane knapp skn...@berkeley.edu wrote:
so, i had a meeting w/the
+1
- Tested Thrift server and SQL CLI locally on OSX 10.9.
- Checked datanucleus dependencies in distribution tarball built by
make-distribution.sh without SPARK_HIVE defined.
On Tue, Sep 2, 2014 at 2:30 PM, Will Benton wi...@redhat.com wrote:
+1
Tested Scala/MLlib apps on
+1.
Tested locally on OSX 10.9, built with Hadoop 2.4.1
- Checked Datanucleus jar files
- Tested Spark SQL Thrift server and CLI under local mode and standalone
cluster against MySQL backed metastore
On Wed, Sep 3, 2014 at 11:25 AM, Josh Rosen rosenvi...@gmail.com wrote:
+1. Tested on
I don’t think so. For example, we’ve already added extended syntax like CACHE
TABLE.
On Wed, Sep 24, 2014 at 3:27 PM, Yi Tian tianyi.asiai...@gmail.com wrote:
Hi Reynold!
Will Spark SQL strictly obey the HQL syntax?
For example, the cube function.
In other words, the hiveContext of
Would you mind providing the DDL of this partitioned table together
with the query you tried? The stack trace suggests that the query was
trying to cast a map into something else, which is not supported in
Spark SQL. And I doubt whether Hive supports casting a complex type to
some other type.
Since we can easily obtain the list of all changed files in a PR, I think
we can start by adding the no-trailing-space check for newly changed
files only?
On 10/2/14 9:24 AM, Nicholas Chammas wrote:
Yeah, I remember that hell when I added PEP 8 to the build checks and fixed
all the
Hm, seems that 7u71 has come back again. Observed a similar Kinesis
compilation error just now:
https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/410/consoleFull
Checked Jenkins slave nodes, saw /usr/java/latest points to jdk1.7.0_71.
However, /usr/bin/javac -version says:
It's a new pull request builder written by Josh, integrated into our
state-of-the-art PR dashboard :)
On 10/21/14 9:33 PM, Nan Zhu wrote:
just curious…what is this “NewSparkPullRequestBuilder”?
Best,
--
Nan Zhu
On Tuesday, October 21, 2014 at 8:30 AM, Cheng Lian wrote:
Hm, seems that 7u71
Hi Marcelo, yes this is a known Spark SQL bug and we've got PRs to fix it
(#2887, #2967). They're not merged yet because the newly merged Hive 0.13.1
support causes some conflicts. Thanks for reporting this :)
On Tue, Oct 28, 2014 at 6:41 AM, Marcelo Vanzin van...@cloudera.com wrote:
Well, looks like a huge
My two cents for Mac Vim/Emacs users. Fixed a Scala ctags Mac compatibility
bug months ago, and you may want to use the most recent version here
https://github.com/scala/scala-dist/blob/master/tool-support/src/emacs/contrib/dot-ctags
On Tue, Oct 28, 2014 at 4:26 PM, Duy Huynh
Yes, these two combinations work for me.
On 10/29/14 12:32 PM, Zhan Zhang wrote:
-Phive is to enable hive-0.13.1, and "-Phive -Phive-0.12.0" is to enable
hive-0.12.0. Note that the thrift-server is not supported yet in hive-0.13, but
is expected to go upstream soon (SPARK-3720).
Thanks.
Zhan
maven
- which has always been working already.
Do you have instructions for building in IJ?
2014-10-28 21:38 GMT-07:00 Cheng Lian lian.cs@gmail.com
mailto:lian.cs@gmail.com:
Yes, these two combinations work for me.
On 10/29/14 12:32 PM, Zhan Zhang wrote:
-Phive
java...@gmail.com wrote:
I am interested specifically in how to build (and hopefully run/debug..)
under Intellij. Your posts sound like command line maven - which has always
been working already.
Do you have instructions for building in IJ?
2014-10-28 21:38 GMT-07:00 Cheng Lian lian.cs
profiles. I was able to run spark core tests from
within IntelliJ. Didn't try anything beyond that, but FWIW this
worked.
- Patrick
On Tue, Oct 28, 2014 at 9:54 PM, Cheng Lian lian.cs@gmail.com wrote:
You may first open the root pom.xml file in IDEA, and then go to menu View
/ Tool Windows
I often see this when I first build the whole Spark project with SBT, then
modify some code and try to build and debug within IDEA, or vice versa. A
clean rebuild can always solve this.
On Mon, Nov 3, 2014 at 11:28 AM, Patrick Wendell pwend...@gmail.com wrote:
Does this happen if you clean
+1 since this is already the de facto model we are using.
On Thu, Nov 6, 2014 at 12:40 PM, Wangfei (X) wangf...@huawei.com wrote:
+1
Sent from my iPhone
On Nov 5, 2014, at 20:06, Denny Lee denny.g@gmail.com wrote:
+1 great idea.
On Wed, Nov 5, 2014 at 20:04 Xiangrui Meng men...@gmail.com wrote:
Hey Sadhan,
I really don't think this is a Spark log... Unlike Shark, Spark SQL
doesn't even provide a Hive mode to let you execute queries against
Hive. Would you please check whether there is an existing HiveServer2
running there? Spark SQL HiveThriftServer2 is just a Spark port of
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill
job_1414084656759_0142
On Mon, Nov 10, 2014 at 9:59 PM, Cheng Lian lian.cs@gmail.com
mailto:lian.cs@gmail.com
one more question - does that mean that we still
need enough memory in the cluster to uncompress the data before it can
be compressed again or does that just read the raw data as is?
On Wed, Nov 12, 2014 at 10:05 PM, Cheng Lian lian.cs@gmail.com
mailto:lian.cs@gmail.com wrote
+1
Tested HiveThriftServer2 against Hive 0.12.0 on Mac OS X. Known issues
are fixed. Hive version inspection works as expected.
On 11/15/14 8:25 AM, Zach Fry wrote:
+0
I expect to start testing on Monday but won't have enough results to change
my vote from +0
until Monday night or Tuesday
Hey Zhan,
This is a great question. We are also seeking a stable API/protocol
that works with multiple Hive versions (esp. 0.12+). SPARK-4114
https://issues.apache.org/jira/browse/SPARK-4114 was opened for this.
Did some research into HCatalog recently, but I must confess that I’m
not an
I should emphasize that this is still a quick and rough conclusion, and I will
investigate this in more detail after the 1.2.0 release. Anyway, we'd really
like to make Hive support in Spark SQL as smooth and clean as
possible for both developers and end users.
On 11/22/14 11:05 PM, Cheng Lian wrote
It's already fixed in the master branch. Sorry that we forgot to update
this before releasing 1.2.0 and caused you trouble...
Cheng
On 2/2/15 2:03 PM, ankits wrote:
Great, thank you very much. I was confused because this is in the docs:
Actually |SchemaRDD.cache()| behaves exactly the same as |cacheTable|
since Spark 1.2.0. The reason why your web UI didn’t show you the cached
table is that both |cacheTable| and |sql(SELECT ...)| are lazy :-)
Simply add a |.collect()| after the |sql(...)| call.
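A minimal spark-shell sketch of this (the table name is hypothetical):

```scala
// cacheTable and sql(...) are both lazy; only an action materializes the cache.
sqlContext.cacheTable("myTable")                   // nothing is cached yet
val rows = sqlContext.sql("SELECT * FROM myTable") // still lazy
rows.collect()                                     // action: scans the table and populates the cache
```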
Cheng
On 2/2/15 12:23 PM,
Hey Yi,
I'm quite unfamiliar with Hadoop/HDFS auth mechanisms for now, but would
like to investigate this issue later. Would you please open a JIRA for
it? Thanks!
Cheng
On 1/19/15 1:00 AM, Yi Tian wrote:
Is there any way to support multiple users executing SQL on one thrift
server?
I
For the second question, we do plan to support Hive 0.14, possibly in
Spark 1.4.0.
For the first question:
1. In Spark 1.2.0, the Parquet support code doesn’t support timestamp
type, so you can’t.
2. In Spark 1.3.0, timestamp support was added; also, Spark SQL uses its
own Parquet support
Here is a toy |spark-shell| session snippet that can show the memory
consumption difference:
|import org.apache.spark.sql.SQLContext
import sc._
val sqlContext = new SQLContext(sc)
import sqlContext._
setConf("spark.sql.shuffle.partitions", "1")
case class KV(key: Int, value: String)
Hi Aniket,
In general, the schema of all rows in a single table must be the same. This
is a basic assumption made by Spark SQL. Schema union does make sense,
and we're planning to support this for Parquet. But as you've mentioned,
it doesn't help if types of different versions of a column differ
Talked with Yi offline. Personally I think this feature is pretty
useful, the design makes sense, and he's already got a running
prototype.
Yi, would you mind opening a PR for this? Thanks!
Cheng
On 1/6/15 5:25 PM, Yi Tian wrote:
Hi, all
I have created a JIRA ticket about adding a
Oh sorry, I misread your question. I thought you were trying something
like |parquetFile("s3n://file1,hdfs://file2")|. Yeah, it's a valid bug.
Thanks for opening the JIRA ticket and the PR!
Cheng
On 3/16/15 6:39 PM, Cheng Lian wrote:
Hi Pei-Lun,
We intentionally disallowed passing
Hi Pei-Lun,
We intentionally disallowed passing multiple comma-separated paths in
1.3.0. One of the reasons is that users reported that this fails when a file
path contains an actual comma. In your case, you may do something
like this:
|val s3nDF = parquetFile("s3n://...")
val hdfsDF =
It's still marked as 1.2.1 here http://spark.apache.org/docs/latest/
But this page is updated (1.3.0)
http://spark.apache.org/docs/latest/index.html
Cheng
on
the same ShuffledRDD.
I think only shuffle writes, which generate shuffle files, will have a chance
to hit name conflicts; multiple shuffle reads are acceptable, as the code
snippet shows.
Thanks
Jerry
-Original Message-
From: Cheng Lian [mailto:lian.cs@gmail.com]
Sent
I meant using |saveAsParquetFile|. As for the partition number, you can
always control it with the |spark.sql.shuffle.partitions| property.
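For example, a rough sketch (Spark 1.2-era API; the RDD name and output path are hypothetical):

```scala
// Control the number of shuffle partitions, then persist the result as Parquet.
sqlContext.setConf("spark.sql.shuffle.partitions", "200")
processedSchemaRdd.saveAsParquetFile("hdfs:///tmp/processed.parquet")
// Later, reload instead of recomputing:
val reloaded = sqlContext.parquetFile("hdfs:///tmp/processed.parquet")
```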
Cheng
On 2/23/15 1:38 PM, nitin wrote:
I believe calling processedSchemaRdd.persist(DISK) and
processedSchemaRdd.checkpoint() only persists data and I will lose
Yes, recently we improved ParquetRelation2 quite a bit. Spark SQL uses
its own Parquet support to read partitioned Parquet tables declared in
Hive metastore. Only writing to partitioned tables is not covered yet.
These improvements will be included in Spark 1.3.0.
Just created SPARK-5948 to
How about persisting the computed result table first before caching it?
So that you only need to cache the result table after restarting your
service without recomputing it. Somewhat like checkpointing.
Cheng
On 2/22/15 12:55 AM, nitin wrote:
Hi All,
I intend to build a long running spark
Ah, sorry for not being clear enough.
So now in Spark 1.3.0, we have two Parquet support implementations: the
old one is tightly coupled with the Spark SQL framework, while the new
one is based on the data sources API. In both versions, we try to intercept
operations over Parquet tables
My bad, I had once fixed all Hive 12 test failures in PR #4107, but didn't
get time to get it merged.
Considering the release is close, I can cherry-pick those Hive 12 fixes
from #4107 and open a more surgical PR soon.
Cheng
On 2/24/15 4:18 AM, Michael Armbrust wrote:
On Sun, Feb 22, 2015 at
Hi Masaki,
I guess what you saw is the partition number of the last stage, which
must be 1 to perform the global phase of LIMIT. To tune partition number
of normal shuffles like joins, you may resort to
spark.sql.shuffle.partitions.
Cheng
On 2/26/15 5:31 PM, masaki rikitoku wrote:
Hi all
Yes, when a DataFrame is cached in memory, it's stored in an efficient
columnar format. And you can also easily persist it on disk using
Parquet, which is also columnar.
Cheng
On 1/29/15 1:24 PM, Koert Kuipers wrote:
to me the word DataFrame does come with certain expectations. one of them
Forgot to mention that you can find it here
https://github.com/apache/spark/blob/f9e569452e2f0ae69037644170d8aa79ac6b4ccf/sql/core/src/main/scala/org/apache/spark/sql/columnar/InMemoryColumnarTableScan.scala.
On 1/29/15 1:59 PM, Cheng Lian wrote:
Yes, when a DataFrame is cached in memory, it's
Thanks for reporting this! Would you mind opening JIRA tickets for both
Spark and Parquet?
I'm not sure whether Parquet declares anywhere that users mustn't reuse
byte arrays when using the binary type. If it does, then it's a Spark bug.
Anyway, this should be fixed.
Cheng
On 4/12/15 1:50 PM,
I found that, in general, it's a pain to build/run Spark inside IntelliJ IDEA.
I guess most people resort to this approach so that they can leverage
the integrated debugger to debug and/or learn Spark internals. A more
convenient way I've been using recently is the remote debugging
feature. In
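As a sketch, remote debugging usually boils down to starting the JVM with a JDWP agent and attaching IDEA's "Remote" run configuration to it (the port and environment variable usage below are illustrative, not the only way to do it):

```shell
# Illustrative only: suspend the JVM until a debugger attaches on port 5005.
export SPARK_SUBMIT_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005"
# Then create a "Remote" debug configuration in IntelliJ IDEA pointing at localhost:5005
# and launch your spark-submit / spark-shell command as usual.
```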
We only shaded protobuf dependencies because of compatibility issues.
The source code is not modified.
On 6/10/15 1:55 PM, wangtao (A) wrote:
Hi guys,
I see group id of akka used in spark is “org.spark-project.akka”. What
is its difference with the typesafe one? What is its version? And
download from driver and setup classpath
Right?
But somehow, the first step fails.
Even if I can make the first step work (using option 1), it seems that
the classpath on the driver is not correctly set.
Thanks
Dong Lei
*From:*Cheng Lian [mailto:lian.cs@gmail.com]
*Sent:* Thursday, June 11, 2015 2:32
Since the jars are already on HDFS, you can access them directly in your
Spark application without using --jars
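For instance, a hedged sketch of one way to do this (the HDFS path is hypothetical):

```scala
// SparkContext.addJar accepts an HDFS path, so executors can fetch
// the jar directly from HDFS instead of the driver's HTTP file server.
sc.addJar("hdfs:///user/me/libs/my-dep.jar")
```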
Cheng
On 6/11/15 11:04 AM, Dong Lei wrote:
Hi spark-dev:
I cannot use an HDFS location for the "--jars" or "--files" option
when doing a spark-submit in a standalone cluster
the driver will not need to setup a HTTP file server for
this scenario and the worker will fetch the jars and files from HDFS?
Thanks
Dong Lei
*From:*Cheng Lian [mailto:lian.cs@gmail.com]
*Sent:* Thursday, June 11, 2015 12:50 PM
*To:* Dong Lei; dev@spark.apache.org
*Cc:* Dianfei (Keith) Han
*Subject
Hi Gil,
Sorry for the late reply and thanks for raising this question. The file
listing logic in HadoopFsRelation is intentionally made different from
Hadoop FileInputFormat. Here are the reasons:
1. Efficiency: when computing RDD partitions,
FileInputFormat.listStatus() is called on the
Hi all,
The unreleased version 1.6.0 was removed from JIRA due to my
mistake. I've added it back, but JIRA tickets that once targeted
1.6.0 now have an empty target version/s field. If you find tickets that
should target 1.6.0, please help mark the target version/s field
Hey Hyukjin,
Sorry that I missed the JIRA ticket. Thanks for bringing this issue up
here, and for your detailed investigation.
From my side, I think this is a bug in Parquet. Parquet was designed to
support schema evolution. When scanning a Parquet file, if a column exists in
the requested schema but
Yeah, two of the reasons why the built-in in-memory columnar storage
doesn't achieve a compression ratio comparable to Parquet's are:
1. The in-memory columnar representation doesn't handle nested types. So
array/map/struct values are not compressed.
2. Parquet may use more than one kind of
Hi Shane,
I found that Jenkins has been in the status of "Jenkins is going to shut
down" for at least 4 hours (from ~23:30 Dec 9 to 3:45 Dec 10, PDT). Not
sure whether this is part of the schedule or related?
Cheng
On Thu, Dec 10, 2015 at 3:56 AM, shane knapp wrote:
>
+1
On 12/23/15 12:39 PM, Yin Huai wrote:
+1
On Tue, Dec 22, 2015 at 8:10 PM, Denny Lee > wrote:
+1
On Tue, Dec 22, 2015 at 7:05 PM Aaron Davidson > wrote:
+1
On
Hey Pedro,
SQL programming guide is being updated. Here's the PR, but not merged
yet: https://github.com/apache/spark/pull/13592
Cheng
On 6/17/16 9:13 PM, Pedro Rodriguez wrote:
Hi All,
At my workplace we are starting to use Datasets in 1.6.1 and even more
with Spark 2.0 in place of
.
Should I take the discussion to your PR?
Pedro
On Fri, Jun 17, 2016 at 11:12 PM, Cheng Lian <lian.cs@gmail.com
<mailto:lian.cs@gmail.com>> wrote:
Hey Pedro,
SQL programming guide is being updated. Here's the PR, but not
merged yet: https://github.com/apache/spar
Awesome! Congrats and welcome!!
On 2/9/16 2:55 AM, Shixiong(Ryan) Zhu wrote:
Congrats!!! Herman and Wenchen!!!
On Mon, Feb 8, 2016 at 10:44 AM, Luciano Resende > wrote:
On Mon, Feb 8, 2016 at 9:15 AM, Matei Zaharia
Awesome! Congrats and welcome!!
Cheng
On Tue, Feb 9, 2016 at 2:55 AM, Shixiong(Ryan) Zhu
wrote:
> Congrats!!! Herman and Wenchen!!!
>
>
> On Mon, Feb 8, 2016 at 10:44 AM, Luciano Resende
> wrote:
>
>>
>>
>> On Mon, Feb 8, 2016 at 9:15 AM, Matei
Sorry for being late. I'm building a Spark branch based on the most
recent master to test out 1.8.2-rc1, and will post my results here ASAP.
Cheng
On 1/23/17 11:43 AM, Julien Le Dem wrote:
Hi Spark dev,
Here is the voting thread for parquet 1.8.2 release.
Cheng or someone else we would appreciate
?
-- Original --
*From: * "Cheng Lian-3 [via Apache Spark Developers List]";<[hidden
email] >;
*Send time:* Thursday, Feb 23, 2017 9:43 AM
*To:* "Stan Zhai"<[hidden email]
>;
*Subject: * Re: The driver hangs at DataFrame.rdd in Spark 2.1.0
Just from the th
Congratulations!!!
Cheng
On Tue, Oct 4, 2016 at 1:46 PM, Reynold Xin wrote:
> Hi all,
>
> Xiao Li, aka gatorsmile, has recently been elected as an Apache Spark
> committer. Xiao has been a super active contributor to Spark SQL. Congrats
> and welcome, Xiao!
>
> - Reynold
>
JIRA: https://issues.apache.org/jira/browse/SPARK-18403
PR: https://github.com/apache/spark/pull/15845
Will merge it as soon as Jenkins passes.
Cheng
On 11/10/16 11:30 AM, Dongjoon Hyun wrote:
Great! Thank you so much, Cheng!
Bests,
Dongjoon.
On 2016-11-10 11:21 (-0800), Cheng Lian
Hey Dongjoon,
Thanks for reporting. I'm looking into these OOM errors. Already
reproduced them locally but haven't figured out the root cause yet.
Gonna disable them temporarily for now.
Sorry for the inconvenience!
Cheng
On 11/10/16 8:48 AM, Dongjoon Hyun wrote:
Hi, All.
Recently, I
Finished reviewing the list and it LGTM now (left comments in the
spreadsheet and Ryan already made corresponding changes).
Ryan - Thanks a lot for pushing this and making it happen!
Cheng
On 1/6/17 3:46 PM, Ryan Blue wrote:
Last month, there was interest in a Parquet patch release on PR
+1
On 10/12/17 20:10, Liwei Lin wrote:
+1 !
Cheers,
Liwei
On Thu, Oct 12, 2017 at 7:11 PM, vaquar khan > wrote:
+1
Regards,
Vaquar khan
On Oct 11, 2017 10:14 PM, "Weichen Xu"
+1 (binding)
Passed all the tests, looks good.
Cheng
On 2/23/18 15:00, Holden Karau wrote:
+1 (binding)
PySpark artifacts install in a fresh Py3 virtual env
On Feb 23, 2018 7:55 AM, "Denny Lee" > wrote:
+1 (non-binding)
On
Dongjoon, I didn't follow the original Hive 2.3 discussion closely. I
thought the original proposal was to replace Hive 1.2 with Hive 2.3, which
seemed risky, and therefore we only introduced Hive 2.3 under the
hadoop-3.2 profile without removing Hive 1.2. But maybe I'm totally wrong
here...
Similar to Xiao, my major concern about making Hadoop 3.2 the default
Hadoop version is quality control. The current hadoop-3.2 profile covers
too many major component upgrades, i.e.:
- Hadoop 3.2
- Hive 2.3
- JDK 11
We have already found and fixed some feature and performance
Cc Yuming, Steve, and Dongjoon
On Fri, Nov 15, 2019 at 10:37 AM Cheng Lian wrote:
> Similar to Xiao, my major concern about making Hadoop 3.2 the default
> Hadoop version is quality control. The current hadoop-3.2 profile covers
> too many major component upgrades, i.e.:
>
>
.3` pre-built distribution, how do
> you think about this, Sean?
> The preparation is already started in another email thread and I believe
> that is a keystone to prove `Hive 2.3` version stability
> (which Cheng/Hyukjin/you asked).
>
> Bests,
> Dongjoon.
>
>
> On Tue, Nov 19, 20
Dongjoon, I'm with Hyukjin. There should be at least one Spark 3.x minor
release to stabilize Hive 2.3 code paths before retiring the Hive 1.2
fork. Even today, the Hive 2.3.6 version bundled in Spark 3.0 is still
buggy in terms of JDK 11 support. (BTW, I just found that our root POM is
referring
/Hive versions in Spark 3.0, I personally do not
have a preference as long as the above two are met.
On Wed, Nov 20, 2019 at 3:22 PM Cheng Lian wrote:
> Dongjoon, I don't think we have any conflicts here. As stated in other
> threads multiple times, as long as Hive 2.3 and Hadoop 3.2 v
Oh, actually, in order to decouple the Hadoop 3.2 and Hive 2.3 upgrades, we
will need a hive-2.3 profile anyway, whether we keep the hive-1.2 profile
or not.
On Wed, Nov 20, 2019 at 3:33 PM Cheng Lian wrote:
> Just to summarize my points:
>
>1. Let's still keep the Hive 1.2 dependency
n't want to interact with this
> Hive 1.2 fork, they can always use Hive 2.3 at their own risks.
>
> Specifically, what about having a profile `hive-1.2` at `3.0.0` with the
> default Hive 2.3 pom at least?
> How do you think about that way, Cheng?
>
> Bests,
> Dongjoon
adoop-3.2
> profile.
>
> What do you mean by "only meaningful under the hadoop-3.2 profile"?
>
> On Tue, Nov 19, 2019 at 5:40 PM Cheng Lian wrote:
>
>> Hey Steve,
>>
>> In terms of Maven artifact, I don't think the default Hadoop version
>>
Sean, thanks for the corner cases you listed. They make a lot of sense. Now
I'm inclined to have Hive 2.3 as the default version.
Dongjoon, apologies if I didn't make it clear before. What made me
concerned initially was only the following part:
> can we remove the usage of forked `hive` in
Hey Dongjoon and Felix,
I totally agree that Hive 2.3 is more stable than Hive 1.2. Otherwise, we
wouldn't even consider integrating with Hive 2.3 in Spark 3.0.
However, *"Hive" and "Hive integration in Spark" are two quite different
things*, and I don't think anybody has ever mentioned "the
been reluctant to (1) and (2) due to its burden.
>> But, it's time to prepare. Without them, we are going to be insufficient
>> again and again.
>>
>> Bests,
>> Dongjoon.
>>
>>
>>
>>
>> On Tue, Nov 19, 2019 at 9:26 AM Cheng Lian wrote:
>&
It's kinda like a Scala version upgrade. Historically, we only remove
support for an older Scala version when the newer version has proven
stable after one or more Spark minor versions.
On Tue, Nov 19, 2019 at 2:07 PM Cheng Lian wrote:
> Hmm, what exactly did you mean by "remove t
;
>> ------
>> *From:* Steve Loughran
>> *Sent:* Sunday, November 17, 2019 9:22:09 AM
>> *To:* Cheng Lian
>> *Cc:* Sean Owen ; Wenchen Fan ;
>> Dongjoon Hyun ; dev ;
>> Yuming Wang
>> *Subject:* Re: Use Hadoop-3.2 as a