Suggest to workaround the org.eclipse.jetty.orbit problem with SBT 0.13.2-RC1

2014-03-25 Thread Cheng Lian
Hi all, Due to an Ivy bug (https://issues.apache.org/jira/browse/IVY-899), SBT tries to download .orbit files instead of .jar files, causing problems. This bug has been fixed in Ivy 2.3.0, but SBT 0.13.1 still uses Ivy 2.0. Aaron has kindly provided a workaround in PR

Re: new JDBC server test cases seems failed ?

2014-07-28 Thread Cheng Lian
profile), and why once the build fails, all JDBC suites fail together. Working on a patch to fix this. Thanks to Patrick for helping debug this! On Jul 28, 2014, at 10:07 AM, Cheng Lian l...@databricks.com wrote: I’m looking into this and will fix it ASAP, sorry for the inconvenience

Re: Working Formula for Hive 0.13?

2014-07-28 Thread Cheng Lian
AFAIK, according to a recent talk, the Hulu team in China has successfully built Spark SQL against Hive 0.13 (or 0.13.1?). Basically they also re-packaged Hive 0.13 as the Spark team did. The slides of the talk haven't been released yet though. On Tue, Jul 29, 2014 at 1:01 AM, Ted Yu

Re: Working Formula for Hive 0.13?

2014-07-28 Thread Cheng Lian
, Jul 28, 2014 at 10:16 AM, Cheng Lian lian.cs@gmail.com wrote: AFAIK, according to a recent talk, the Hulu team in China has successfully built Spark SQL against Hive 0.13 (or 0.13.1?). Basically they also re-packaged Hive 0.13 as the Spark team did. The slides of the talk haven't been

Re: How to run specific sparkSQL test with maven

2014-08-01 Thread Cheng Lian
It’s also useful to set hive.exec.mode.local.auto to true to accelerate the test. On Sat, Aug 2, 2014 at 1:36 AM, Michael Armbrust mich...@databricks.com wrote: It seems that the HiveCompatibilitySuite needs a Hadoop and Hive environment, am I right? Relative path in absolute URI:

Re: spark-shell is broken! (bad option: '--master')

2014-08-08 Thread Cheng Lian
Just opened a PR based on the branch Patrick mentioned for this issue https://github.com/apache/spark/pull/1864 On Sat, Aug 9, 2014 at 6:48 AM, Patrick Wendell pwend...@gmail.com wrote: Cheng Lian also has a fix for this. I've asked him to make a PR - he is on China time so it probably won't

Re: [sql]enable spark sql cli support spark sql

2014-08-14 Thread Cheng Lian
In the long run, as Michael suggested in his Spark Summit 14 talk, we’d like to implement SQL-92, maybe with the help of Optiq. On Aug 15, 2014, at 1:13 PM, Cheng, Hao hao.ch...@intel.com wrote: Actually the SQL Parser (another SQL dialect in SparkSQL) is quite weak, and only support some

Re: mvn test error

2014-08-18 Thread Cheng Lian
The exception indicates that the forked process didn’t execute as expected, so the test case *should* fail. Instead of replacing the exception with a logWarning, capturing and printing the stdout/stderr of the forked process would be helpful for diagnosis. Currently the only information we have at

Re: mvn test error

2014-08-19 Thread Cheng Lian
://github.com/apache/spark/pull/1856/files On Tue, Aug 19, 2014 at 2:55 PM, scwf wangf...@huawei.com wrote: hi,Cheng Lian thanks, printing stdout/stderr of the forked process is more reasonable. On 2014/8/19 13:35, Cheng Lian wrote: The exception indicates that the forked process doesn’t executed

Re: RDD replication in Spark

2014-08-27 Thread Cheng Lian
You may start from here https://github.com/apache/spark/blob/4fa2fda88fc7beebb579ba808e400113b512533b/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L706-L712. On Mon, Aug 25, 2014 at 9:05 PM, rapelly kartheek kartheek.m...@gmail.com wrote: Hi, I've exercised multiple

Re: HiveContext, schemaRDD.printSchema get different dataTypes, feature or a bug? really strange and surprised...

2014-08-27 Thread Cheng Lian
I believe in your case, the “magic” happens in TableReader.fillObject https://github.com/apache/spark/blob/4fa2fda88fc7beebb579ba808e400113b512533b/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L706-L712. Here we unwrap the field value according to the object inspector of that

Re: deleted: sql/hive/src/test/resources/golden/case sensitivity on windows

2014-08-28 Thread Cheng Lian
A colon is not allowed in a Windows file name, and I think Git just cannot create this file while cloning. Remove the colon in the name string of this test case

Re: Jira tickets for starter tasks

2014-08-28 Thread Cheng Lian
You can just start the work :) On Thu, Aug 28, 2014 at 3:52 PM, Bill Bejeck bbej...@gmail.com wrote: Hi, How do I get a starter task jira ticket assigned to myself? Or do I just do the work and issue a pull request with the associated jira number? Thanks, Bill

Re: [VOTE] Release Apache Spark 1.1.0 (RC2)

2014-08-28 Thread Cheng Lian
+1. Tested Spark SQL Thrift server and CLI against a single-node standalone cluster. On Thu, Aug 28, 2014 at 9:27 PM, Timothy Chen tnac...@gmail.com wrote: +1 make-distribution works, and also tested simple spark jobs on Spark on Mesos on 8 node Mesos cluster. Tim On Thu, Aug 28, 2014 at

Re: [VOTE] Release Apache Spark 1.1.0 (RC2)

2014-08-29 Thread Cheng Lian
Just noticed one thing: although --with-hive is deprecated by -Phive, make-distribution.sh still relies on $SPARK_HIVE (which was controlled by --with-hive) to determine whether to include datanucleus jar files. This means we have to do something like SPARK_HIVE=true ./make-distribution.sh ... to

Re: about spark assembly jar

2014-09-02 Thread Cheng Lian
Yea, SSD + SPARK_PREPEND_CLASSES totally changed my life :) Maybe we should add a developer notes page to document all these useful black magic. On Tue, Sep 2, 2014 at 10:54 AM, Reynold Xin r...@databricks.com wrote: Having a SSD help tremendously with assembly time. Without that, you can

Re: about spark assembly jar

2014-09-02 Thread Cheng Lian
On September 2, 2014 at 11:53:49 AM, Cheng Lian (lian.cs@gmail.com) wrote: Yea, SSD + SPARK_PREPEND_CLASSES totally changed my life :) Maybe we should add a developer notes page to document all these useful black magic. On Tue, Sep 2, 2014 at 10:54 AM, Reynold Xin r...@databricks.com wrote

Re: hey spark developers! intro from shane knapp, devops engineer @ AMPLab

2014-09-02 Thread Cheng Lian
Welcome Shane! Glad to see that finally a hero jumping out to tame Jenkins :) On Tue, Sep 2, 2014 at 12:44 PM, Henry Saputra henry.sapu...@gmail.com wrote: Welcome Shane =) - Henry On Tue, Sep 2, 2014 at 10:35 AM, shane knapp skn...@berkeley.edu wrote: so, i had a meeting w/the

Re: [VOTE] Release Apache Spark 1.1.0 (RC3)

2014-09-02 Thread Cheng Lian
+1 - Tested Thrift server and SQL CLI locally on OSX 10.9. - Checked datanucleus dependencies in distribution tarball built by make-distribution.sh without SPARK_HIVE defined. On Tue, Sep 2, 2014 at 2:30 PM, Will Benton wi...@redhat.com wrote: +1 Tested Scala/MLlib apps on

Re: [VOTE] Release Apache Spark 1.1.0 (RC4)

2014-09-03 Thread Cheng Lian
+1. Tested locally on OSX 10.9, built with Hadoop 2.4.1 - Checked Datanucleus jar files - Tested Spark SQL Thrift server and CLI under local mode and standalone cluster against MySQL backed metastore On Wed, Sep 3, 2014 at 11:25 AM, Josh Rosen rosenvi...@gmail.com wrote: +1. Tested on

Re: Question about SparkSQL and Hive-on-Spark

2014-09-24 Thread Cheng Lian
I don’t think so. For example, we’ve already added extended syntax like CACHE TABLE. On Wed, Sep 24, 2014 at 3:27 PM, Yi Tian tianyi.asiai...@gmail.com wrote: Hi Reynold! Will sparkSQL strictly obey the HQL syntax ? For example, the cube function. In other words, the hiveContext of
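
As an aside, the CACHE TABLE extension mentioned above can be sketched against the Spark 1.x HiveContext API. The table name and context setup below are hypothetical illustrations, not taken from the thread:

```scala
import org.apache.spark.sql.hive.HiveContext

// Hypothetical setup; assumes an existing SparkContext `sc`.
val hiveContext = new HiveContext(sc)

// CACHE TABLE is a Spark SQL extension on top of the HQL grammar.
hiveContext.sql("CACHE TABLE my_table")

// Subsequent scans of `my_table` read from the in-memory columnar cache.
hiveContext.sql("SELECT COUNT(*) FROM my_table").collect()
```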

Re: SparkSQL: map type MatchError when inserting into Hive table

2014-09-26 Thread Cheng Lian
Would you mind providing the DDL of this partitioned table together with the query you tried? The stacktrace suggests that the query was trying to cast a map into something else, which is not supported in Spark SQL. And I doubt whether Hive supports casting a complex type to some other type.

Re: Extending Scala style checks

2014-10-01 Thread Cheng Lian
Since we can easily catch the list of all changed files in a PR, I think we can start with adding the no trailing space check for newly changed files only? On 10/2/14 9:24 AM, Nicholas Chammas wrote: Yeah, I remember that hell when I added PEP 8 to the build checks and fixed all the

Re: something wrong with Jenkins or something untested merged?

2014-10-21 Thread Cheng Lian
Hm, seems that 7u71 comes back again. Observed similar Kinesis compilation error just now: https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/410/consoleFull Checked Jenkins slave nodes, saw /usr/java/latest points to jdk1.7.0_71. However, /usr/bin/javac -version says:

Re: something wrong with Jenkins or something untested merged?

2014-10-21 Thread Cheng Lian
It's a new pull request builder written by Josh, integrated into our state-of-the-art PR dashboard :) On 10/21/14 9:33 PM, Nan Zhu wrote: just curious…what is this “NewSparkPullRequestBuilder”? Best, -- Nan Zhu On Tuesday, October 21, 2014 at 8:30 AM, Cheng Lian wrote: Hm, seems that 7u71

Re: HiveContext bug?

2014-10-28 Thread Cheng Lian
Hi Marcelo, yes this is a known Spark SQL bug and we've got PRs to fix it (2887 2967). Not merged yet because newly merged Hive 0.13.1 support causes some conflicts. Thanks for reporting this :) On Tue, Oct 28, 2014 at 6:41 AM, Marcelo Vanzin van...@cloudera.com wrote: Well, looks like a huge

Re: best IDE for scala + spark development?

2014-10-28 Thread Cheng Lian
My two cents for Mac Vim/Emacs users. Fixed a Scala ctags Mac compatibility bug months ago, and you may want to use the most recent version here https://github.com/scala/scala-dist/blob/master/tool-support/src/emacs/contrib/dot-ctags On Tue, Oct 28, 2014 at 4:26 PM, Duy Huynh

Re: HiveShim not found when building in Intellij

2014-10-28 Thread Cheng Lian
Yes, these two combinations work for me. On 10/29/14 12:32 PM, Zhan Zhang wrote: “-Phive” is to enable hive-0.13.1 and “-Phive -Phive-0.12.0” is to enable hive-0.12.0. Note that the thrift-server is not supported yet in hive-0.13, but is expected to go upstream soon (SPARK-3720). Thanks. Zhan

Re: HiveShim not found when building in Intellij

2014-10-28 Thread Cheng Lian
maven - which has always been working already. Do you have instructions for building in IJ? 2014-10-28 21:38 GMT-07:00 Cheng Lian lian.cs@gmail.com mailto:lian.cs@gmail.com: Yes, these two combinations work for me. On 10/29/14 12:32 PM, Zhan Zhang wrote: -Phive

Re: HiveShim not found when building in Intellij

2014-10-28 Thread Cheng Lian
java...@gmail.com wrote: I am interested specifically in how to build (and hopefully run/debug..) under Intellij. Your posts sound like command line maven - which has always been working already. Do you have instructions for building in IJ? 2014-10-28 21:38 GMT-07:00 Cheng Lian lian.cs

Re: HiveShim not found when building in Intellij

2014-10-28 Thread Cheng Lian
profiles. I was able to run spark core tests from within IntelliJ. Didn't try anything beyond that, but FWIW this worked. - Patrick On Tue, Oct 28, 2014 at 9:54 PM, Cheng Lian lian.cs@gmail.com wrote: You may first open the root pom.xml file in IDEA, and then go for menu View / Tool Windows

Re: sbt scala compiler crashes on spark-sql

2014-11-02 Thread Cheng Lian
I often see this when I first build the whole Spark project with SBT, then modify some code and try to build and debug within IDEA, or vice versa. A clean rebuild always solves this. On Mon, Nov 3, 2014 at 11:28 AM, Patrick Wendell pwend...@gmail.com wrote: Does this happen if you clean

Re: [VOTE] Designating maintainers for some Spark components

2014-11-05 Thread Cheng Lian
+1 since this is already the de facto model we are using. On Thu, Nov 6, 2014 at 12:40 PM, Wangfei (X) wangf...@huawei.com wrote: +1 发自我的 iPhone 在 2014年11月5日,20:06,Denny Lee denny.g@gmail.com 写道: +1 great idea. On Wed, Nov 5, 2014 at 20:04 Xiangrui Meng men...@gmail.com wrote:

Re: thrift jdbc server probably running queries as hive query

2014-11-10 Thread Cheng Lian
Hey Sadhan, I really don't think this is a Spark log... Unlike Shark, Spark SQL doesn't even provide a Hive mode to let you execute queries against Hive. Would you please check whether there is an existing HiveServer2 running there? Spark SQL HiveThriftServer2 is just a Spark port of

Re: thrift jdbc server probably running queries as hive query

2014-11-11 Thread Cheng Lian
.amazonaws.com%3A8100%2Fproxy%2Fapplication_1414084656759_0142%2Fsi=6222577584832512pi=626685a9-b628-43cc-91a1-93636171ce77 Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_1414084656759_0142 On Mon, Nov 10, 2014 at 9:59 PM, Cheng Lian lian.cs@gmail.com mailto:lian.cs@gmail.com

Re: Cache sparkSql data without uncompressing it in memory

2014-11-13 Thread Cheng Lian
one more question - does that mean that we still need enough memory in the cluster to uncompress the data before it can be compressed again or does that just read the raw data as is? On Wed, Nov 12, 2014 at 10:05 PM, Cheng Lian lian.cs@gmail.com mailto:lian.cs@gmail.com wrote

Re: [VOTE] Release Apache Spark 1.1.1 (RC1)

2014-11-14 Thread Cheng Lian
+1 Tested HiveThriftServer2 against Hive 0.12.0 on Mac OS X. Known issues are fixed. Hive version inspection works as expected. On 11/15/14 8:25 AM, Zach Fry wrote: +0 I expect to start testing on Monday but won't have enough results to change my vote from +0 until Monday night or Tuesday

Re: How spark and hive integrate in long term?

2014-11-22 Thread Cheng Lian
Hey Zhan, This is a great question. We are also seeking a stable API/protocol that works with multiple Hive versions (esp. 0.12+). SPARK-4114 https://issues.apache.org/jira/browse/SPARK-4114 was opened for this. Did some research into HCatalog recently, but I must confess that I’m not an

Re: How spark and hive integrate in long term?

2014-11-22 Thread Cheng Lian
Should emphasize that this is still a quick and rough conclusion; will investigate this in more detail after the 1.2.0 release. Anyway, we’d really like to make Hive support in Spark SQL as smooth and clean as possible for both developers and end users. On 11/22/14 11:05 PM, Cheng Lian wrote

Re: Get size of rdd in memory

2015-02-02 Thread Cheng Lian
It's already fixed in the master branch. Sorry that we forgot to update this before releasing 1.2.0 and caused you trouble... Cheng On 2/2/15 2:03 PM, ankits wrote: Great, thank you very much. I was confused because this is in the docs:

Re: Get size of rdd in memory

2015-02-02 Thread Cheng Lian
Actually SchemaRDD.cache() behaves exactly the same as cacheTable since Spark 1.2.0. The reason why your web UI didn’t show the cached table is that both cacheTable and sql("SELECT ...") are lazy :-) Simply add a .collect() after the sql(...) call. Cheng On 2/2/15 12:23 PM,
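
The laziness described above can be sketched as follows. This is a minimal sketch against the Spark 1.2-era API; the SparkContext `sc` and the table name are assumptions, not from the thread:

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc) // assumes an existing SparkContext `sc`

// Both calls below are lazy: nothing is cached yet, so the web UI's
// Storage tab still shows nothing at this point.
sqlContext.cacheTable("records")
val result = sqlContext.sql("SELECT * FROM records")

// Forcing evaluation materializes the in-memory columnar cache.
result.collect()
```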

Re: Is there any way to support multiple users executing SQL on thrift server?

2015-01-20 Thread Cheng Lian
Hey Yi, I'm quite unfamiliar with Hadoop/HDFS auth mechanisms for now, but would like to investigate this issue later. Would you please open a JIRA for it? Thanks! Cheng On 1/19/15 1:00 AM, Yi Tian wrote: Is there any way to support multiple users executing SQL on one thrift server? I

Re: Spark SQL, Hive Parquet data types

2015-02-20 Thread Cheng Lian
For the second question, we do plan to support Hive 0.14, possibly in Spark 1.4.0. For the first question: 1. In Spark 1.2.0, the Parquet support code doesn’t support timestamp type, so you can’t. 2. In Spark 1.3.0, timestamp support was added, also Spark SQL uses its own Parquet support

Re: Get size of rdd in memory

2015-01-30 Thread Cheng Lian
Here is a toy spark-shell session snippet that can show the memory consumption difference:

import org.apache.spark.sql.SQLContext
import sc._
val sqlContext = new SQLContext(sc)
import sqlContext._
setConf("spark.sql.shuffle.partitions", "1")
case class KV(key: Int, value: String)

Re: Data source API | Support for dynamic schema

2015-01-28 Thread Cheng Lian
Hi Aniket, In general the schema of all rows in a single table must be the same. This is a basic assumption made by Spark SQL. Schema union does make sense, and we're planning to support this for Parquet. But as you've mentioned, it doesn't help if the types of different versions of a column differ

Re: [SPARK-5100][SQL] Spark Thrift server monitor page

2015-01-06 Thread Cheng Lian
Talked with Yi offline. Personally I think this feature is pretty useful, the design makes sense, and he's already got a running prototype. Yi, would you mind opening a PR for this? Thanks! Cheng On 1/6/15 5:25 PM, Yi Tian wrote: Hi, all I have created a JIRA ticket about adding a

Re: SparkSQL 1.3.0 cannot read parquet files from different file system

2015-03-16 Thread Cheng Lian
Oh sorry, I misread your question. I thought you were trying something like parquetFile("s3n://file1,hdfs://file2"). Yeah, it’s a valid bug. Thanks for opening the JIRA ticket and the PR! Cheng On 3/16/15 6:39 PM, Cheng Lian wrote: Hi Pei-Lun, We intentionally disallowed passing

Re: SparkSQL 1.3.0 cannot read parquet files from different file system

2015-03-16 Thread Cheng Lian
Hi Pei-Lun, We intentionally disallowed passing multiple comma-separated paths in 1.3.0. One of the reasons is that users reported that this fails when a file path contains an actual comma. In your case, you may do something like this: val s3nDF = parquetFile("s3n://...") val hdfsDF =
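
The snippet above is cut off mid-code; presumably the suggested workaround is to load each file system separately and union the results. A hypothetical completion (the context name and paths are illustrative assumptions):

```scala
// Assumes a Spark 1.3-era SQLContext named `sqlContext`.
val s3nDF  = sqlContext.parquetFile("s3n://bucket/path")    // hypothetical path
val hdfsDF = sqlContext.parquetFile("hdfs://namenode/path") // hypothetical path

// Combine the two DataFrames; requires both to share the same schema.
val combined = s3nDF.unionAll(hdfsDF)
```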

Wrong version on the Spark documentation page

2015-03-15 Thread Cheng Lian
It's still marked as 1.2.1 here http://spark.apache.org/docs/latest/ But this page is updated (1.3.0) http://spark.apache.org/docs/latest/index.html Cheng - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For

Re: Understanding shuffle file name conflicts

2015-03-25 Thread Cheng Lian
on the same ShuffledRDD. I think only shuffle write which generates shuffle files will have chance to meet name conflicts, multiple times of shuffle read is acceptable as the code snippet shows. Thanks Jerry -Original Message- From: Cheng Lian [mailto:lian.cs@gmail.com] Sent

Re: Spark SQL - Long running job

2015-02-23 Thread Cheng Lian
I meant using saveAsParquetFile. As for the partition number, you can always control it with the spark.sql.shuffle.partitions property. Cheng On 2/23/15 1:38 PM, nitin wrote: I believe calling processedSchemaRdd.persist(DISK) and processedSchemaRdd.checkpoint() only persists data and I will lose
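
A sketch combining the two suggestions above, i.e. persisting the result as Parquet and tuning the shuffle partition number. This uses the Spark 1.2-era SchemaRDD API; the context, variable, table, and path names are hypothetical:

```scala
// Number of partitions used by shuffles (joins, aggregations).
sqlContext.setConf("spark.sql.shuffle.partitions", "200")

// Persist the computed result as Parquet so that a restarted service
// can reload it without recomputation, similar to checkpointing.
processedSchemaRdd.saveAsParquetFile("hdfs:///warehouse/processed")

// After a restart: reload, re-register, and cache the saved result.
val reloaded = sqlContext.parquetFile("hdfs:///warehouse/processed")
reloaded.registerTempTable("processed")
sqlContext.cacheTable("processed")
```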

Re: Spark SQL, Hive Parquet data types

2015-02-23 Thread Cheng Lian
Yes, recently we improved ParquetRelation2 quite a bit. Spark SQL uses its own Parquet support to read partitioned Parquet tables declared in Hive metastore. Only writing to partitioned tables is not covered yet. These improvements will be included in Spark 1.3.0. Just created SPARK-5948 to

Re: Spark SQL - Long running job

2015-02-22 Thread Cheng Lian
How about persisting the computed result table first before caching it? So that you only need to cache the result table after restarting your service without recomputing it. Somewhat like checkpointing. Cheng On 2/22/15 12:55 AM, nitin wrote: Hi All, I intend to build a long running spark

Re: Spark SQL, Hive Parquet data types

2015-02-23 Thread Cheng Lian
Ah, sorry for not being clear enough. So now in Spark 1.3.0, we have two Parquet support implementations, the old one is tightly coupled with the Spark SQL framework, while the new one is based on data sources API. In both versions, we try to intercept operations over Parquet tables

Re: [VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-23 Thread Cheng Lian
My bad, I had once fixed all Hive 12 test failures in PR #4107 but didn't get time to get it merged. Considering the release is close, I can cherry-pick those Hive 12 fixes from #4107 and open a more surgical PR soon. Cheng On 2/24/15 4:18 AM, Michael Armbrust wrote: On Sun, Feb 22, 2015 at

Re: number of partitions for hive schemaRDD

2015-02-26 Thread Cheng Lian
Hi Masaki, I guess what you saw is the partition number of the last stage, which must be 1 to perform the global phase of LIMIT. To tune partition number of normal shuffles like joins, you may resort to spark.sql.shuffle.partitions. Cheng On 2/26/15 5:31 PM, masaki rikitoku wrote: Hi all

Re: renaming SchemaRDD - DataFrame

2015-01-29 Thread Cheng Lian
Yes, when a DataFrame is cached in memory, it's stored in an efficient columnar format. And you can also easily persist it on disk using Parquet, which is also columnar. Cheng On 1/29/15 1:24 PM, Koert Kuipers wrote: to me the word DataFrame does come with certain expectations. one of them

Re: renaming SchemaRDD - DataFrame

2015-01-29 Thread Cheng Lian
Forgot to mention that you can find it here https://github.com/apache/spark/blob/f9e569452e2f0ae69037644170d8aa79ac6b4ccf/sql/core/src/main/scala/org/apache/spark/sql/columnar/InMemoryColumnarTableScan.scala. On 1/29/15 1:59 PM, Cheng Lian wrote: Yes, when a DataFrame is cached in memory, it's

Re: Parquet File Binary column statistics error when reuse byte[] among rows

2015-04-12 Thread Cheng Lian
Thanks for reporting this! Would you mind opening JIRA tickets for both Spark and Parquet? I'm not sure whether Parquet declares anywhere that the user mustn't reuse byte arrays when using the binary type. If it does, then it's a Spark bug. Either way, this should be fixed. Cheng On 4/12/15 1:50 PM,

Re: IntelliJ Runtime error

2015-04-04 Thread Cheng Lian
I found that, in general, it's a pain to build/run Spark inside IntelliJ IDEA. I guess most people resort to this approach so that they can leverage the integrated debugger to debug and/or learn Spark internals. A more convenient way I've been using recently is the remote debugging feature. In

Re: About akka used in spark

2015-06-10 Thread Cheng Lian
We only shaded protobuf dependencies because of compatibility issues. The source code is not modified. On 6/10/15 1:55 PM, wangtao (A) wrote: Hi guys, I see group id of akka used in spark is “org.spark-project.akka”. What is its difference with the typesafe one? What is its version? And

Re: How to support dependency jars and files on HDFS in standalone cluster mode?

2015-06-12 Thread Cheng Lian
download from driver and setup classpath Right? But somehow, the first step fails. Even if I can make the first step works(use option1), it seems that the classpath in driver is not correctly set. Thanks Dong Lei *From:*Cheng Lian [mailto:lian.cs@gmail.com] *Sent:* Thursday, June 11, 2015 2:32

Re: How to support dependency jars and files on HDFS in standalone cluster mode?

2015-06-10 Thread Cheng Lian
Since the jars are already on HDFS, you can access them directly in your Spark application without using --jars Cheng On 6/11/15 11:04 AM, Dong Lei wrote: Hi spark-dev: I can not use a hdfs location for the “--jars” or “--files” option when doing a spark-submit in a standalone cluster
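
A sketch of referencing a dependency jar on HDFS directly from application code instead of through --jars. The namenode address and jar path are hypothetical:

```scala
// SparkContext.addJar accepts hdfs:// URIs; executors fetch the jar
// from HDFS directly instead of going through the driver's file server.
sc.addJar("hdfs://namenode:8020/deps/my-dep.jar")
```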

Re: How to support dependency jars and files on HDFS in standalone cluster mode?

2015-06-11 Thread Cheng Lian
the driver will not need to setup a HTTP file server for this scenario and the worker will fetch the jars and files from HDFS? Thanks Dong Lei *From:*Cheng Lian [mailto:lian.cs@gmail.com] *Sent:* Thursday, June 11, 2015 12:50 PM *To:* Dong Lei; dev@spark.apache.org *Cc:* Dianfei (Keith) Han *Subject

Re: possible issues with listing objects in the HadoopFSrelation

2015-08-12 Thread Cheng Lian
Hi Gil, Sorry for the late reply and thanks for raising this question. The file listing logic in HadoopFsRelation is intentionally made different from Hadoop FileInputFormat. Here are the reasons: 1. Efficiency: when computing RDD partitions, FileInputFormat.listStatus() is called on the

Deleted unreleased version 1.6.0 from JIRA by mistake

2015-07-22 Thread Cheng Lian
Hi all, The unreleased version 1.6.0 was removed from JIRA due to my misoperation. I've added it back, but JIRA tickets that once targeted 1.6.0 now have an empty target version/s field. If you find tickets that should have targeted 1.6.0, please help mark the target version/s field

Re: Filter applied on merged Parquet shemsa with new column fails.

2015-10-28 Thread Cheng Lian
Hey Hyukjin, Sorry that I missed the JIRA ticket, and thanks for bringing this issue up here and for your detailed investigation. From my side, I think this is a Parquet bug. Parquet was designed to support schema evolution. When scanning a Parquet file, if a column exists in the requested schema but

Re: [ compress in-memory column storage used in sparksql cache table ]

2015-09-02 Thread Cheng Lian
Yeah, two of the reasons why the built-in in-memory columnar storage doesn't achieve a compression ratio comparable to Parquet's are: 1. The in-memory columnar representation doesn't handle nested types, so array/map/struct values are not compressed. 2. Parquet may use more than one kind of

Re: [build system] jenkins downtime, thursday 12/10/15 7am PDT

2015-12-10 Thread Cheng Lian
Hi Shane, I found that Jenkins has been in the status of "Jenkins is going to shut down" for at least 4 hours (from ~23:30 Dec 9 to 3:45 Dec 10, PDT). Not sure whether this is part of the schedule or related? Cheng On Thu, Dec 10, 2015 at 3:56 AM, shane knapp wrote: >

Re: [VOTE] Release Apache Spark 1.6.0 (RC4)

2015-12-26 Thread Cheng Lian
+1 On 12/23/15 12:39 PM, Yin Huai wrote: +1 On Tue, Dec 22, 2015 at 8:10 PM, Denny Lee > wrote: +1 On Tue, Dec 22, 2015 at 7:05 PM Aaron Davidson > wrote: +1 On

Re: Spark 2.0 Dataset Documentation

2016-06-17 Thread Cheng Lian
Hey Pedro, SQL programming guide is being updated. Here's the PR, but not merged yet: https://github.com/apache/spark/pull/13592 Cheng On 6/17/16 9:13 PM, Pedro Rodriguez wrote: Hi All, At my workplace we are starting to use Datasets in 1.6.1 and even more with Spark 2.0 in place of

Re: Spark 2.0 Dataset Documentation

2016-06-17 Thread Cheng Lian
. Should I take discussion to your PR? Pedro On Fri, Jun 17, 2016 at 11:12 PM, Cheng Lian <lian.cs@gmail.com <mailto:lian.cs@gmail.com>> wrote: Hey Pedro, SQL programming guide is being updated. Here's the PR, but not merged yet: https://github.com/apache/spar

Re: Welcoming two new committers

2016-02-17 Thread Cheng Lian
Awesome! Congrats and welcome!! On 2/9/16 2:55 AM, Shixiong(Ryan) Zhu wrote: Congrats!!! Herman and Wenchen!!! On Mon, Feb 8, 2016 at 10:44 AM, Luciano Resende > wrote: On Mon, Feb 8, 2016 at 9:15 AM, Matei Zaharia

Re: Welcoming two new committers

2016-02-17 Thread Cheng Lian
Awesome! Congrats and welcome!! Cheng On Tue, Feb 9, 2016 at 2:55 AM, Shixiong(Ryan) Zhu wrote: > Congrats!!! Herman and Wenchen!!! > > > On Mon, Feb 8, 2016 at 10:44 AM, Luciano Resende > wrote: > >> >> >> On Mon, Feb 8, 2016 at 9:15 AM, Matei

Re: [VOTE] Release Apache Parquet 1.8.2 RC1

2017-01-23 Thread Cheng Lian
Sorry for being late, I'm building a Spark branch based on the most recent master to test out 1.8.2-rc1, will post my result here ASAP. Cheng On 1/23/17 11:43 AM, Julien Le Dem wrote: Hi Spark dev, Here is the voting thread for parquet 1.8.2 release. Cheng or someone else we would appreciate

Re: The driver hangs at DataFrame.rdd in Spark 2.1.0

2017-02-23 Thread Cheng Lian
? -- Original -- *From: * "Cheng Lian-3 [via Apache Spark Developers List]";<[hidden email] >; *Send time:* Thursday, Feb 23, 2017 9:43 AM *To:* "Stan Zhai"<[hidden email] >; *Subject: * Re: The driver hangs at DataFrame.rdd in Spark 2.1.0 Just from the th

Re: welcoming Xiao Li as a committer

2016-10-04 Thread Cheng Lian
Congratulations!!! Cheng On Tue, Oct 4, 2016 at 1:46 PM, Reynold Xin wrote: > Hi all, > > Xiao Li, aka gatorsmile, has recently been elected as an Apache Spark > committer. Xiao has been a super active contributor to Spark SQL. Congrats > and welcome, Xiao! > > - Reynold >

Re: Is `randomized aggregation test` testsuite stable?

2016-11-10 Thread Cheng Lian
JIRA: https://issues.apache.org/jira/browse/SPARK-18403 PR: https://github.com/apache/spark/pull/15845 Will merge it as soon as Jenkins passes. Cheng On 11/10/16 11:30 AM, Dongjoon Hyun wrote: Great! Thank you so much, Cheng! Bests, Dongjoon. On 2016-11-10 11:21 (-0800), Cheng Lian

Re: Is `randomized aggregation test` testsuite stable?

2016-11-10 Thread Cheng Lian
Hey Dongjoon, Thanks for reporting. I'm looking into these OOM errors. Already reproduced them locally but haven't figured out the root cause yet. Gonna disable them temporarily for now. Sorry for the inconvenience! Cheng On 11/10/16 8:48 AM, Dongjoon Hyun wrote: Hi, All. Recently, I

Re: Parquet patch release

2017-01-09 Thread Cheng Lian
Finished reviewing the list and it LGTM now (left comments in the spreadsheet and Ryan already made corresponding changes). Ryan - Thanks a lot for pushing this and making it happen! Cheng On 1/6/17 3:46 PM, Ryan Blue wrote: Last month, there was interest in a Parquet patch release on PR

Re: [VOTE][SPIP] SPARK-22026 data source v2 write path

2017-10-16 Thread Cheng Lian
+1 On 10/12/17 20:10, Liwei Lin wrote: +1 ! Cheers, Liwei On Thu, Oct 12, 2017 at 7:11 PM, vaquar khan > wrote: +1 Regards, Vaquar khan On Oct 11, 2017 10:14 PM, "Weichen Xu"

Re: [VOTE] Spark 2.3.0 (RC5)

2018-02-23 Thread Cheng Lian
+1 (binding) Passed all the tests, looks good. Cheng On 2/23/18 15:00, Holden Karau wrote: +1 (binding) PySpark artifacts install in a fresh Py3 virtual env On Feb 23, 2018 7:55 AM, "Denny Lee" > wrote: +1 (non-binding) On

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-16 Thread Cheng Lian
Dongjoon, I didn't follow the original Hive 2.3 discussion closely. I thought the original proposal was to replace Hive 1.2 with Hive 2.3, which seemed risky, and therefore we only introduced Hive 2.3 under the hadoop-3.2 profile without removing Hive 1.2. But maybe I'm totally wrong here...

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-15 Thread Cheng Lian
Similar to Xiao, my major concern about making Hadoop 3.2 the default Hadoop version is quality control. The current hadoop-3.2 profile covers too many major component upgrades, i.e.: - Hadoop 3.2 - Hive 2.3 - JDK 11 We have already found and fixed some feature and performance

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-15 Thread Cheng Lian
Cc Yuming, Steve, and Dongjoon On Fri, Nov 15, 2019 at 10:37 AM Cheng Lian wrote: > Similar to Xiao, my major concern about making Hadoop 3.2 the default > Hadoop version is quality control. The current hadoop-3.2 profile covers > too many major component upgrades, i.e.: > >

Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

2019-11-19 Thread Cheng Lian
.3` pre-built distribution, how do > you think about this, Sean? > The preparation is already started in another email thread and I believe > that is a keystone to prove `Hive 2.3` version stability > (which Cheng/Hyukjin/you asked). > > Bests, > Dongjoon. > > > On Tue, Nov 19, 20

Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

2019-11-19 Thread Cheng Lian
Dongjoon, I'm with Hyukjin. There should be at least one Spark 3.x minor release to stabilize Hive 2.3 code paths before retiring the Hive 1.2 fork. Even today, the Hive 2.3.6 version bundled in Spark 3.0 is still buggy in terms of JDK 11 support. (BTW, I just found that our root POM is referring

Re: The Myth: the forked Hive 1.2.1 is stabler than XXX

2019-11-20 Thread Cheng Lian
/Hive versions in Spark 3.0, I personally do not have a preference as long as the above two are met. On Wed, Nov 20, 2019 at 3:22 PM Cheng Lian wrote: > Dongjoon, I don't think we have any conflicts here. As stated in other > threads multiple times, as long as Hive 2.3 and Hadoop 3.2 v

Re: The Myth: the forked Hive 1.2.1 is stabler than XXX

2019-11-20 Thread Cheng Lian
Oh, actually, in order to decouple Hadoop 3.2 and Hive 2.3 upgrades, we will need a hive-2.3 profile anyway, no matter having the hive-1.2 profile or not. On Wed, Nov 20, 2019 at 3:33 PM Cheng Lian wrote: > Just to summarize my points: > >1. Let's still keep the Hive 1.2 dependency

Re: The Myth: the forked Hive 1.2.1 is stabler than XXX

2019-11-20 Thread Cheng Lian
n't want to interact with this > Hive 1.2 fork, they can always use Hive 2.3 at their own risks. > > Specifically, what about having a profile `hive-1.2` at `3.0.0` with the > default Hive 2.3 pom at least? > How do you think about that way, Cheng? > > Bests, > Dongjoon

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-20 Thread Cheng Lian
adoop-3.2 > profile. > > What do you mean by "only meaningful under the hadoop-3.2 profile"? > > On Tue, Nov 19, 2019 at 5:40 PM Cheng Lian wrote: > >> Hey Steve, >> >> In terms of Maven artifact, I don't think the default Hadoop version >>

Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

2019-11-20 Thread Cheng Lian
Sean, thanks for the corner cases you listed. They make a lot of sense. Now I do incline to have Hive 2.3 as the default version. Dongjoon, apologize if I didn't make it clear before. What made me concerned initially was only the following part: > can we remove the usage of forked `hive` in

Re: The Myth: the forked Hive 1.2.1 is stabler than XXX

2019-11-20 Thread Cheng Lian
Hey Dongjoon and Felix, I totally agree that Hive 2.3 is more stable than Hive 1.2. Otherwise, we wouldn't even consider integrating with Hive 2.3 in Spark 3.0. However, *"Hive" and "Hive integration in Spark" are two quite different things*, and I don't think anybody has ever mentioned "the

Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

2019-11-19 Thread Cheng Lian
been reluctant to (1) and (2) due to its burden. >> But, it's time to prepare. Without them, we are going to be insufficient >> again and again. >> >> Bests, >> Dongjoon. >> >> >> >> >> On Tue, Nov 19, 2019 at 9:26 AM Cheng Lian wrote: >&

Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

2019-11-19 Thread Cheng Lian
It's kinda like Scala version upgrade. Historically, we only remove the support of an older Scala version when the newer version is proven to be stable after one or more Spark minor versions. On Tue, Nov 19, 2019 at 2:07 PM Cheng Lian wrote: > Hmm, what exactly did you mean by "remove t

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-19 Thread Cheng Lian
; >> ------ >> *From:* Steve Loughran >> *Sent:* Sunday, November 17, 2019 9:22:09 AM >> *To:* Cheng Lian >> *Cc:* Sean Owen ; Wenchen Fan ; >> Dongjoon Hyun ; dev ; >> Yuming Wang >> *Subject:* Re: Use Hadoop-3.2 as a