Suggest to work around the org.eclipse.jetty.orbit problem with SBT 0.13.2-RC1

2014-03-25 Thread Cheng Lian
Hi all,

Due to an Ivy bug (https://issues.apache.org/jira/browse/IVY-899), SBT
tries to download .orbit files instead of .jar files, which causes problems. This
bug has been fixed in Ivy 2.3.0, but SBT 0.13.1 still uses Ivy 2.0. Aaron
kindly provided a workaround in PR #183
(https://github.com/apache/incubator-spark/pull/183),
but I'm afraid that explicitly depending on javax.servlet alone is not enough.
I'm not entirely sure about this because I'm facing both this issue and a
ridiculously unstable network environment, which makes reproducing the bug
extremely time consuming (sbt gen-idea takes at least half an hour to
complete, and the generated result is broken; most of that time is spent in
dependency resolution).
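For those who want to try the explicit-dependency route anyway, the usual shape of
that kind of workaround in an SBT build is something like the following (the artifact
version here is only illustrative and may not match what PR #183 actually pins):

  // Force Ivy to fetch the real jar instead of the bogus .orbit file (IVY-899).
  // The version string below is illustrative only.
  libraryDependencies += "org.eclipse.jetty.orbit" % "javax.servlet" % "3.0.0.v201112011016" artifacts Artifact("javax.servlet", "jar", "jar")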

In the end, I worked around this issue by updating my local SBT to 0.13.2-RC1.
If any of you are experiencing a similar problem, I suggest upgrading your
local SBT version. Since SBT 0.13.2-RC1 is not an official release, we have
to build it from source:

$ git clone g...@github.com:sbt/sbt.git sbt-0.13.2-rc1-src-home
$ cd sbt-0.13.2-rc1-src-home
$ git checkout v0.13.2-RC1

Now ensure you have SBT 0.13.1 installed, since the latest stable version is
required for bootstrapping:

$ sbt publishLocal
$ mv ~/.sbt/boot /tmp
$ cd sbt-0.13.1-home/bin
$ mv sbt-launch.jar sbt-launch-0.13.1.jar
$ ln -sf sbt-0.13.2-rc1-src-home/target/sbt-launch-0.13.2-RC1.jar sbt-launch.jar

Now you should be able to build Spark without worrying about .orbit files. Hope
this helps.

Cheng


Re: new JDBC server test cases seem to have failed?

2014-07-28 Thread Cheng Lian
Noticed that Nan's PR is not related to SQL, but the JDBC test suites still got
executed. Then I checked the PRs of all the Jenkins builds that failed because of
the JDBC suites, and it turns out that none of them touched SQL code. The JDBC
code is only included in the assembly jar when the hive-thriftserver build
profile is enabled. So it seems that the root cause is a recent Maven build
change that makes the JDBC suites always get executed, while they fail because the
JDBC code isn't included in the assembly jar. This also explains why I can't
reproduce it locally (I always enable the hive-thriftserver profile), and why,
once the build fails, all JDBC suites fail together.
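For reference, the gating happens through a Maven profile in the parent POM; a
simplified sketch of what that looks like (treat this as illustrative rather than
the exact POM contents):

  <profile>
    <id>hive-thriftserver</id>
    <modules>
      <module>sql/hive-thriftserver</module>
    </modules>
  </profile>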

Working on a patch to fix this. Thanks to Patrick for helping debug this!

On Jul 28, 2014, at 10:07 AM, Cheng Lian l...@databricks.com wrote:

 I’m looking into this, will fix this ASAP, sorry for the inconvenience.
 
 On Jul 28, 2014, at 9:47 AM, Patrick Wendell pwend...@gmail.com wrote:
 
 I'm going to revert it again - Cheng can you try to look into this? Thanks.
 
 On Sun, Jul 27, 2014 at 6:06 PM, Nan Zhu zhunanmcg...@gmail.com wrote:
 it's 20 minutes ago
 
 https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17259/consoleFull
 
 --
 Nan Zhu
 
 
 On Sunday, July 27, 2014 at 8:53 PM, Michael Armbrust wrote:
 
 How recent is this? We've already reverted this patch once due to failing
 tests. It would be helpful to include a link to the failed build. If its
 failing again we'll have to revert again.
 
 
 On Sun, Jul 27, 2014 at 5:26 PM, Nan Zhu zhunanmcg...@gmail.com 
 (mailto:zhunanmcg...@gmail.com) wrote:
 
 Hi, all
 
 It seems that the JDBC test cases failed unexpectedly in Jenkins?
 
 
 [info] - test query execution against a Hive Thrift server *** FAILED ***
 [info]   java.sql.SQLException: Could not open connection to jdbc:hive2://localhost:45518/: java.net.ConnectException: Connection refused
 [info]   at org.apache.hive.jdbc.HiveConnection.openTransport(HiveConnection.java:146)
 [info]   at org.apache.hive.jdbc.HiveConnection.init(HiveConnection.java:123)
 [info]   at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:105)
 [info]   at java.sql.DriverManager.getConnection(DriverManager.java:571)
 [info]   at java.sql.DriverManager.getConnection(DriverManager.java:215)
 [info]   at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2Suite.getConnection(HiveThriftServer2Suite.scala:131)
 [info]   at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2Suite.createStatement(HiveThriftServer2Suite.scala:134)
 [info]   at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2Suite$$anonfun$1.apply$mcV$sp(HiveThriftServer2Suite.scala:110)
 [info]   at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2Suite$$anonfun$1.apply(HiveThriftServer2Suite.scala:107)
 [info]   at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2Suite$$anonfun$1.apply(HiveThriftServer2Suite.scala:107)
 [info]   ...
 [info]   Cause: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused
 [info]   at org.apache.thrift.transport.TSocket.open(TSocket.java:185)
 [info]   at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:248)
 [info]   at org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:37)
 [info]   at org.apache.hive.jdbc.HiveConnection.openTransport(HiveConnection.java:144)
 [info]   at org.apache.hive.jdbc.HiveConnection.init(HiveConnection.java:123)
 [info]   at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:105)
 [info]   at java.sql.DriverManager.getConnection(DriverManager.java:571)
 [info]   at java.sql.DriverManager.getConnection(DriverManager.java:215)
 [info]   at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2Suite.getConnection(HiveThriftServer2Suite.scala:131)
 [info]   at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2Suite.createStatement(HiveThriftServer2Suite.scala:134)
 [info]   ...
 [info]   Cause: java.net.ConnectException: Connection refused
 [info]   at java.net.PlainSocketImpl.socketConnect(Native Method)
 [info]   at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
 [info]   at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
 [info]   at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
 [info]   at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
 [info]   at java.net.Socket.connect(Socket.java:579)
 [info]   at org.apache.thrift.transport.TSocket.open(TSocket.java:180)
 [info]   at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:248)
 [info]   at org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:37)
 [info]   at org.apache.hive.jdbc.HiveConnection.openTransport(HiveConnection.java:144)
 [info]   ...
 [info] CliSuite:
 Executing: create table hive_test1(key int, val string);, expecting output: OK
 [warn] four warnings found
 [warn] Note: /home/jenkins/workspace/SparkPullRequestBuilder@4/core

Re: Working Formula for Hive 0.13?

2014-07-28 Thread Cheng Lian
AFAIK, according to a recent talk, the Hulu team in China has built Spark SQL
against Hive 0.13 (or 0.13.1?) successfully. Basically they also
re-packaged Hive 0.13 the way the Spark team did. The slides of the talk
haven't been released yet though.


On Tue, Jul 29, 2014 at 1:01 AM, Ted Yu yuzhih...@gmail.com wrote:

 Owen helped me find this:
 https://issues.apache.org/jira/browse/HIVE-7423

 I guess this means that for Hive 0.14, Spark should be able to directly
 pull in hive-exec-core.jar

 Cheers


 On Mon, Jul 28, 2014 at 9:55 AM, Patrick Wendell pwend...@gmail.com
 wrote:

  It would be great if the hive team can fix that issue. If not, we'll
  have to continue forking our own version of Hive to change the way it
  publishes artifacts.
 
  - Patrick
 
  On Mon, Jul 28, 2014 at 9:34 AM, Ted Yu yuzhih...@gmail.com wrote:
   Talked with Owen offline. He confirmed that as of 0.13, hive-exec is
  still
   uber jar.
  
   Right now I am facing the following error building against Hive 0.13.1
 :
  
   [ERROR] Failed to execute goal on project spark-hive_2.10: Could not
   resolve dependencies for project
   org.apache.spark:spark-hive_2.10:jar:1.1.0-SNAPSHOT: The following
   artifacts could not be resolved:
   org.spark-project.hive:hive-metastore:jar:0.13.1,
   org.spark-project.hive:hive-exec:jar:0.13.1,
   org.spark-project.hive:hive-serde:jar:0.13.1: Failure to find
   org.spark-project.hive:hive-metastore:jar:0.13.1 in
   http://repo.maven.apache.org/maven2 was cached in the local
 repository,
   resolution will not be reattempted until the update interval of
  maven-repo
   has elapsed or updates are forced - [Help 1]
  
   Some hint would be appreciated.
  
   Cheers
  
  
   On Mon, Jul 28, 2014 at 9:15 AM, Sean Owen so...@cloudera.com wrote:
  
   Yes, it is published. As of previous versions, at least, hive-exec
   included all of its dependencies *in its artifact*, making it unusable
   as-is because it contained copies of dependencies that clash with
   versions present in other artifacts, and can't be managed with Maven
   mechanisms.
  
   I am not sure why hive-exec was not published normally, with just its
   own classes. That's why it was copied, into an artifact with just
   hive-exec code.
  
   You could do the same thing for hive-exec 0.13.1.
   Or maybe someone knows that it's published more 'normally' now.
   I don't think hive-metastore is related to this question?
  
   I am no expert on the Hive artifacts, just remembering what the issue
   was initially in case it helps you get to a similar solution.
  
   On Mon, Jul 28, 2014 at 4:47 PM, Ted Yu yuzhih...@gmail.com wrote:
hive-exec (as of 0.13.1) is published here:
   
  
 
 http://search.maven.org/#artifactdetails%7Corg.apache.hive%7Chive-exec%7C0.13.1%7Cjar
   
Should a JIRA be opened so that dependency on hive-metastore can be
replaced by dependency on hive-exec ?
   
Cheers
   
   
On Mon, Jul 28, 2014 at 8:26 AM, Sean Owen so...@cloudera.com
  wrote:
   
The reason for org.spark-project.hive is that Spark relies on
hive-exec, but the Hive project does not publish this artifact by
itself, only with all its dependencies as an uber jar. Maybe that's
been improved. If so, you need to point at the new hive-exec and
perhaps sort out its dependencies manually in your build.
   
On Mon, Jul 28, 2014 at 4:01 PM, Ted Yu yuzhih...@gmail.com
 wrote:
 I found 0.13.1 artifacts in maven:

   
  
 
 http://search.maven.org/#artifactdetails%7Corg.apache.hive%7Chive-metastore%7C0.13.1%7Cjar

 However, Spark uses groupId of org.spark-project.hive, not
org.apache.hive

 Can someone tell me how it is supposed to work ?

 Cheers


 On Mon, Jul 28, 2014 at 7:44 AM, Steve Nunez 
  snu...@hortonworks.com
wrote:

 I saw a note earlier, perhaps on the user list, that at least
 one
person is
 using Hive 0.13. Anyone got a working build configuration for
 this
version
 of Hive?

 Regards,
 - Steve




   
  
 



Re: Working Formula for Hive 0.13?

2014-07-28 Thread Cheng Lian
Exactly. I forgot to mention that the Hulu team also made changes to cope with those
incompatibility issues, but they said that part is relatively easy once the
re-packaging work is done.


On Tue, Jul 29, 2014 at 1:20 AM, Patrick Wendell pwend...@gmail.com wrote:

 I've heard from Cloudera that there were hive internal changes between
 0.12 and 0.13 that required code re-writing. Over time it might be
 possible for us to integrate with hive using API's that are more
 stable (this is the domain of Michael/Cheng/Yin more than me!). It
 would be interesting to see what the Hulu folks did.

 - Patrick

 On Mon, Jul 28, 2014 at 10:16 AM, Cheng Lian lian.cs@gmail.com
 wrote:
  AFAIK, according to a recent talk, the Hulu team in China has built Spark SQL
  against Hive 0.13 (or 0.13.1?) successfully. Basically they also
  re-packaged Hive 0.13 the way the Spark team did. The slides of the talk
  haven't been released yet though.
 
 
  On Tue, Jul 29, 2014 at 1:01 AM, Ted Yu yuzhih...@gmail.com wrote:
 
  Owen helped me find this:
  https://issues.apache.org/jira/browse/HIVE-7423
 
  I guess this means that for Hive 0.14, Spark should be able to directly
  pull in hive-exec-core.jar
 
  Cheers
 
 
  On Mon, Jul 28, 2014 at 9:55 AM, Patrick Wendell pwend...@gmail.com
  wrote:
 
   It would be great if the hive team can fix that issue. If not, we'll
   have to continue forking our own version of Hive to change the way it
   publishes artifacts.
  
   - Patrick
  
   On Mon, Jul 28, 2014 at 9:34 AM, Ted Yu yuzhih...@gmail.com wrote:
Talked with Owen offline. He confirmed that as of 0.13, hive-exec is
   still
uber jar.
   
Right now I am facing the following error building against Hive
 0.13.1
  :
   
[ERROR] Failed to execute goal on project spark-hive_2.10: Could not
resolve dependencies for project
org.apache.spark:spark-hive_2.10:jar:1.1.0-SNAPSHOT: The following
artifacts could not be resolved:
org.spark-project.hive:hive-metastore:jar:0.13.1,
org.spark-project.hive:hive-exec:jar:0.13.1,
org.spark-project.hive:hive-serde:jar:0.13.1: Failure to find
org.spark-project.hive:hive-metastore:jar:0.13.1 in
http://repo.maven.apache.org/maven2 was cached in the local
  repository,
resolution will not be reattempted until the update interval of
   maven-repo
has elapsed or updates are forced - [Help 1]
   
Some hint would be appreciated.
   
Cheers
   
   
On Mon, Jul 28, 2014 at 9:15 AM, Sean Owen so...@cloudera.com
 wrote:
   
Yes, it is published. As of previous versions, at least, hive-exec
included all of its dependencies *in its artifact*, making it
 unusable
as-is because it contained copies of dependencies that clash with
versions present in other artifacts, and can't be managed with
 Maven
mechanisms.
   
I am not sure why hive-exec was not published normally, with just
 its
own classes. That's why it was copied, into an artifact with just
hive-exec code.
   
You could do the same thing for hive-exec 0.13.1.
Or maybe someone knows that it's published more 'normally' now.
I don't think hive-metastore is related to this question?
   
I am no expert on the Hive artifacts, just remembering what the
 issue
was initially in case it helps you get to a similar solution.
   
On Mon, Jul 28, 2014 at 4:47 PM, Ted Yu yuzhih...@gmail.com
 wrote:
 hive-exec (as of 0.13.1) is published here:

   
  
 
 http://search.maven.org/#artifactdetails%7Corg.apache.hive%7Chive-exec%7C0.13.1%7Cjar

 Should a JIRA be opened so that dependency on hive-metastore can
 be
 replaced by dependency on hive-exec ?

 Cheers


 On Mon, Jul 28, 2014 at 8:26 AM, Sean Owen so...@cloudera.com
   wrote:

 The reason for org.spark-project.hive is that Spark relies on
 hive-exec, but the Hive project does not publish this artifact
 by
 itself, only with all its dependencies as an uber jar. Maybe
 that's
 been improved. If so, you need to point at the new hive-exec and
 perhaps sort out its dependencies manually in your build.

 On Mon, Jul 28, 2014 at 4:01 PM, Ted Yu yuzhih...@gmail.com
  wrote:
  I found 0.13.1 artifacts in maven:
 

   
  
 
 http://search.maven.org/#artifactdetails%7Corg.apache.hive%7Chive-metastore%7C0.13.1%7Cjar
 
  However, Spark uses groupId of org.spark-project.hive, not
 org.apache.hive
 
  Can someone tell me how it is supposed to work ?
 
  Cheers
 
 
  On Mon, Jul 28, 2014 at 7:44 AM, Steve Nunez 
   snu...@hortonworks.com
 wrote:
 
  I saw a note earlier, perhaps on the user list, that at least
  one
 person is
  using Hive 0.13. Anyone got a working build configuration for
  this
 version
  of Hive?
 
  Regards,
  - Steve
 
 
 

Re: How to run specific sparkSQL test with maven

2014-08-01 Thread Cheng Lian
It’s also useful to set hive.exec.mode.local.auto to true to accelerate the
test.
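
For example, to run a single SQL suite with Maven (the suite name below is just an
example), and to enable Hive's local mode for the test run:

  $ mvn -Phive -DwildcardSuites=org.apache.spark.sql.hive.execution.HiveQuerySuite test

  # hive.exec.mode.local.auto can be turned on in the hive-site.xml picked up by
  # the tests (the exact location depends on your setup):
  #   <property>
  #     <name>hive.exec.mode.local.auto</name>
  #     <value>true</value>
  #   </property>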


On Sat, Aug 2, 2014 at 1:36 AM, Michael Armbrust mich...@databricks.com
wrote:

 
  It seems that the HiveCompatibilitySuite needs a Hadoop and Hive
  environment, am I right?
 
  Relative path in absolute URI:
  file:$%7Bsystem:test.tmp.dir%7D/tmp_showcrt1”
 

 You should only need Hadoop and Hive if you are creating new tests that we
 need to compute the answers for.  Existing tests are run with cached
 answers.  There are details about the configuration here:
 https://github.com/apache/spark/tree/master/sql



Re: spark-shell is broken! (bad option: '--master')

2014-08-08 Thread Cheng Lian
Just opened a PR based on the branch Patrick mentioned for this issue
https://github.com/apache/spark/pull/1864


On Sat, Aug 9, 2014 at 6:48 AM, Patrick Wendell pwend...@gmail.com wrote:

 Cheng Lian also has a fix for this. I've asked him to make a PR - he
 is on China time so it probably won't come until tonight:


 https://github.com/liancheng/spark/compare/apache:master...liancheng:spark-2894

 On Fri, Aug 8, 2014 at 3:46 PM, Sandy Ryza sandy.r...@cloudera.com
 wrote:
  Hi Chutium,
 
  This is currently being addressed in
  https://github.com/apache/spark/pull/1825
 
  -Sandy
 
 
  On Fri, Aug 8, 2014 at 2:26 PM, chutium teng@gmail.com wrote:
 
  no one use spark-shell in master branch?
 
  i created a PR as follow up commit of SPARK-2678 and PR #1801:
 
  https://github.com/apache/spark/pull/1861
 
 
 
  --
  View this message in context:
 
 http://apache-spark-developers-list.1001551.n3.nabble.com/spark-shell-is-broken-bad-option-master-tp7778p7780.html
  Sent from the Apache Spark Developers List mailing list archive at
  Nabble.com.
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
  For additional commands, e-mail: dev-h...@spark.apache.org
 
 

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: [sql]enable spark sql cli support spark sql

2014-08-14 Thread Cheng Lian
In the long run, as Michael suggested in his Spark Summit 14 talk, we’d like to 
implement SQL-92, maybe with the help of Optiq.

On Aug 15, 2014, at 1:13 PM, Cheng, Hao hao.ch...@intel.com wrote:

 Actually the SQL parser (another SQL dialect in Spark SQL) is quite weak and 
 only supports some basic queries; I'm not sure what the plan is for its enhancement.
 
 -Original Message-
 From: scwf [mailto:wangf...@huawei.com] 
 Sent: Friday, August 15, 2014 11:22 AM
 To: dev@spark.apache.org
 Subject: [sql]enable spark sql cli support spark sql
 
 hi all,
   now spark sql cli only support spark hql, i think we can enable this cli to 
 support spark sql, do you think it's necessary?
 
 -- 
 
 Best Regards
 Fei Wang
 
 
 
 
 
 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional 
 commands, e-mail: dev-h...@spark.apache.org
 
 
 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org
 


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: mvn test error

2014-08-18 Thread Cheng Lian
The exception indicates that the forked process didn't execute as
expected, thus the test case *should* fail.

Instead of replacing the exception with a logWarning, capturing and
printing the stdout/stderr of the forked process would be more helpful for
diagnosis. Currently the only information we have at hand is the process
exit code, which makes it hard to determine why the forked process failed.


On Tue, Aug 19, 2014 at 1:27 PM, scwf wangf...@huawei.com wrote:

 hi, all
   I notice that jenkins may also throw this error when running tests
 (https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18688/consoleFull).


 This is because in Utils.executeAndGetOutput our process exit code is not
 0; maybe we should logWarning here rather than throw an exception?

 Utils.executeAndGetOutput {
   val exitCode = process.waitFor()
   stdoutThread.join()   // Wait for it to finish reading output
   if (exitCode != 0) {
     throw new SparkException("Process " + command + " exited with code " + exitCode)
   }
 }

 any idea?



 On 2014/8/15 11:01, scwf wrote:

 env: ubuntu 14.04 + spark master branch

 mvn -Pyarn -Phive -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean
 package

 mvn -Pyarn -Phadoop-2.4 -Phive test

 test error:

 DriverSuite:
 Spark assembly has been built with Hive, including Datanucleus jars on classpath
 - driver should exit after finishing *** FAILED ***
   SparkException was thrown during property evaluation. (DriverSuite.scala:40)
   Message: Process List(./bin/spark-class, org.apache.spark.DriverWithoutCleanup, local) exited with code 1
   Occurred at table row 0 (zero based, not counting headings), which had values (
     master = local
   )

 SparkSubmitSuite:
 Spark assembly has been built with Hive, including Datanucleus jars on classpath
 - launch simple application with spark-submit *** FAILED ***
   org.apache.spark.SparkException: Process List(./bin/spark-submit, --class, org.apache.spark.deploy.SimpleApplicationTest, --name, testApp, --master, local, file:/tmp/1408015655220-0/testJar-1408015655220.jar) exited with code 1
   at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:810)
   at org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311)
   at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply$mcV$sp(SparkSubmitSuite.scala:291)
   at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284)
   at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284)
   at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
   at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
   at org.scalatest.Transformer.apply(Transformer.scala:22)
   ...
 Spark assembly has been built with Hive, including Datanucleus jars on classpath
 - spark submit includes jars passed in through --jar *** FAILED ***
   org.apache.spark.SparkException: Process List(./bin/spark-submit, --class, org.apache.spark.deploy.JarCreationTest, --name, testApp, --master, local-cluster[2,1,512], --jars, file:/tmp/1408015659416-0/testJar-1408015659471.jar,file:/tmp/1408015659472-0/testJar-1408015659513.jar,file:/tmp/1408015659415-0/testJar-1408015659416.jar) exited with code 1
   at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:810)
   at org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311)
   at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply$mcV$sp(SparkSubmitSuite.scala:305)
   at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply(SparkSubmitSuite.scala:294)
   at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply(SparkSubmitSuite.scala:294)
   at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
   at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
   at org.scalatest.Transformer.apply(Transformer.scala:22)
   ...


 but only testing the specific suite as follows works fine:
 mvn -Pyarn -Phadoop-2.4 -Phive -DwildcardSuites=org.apache.spark.DriverSuite test

 it seems that when running with mvn -Pyarn -Phadoop-2.4 -Phive test, the process
 started by Utils.executeAndGetOutput cannot exit successfully
 (exit code is not zero)

 anyone have an idea about this?






 --

 Best Regards
 Fei Wang

 
 



 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: mvn test error

2014-08-19 Thread Cheng Lian
Just FYI, I thought this might be helpful: I'm refactoring the Hive Thrift server
test suites. These suites also fork new processes and suffer from similar
issues. Stdout and stderr of the forked processes are logged in the new version
of the test suites with utilities from the scala.sys.process package:
https://github.com/apache/spark/pull/1856/files
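
To illustrate the approach (this is just a sketch, not the code in that PR): with
scala.sys.process you can attach a ProcessLogger so that whatever the child process
writes is available when the exit code is non-zero. The command shown is only an example.

  import scala.sys.process._

  // Sketch: run a forked command and keep its output for diagnosis.
  val output = new StringBuilder
  val logger = ProcessLogger(
    line => output.append("[stdout] ").append(line).append('\n'),
    line => output.append("[stderr] ").append(line).append('\n'))
  val exitCode = Process(Seq("./bin/spark-class", "org.apache.spark.SomeTestMain")).!(logger)
  if (exitCode != 0) {
    sys.error(s"Forked process exited with code $exitCode:\n$output")
  }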


On Tue, Aug 19, 2014 at 2:55 PM, scwf wangf...@huawei.com wrote:

 hi,Cheng Lian
   thanks, printing stdout/stderr of the forked process is more reasonable.

 On 2014/8/19 13:35, Cheng Lian wrote:

 The exception indicates that the forked process didn't execute as
 expected, thus the test case *should* fail.

 Instead of replacing the exception with a logWarning, capturing and
 printing the stdout/stderr of the forked process would be more helpful for
 diagnosis. Currently the only information we have at hand is the process
 exit code, which makes it hard to determine why the forked process failed.



 On Tue, Aug 19, 2014 at 1:27 PM, scwf wangf...@huawei.com wrote:

 hi, all
   I notice that jenkins may also throw this error when running tests
 (https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18688/consoleFull).



 This is because in Utils.executeAndGetOutput our process exit code is
 not 0; maybe we should logWarning here rather than throw an exception?

 Utils.executeAndGetOutput {
   val exitCode = process.waitFor()
   stdoutThread.join()   // Wait for it to finish reading output
   if (exitCode != 0) {
     throw new SparkException("Process " + command + " exited with code " + exitCode)
   }
 }

 any idea?



 On 2014/8/15 11:01, scwf wrote:

 env: ubuntu 14.04 + spark master branch

 mvn -Pyarn -Phive -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package

 mvn -Pyarn -Phadoop-2.4 -Phive test

 test error:

 DriverSuite:
 Spark assembly has been built with Hive, including Datanucleus jars on classpath
 - driver should exit after finishing *** FAILED ***
   SparkException was thrown during property evaluation. (DriverSuite.scala:40)
   Message: Process List(./bin/spark-class, org.apache.spark.DriverWithoutCleanup, local) exited with code 1

   Occurred at table row 0 (zero based, not counting headings), which had values (
     master = local
   )

 SparkSubmitSuite:
 Spark assembly has been built with Hive, including Datanucleus jars on classpath
 - launch simple application with spark-submit *** FAILED ***
   org.apache.spark.SparkException: Process List(./bin/spark-submit, --class, org.apache.spark.deploy.SimpleApplicationTest, --name, testApp, --master, local, file:/tmp/1408015655220-0/testJar-1408015655220.jar) exited with code 1

   at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:810)
   at org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311)
   at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply$mcV$sp(SparkSubmitSuite.scala:291)
   at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284)
   at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284)
   at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
   at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
   at org.scalatest.Transformer.apply(Transformer.scala:22)

   ...
 Spark assembly has been built with Hive, including Datanucleus jars on classpath
 - spark submit includes jars passed in through --jar *** FAILED ***
   org.apache.spark.SparkException: Process List(./bin/spark-submit, --class, org.apache.spark.deploy.JarCreationTest, --name, testApp, --master, local-cluster[2,1,512], --jars, file:/tmp/1408015659416-0/testJar-1408015659471.jar,file:/tmp/1408015659472-0/testJar-1408015659513.jar,file:/tmp/1408015659415-0/testJar-1408015659416.jar) exited with code 1
   at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:810)
   at org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311)
   at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply$mcV$sp(SparkSubmitSuite.scala:305)
   at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply(SparkSubmitSuite.scala:294)
   at org.apache.spark.deploy

Re: RDD replication in Spark

2014-08-27 Thread Cheng Lian
You may start from here
https://github.com/apache/spark/blob/4fa2fda88fc7beebb579ba808e400113b512533b/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L706-L712
.
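
At the user level, replication is just a matter of picking a replicated storage level;
the peer selection itself happens inside BlockManager around the lines linked above.
A minimal example, where rdd is any RDD you have already created:

  import org.apache.spark.storage.StorageLevel

  // Keep two copies of each cached block; BlockManager picks the peer
  // that receives the replica when the block is stored.
  rdd.persist(StorageLevel.MEMORY_AND_DISK_2)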


On Mon, Aug 25, 2014 at 9:05 PM, rapelly kartheek kartheek.m...@gmail.com
wrote:

 Hi,

 I've exercised the multiple options available for persist(), including RDD
 replication. I have gone through the classes involved in caching/storing
 the RDDs at different levels. The StorageLevel class plays a pivotal role by
 recording whether to use memory or disk and whether to replicate the RDD on
 multiple nodes.
 The class LocationIterator iterates over the preferred machines one by
 one  for
 each partition that is replicated. I got a rough idea of CoalescedRDD.
 Please correct me if I am wrong.

 But I am looking for the code that chooses the resources to replicate the
 RDDs. Can someone please tell me how replication takes place and how do we
 choose the resources for replication. I just want to know as to where
 should I look into to understand how the replication happens.



 Thank you so much!!!

 regards

 -Karthik



Re: HiveContext, schemaRDD.printSchema get different dataTypes, feature or a bug? really strange and surprised...

2014-08-27 Thread Cheng Lian
I believe in your case, the “magic” happens in TableReader.fillObject
https://github.com/apache/spark/blob/4fa2fda88fc7beebb579ba808e400113b512533b/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L706-L712.
Here we unwrap the field value according to the object inspector of that
field. It seems that somehow a FloatObjectInspector is specified for the
total_price field. I don’t think CSVSerde is responsible for this, since it
sets all field object inspectors to javaStringObjectInspector (here
https://github.com/ogrodnek/csv-serde/blob/f315c1ae4b21a8288eb939e7c10f3b29c1a854ef/src/main/java/com/bizo/hive/serde/csv/CSVSerde.java#L59-L61
).

Which version of Spark SQL are you using? If you are using a snapshot
version, please provide the exact Git commit hash. Thanks!
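
For reference, a SerDe that exposes every column as a string usually builds its row
inspector roughly like this (a simplified sketch in Scala, not the actual CSVSerde code):

  import java.util.{ArrayList => JArrayList}
  import org.apache.hadoop.hive.serde2.objectinspector.{ObjectInspector, ObjectInspectorFactory}
  import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory

  // Every field gets javaStringObjectInspector, so readers unwrap each value as a string.
  def stringRowInspector(columnNames: Seq[String]): ObjectInspector = {
    val names = new JArrayList[String]()
    val inspectors = new JArrayList[ObjectInspector]()
    columnNames.foreach { name =>
      names.add(name)
      inspectors.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector)
    }
    ObjectInspectorFactory.getStandardStructObjectInspector(names, inspectors)
  }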


On Tue, Aug 26, 2014 at 8:29 AM, chutium teng@gmail.com wrote:

 oops, i tried on a managed table, column types will not be changed

 so it is mostly due to the serde lib CSVSerDe
 (
 https://github.com/ogrodnek/csv-serde/blob/master/src/main/java/com/bizo/hive/serde/csv/CSVSerde.java#L123
 )
 or maybe CSVReader from opencsv?...

 but if the columns are defined as string, no matter what type returned from
 custom SerDe or CSVReader, they should be cast to string at the end right?

 why not use the schema from the hive metadata directly?



 --
 View this message in context:
 http://apache-spark-developers-list.1001551.n3.nabble.com/HiveContext-schemaRDD-printSchema-get-different-dataTypes-feature-or-a-bug-really-strange-and-surpri-tp8035p8039.html
 Sent from the Apache Spark Developers List mailing list archive at
 Nabble.com.

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: deleted: sql/hive/src/test/resources/golden/case sensitivity on windows

2014-08-28 Thread Cheng Lian
A colon is not allowed in a Windows file name, and I think Git just
cannot create this file while cloning. Removing the colon from the name string
of this test case
https://github.com/chouqin/spark/blob/76e3ba4264c4a0bc2c33ae6ac862fc40bc302d83/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveQuerySuite.scala#L312
should solve the problem.

Would you mind filing a JIRA and a PR to fix this?
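
For reference, the change is basically just a rename of the test, something along
these lines (the replacement name is only a suggestion):

  // Before: the golden file name inherits the ':' from the test name,
  // which Windows cannot represent.
  //   createQueryTest("case sensitivity: Hive table", ...)
  // After: any colon-free name works.
  //   createQueryTest("case sensitivity when using Hive table", ...)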


On Thu, Aug 28, 2014 at 1:26 AM, 洪奇 qiping@alibaba-inc.com wrote:

 Hi,

 I want to contribute some code to mllib, I forked apache/spark to my own
 repository (chouqin/spark),
 and used `git clone https://github.com/chouqin/spark.git` to check out the
 code on my Windows system.
 In this directory, I run `git status` before doing anything, it output
 this:

 ```
 On branch master
 Your branch is up-to-date with 'origin/master'.

 Changes not staged for commit:
 (use "git add/rm <file>..." to update what will be committed)
 (use "git checkout -- <file>..." to discard changes in working directory)

 deleted: sql/hive/src/test/resources/golden/case sensitivity: Hive
 table-0-5d14d21a239daa42b086cc895215009a
 ```

 I don't know why, because nothing has been done. If I want to make some
 change, I have to be careful not to commit this deletion of the file.
 This is very inconvenient for me because I always use `git add .` to
 stage all changes; now I have to add every file individually.

 Can someone give me any suggestions on how to deal with this? My system is
 Windows 7 and my git version is 1.9.2.msysgit.0.
 Thanks for your help.
 Qiping


Re: Jira tickets for starter tasks

2014-08-28 Thread Cheng Lian
You can just start the work :)


On Thu, Aug 28, 2014 at 3:52 PM, Bill Bejeck bbej...@gmail.com wrote:

 Hi,

 How do I get a starter task jira ticket assigned to myself? Or do I just do
 the work and issue a pull request with the associated jira number?

 Thanks,
 Bill



Re: [VOTE] Release Apache Spark 1.1.0 (RC2)

2014-08-28 Thread Cheng Lian
+1. Tested Spark SQL Thrift server and CLI against a single node standalone
cluster.


On Thu, Aug 28, 2014 at 9:27 PM, Timothy Chen tnac...@gmail.com wrote:

 +1. make-distribution works, and I also tested simple Spark jobs on Spark
 on Mesos on an 8-node Mesos cluster.

 Tim

 On Thu, Aug 28, 2014 at 8:53 PM, Burak Yavuz bya...@stanford.edu wrote:
  +1. Tested MLlib algorithms on Amazon EC2, algorithms show speed-ups
 between 1.5-5x compared to the 1.0.2 release.
 
  - Original Message -
  From: Patrick Wendell pwend...@gmail.com
  To: dev@spark.apache.org
  Sent: Thursday, August 28, 2014 8:32:11 PM
  Subject: Re: [VOTE] Release Apache Spark 1.1.0 (RC2)
 
  I'll kick off the vote with a +1.
 
  On Thu, Aug 28, 2014 at 7:14 PM, Patrick Wendell pwend...@gmail.com
 wrote:
  Please vote on releasing the following candidate as Apache Spark
 version 1.1.0!
 
  The tag to be voted on is v1.1.0-rc2 (commit 711aebb3):
 
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=711aebb329ca28046396af1e34395a0df92b5327
 
  The release files, including signatures, digests, etc. can be found at:
  http://people.apache.org/~pwendell/spark-1.1.0-rc2/
 
  Release artifacts are signed with the following key:
  https://people.apache.org/keys/committer/pwendell.asc
 
  The staging repository for this release can be found at:
  https://repository.apache.org/content/repositories/orgapachespark-1029/
 
  The documentation corresponding to this release can be found at:
  http://people.apache.org/~pwendell/spark-1.1.0-rc2-docs/
 
  Please vote on releasing this package as Apache Spark 1.1.0!
 
  The vote is open until Monday, September 01, at 03:11 UTC and passes if
  a majority of at least 3 +1 PMC votes are cast.
 
  [ ] +1 Release this package as Apache Spark 1.1.0
  [ ] -1 Do not release this package because ...
 
  To learn more about Apache Spark, please see
  http://spark.apache.org/
 
  == Regressions fixed since RC1 ==
  LZ4 compression issue: https://issues.apache.org/jira/browse/SPARK-3277
 
  == What justifies a -1 vote for this release? ==
  This vote is happening very late into the QA period compared with
  previous votes, so -1 votes should only occur for significant
  regressions from 1.0.2. Bugs already present in 1.0.X will not block
  this release.
 
  == What default changes should I be aware of? ==
  1. The default value of spark.io.compression.codec is now snappy
  -- Old behavior can be restored by switching to lzf
 
  2. PySpark now performs external spilling during aggregations.
  -- Old behavior can be restored by setting spark.shuffle.spill to
 false.
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
  For additional commands, e-mail: dev-h...@spark.apache.org
 
 
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
  For additional commands, e-mail: dev-h...@spark.apache.org
 

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: [VOTE] Release Apache Spark 1.1.0 (RC2)

2014-08-29 Thread Cheng Lian
Just noticed one thing: although --with-hive is deprecated by -Phive,
make-distribution.sh still relies on $SPARK_HIVE (which was controlled by
--with-hive) to determine whether to include datanucleus jar files. This
means we have to do something like SPARK_HIVE=true ./make-distribution.sh
... to enable Hive support. Otherwise datanucleus jars are not included in
lib/.

This issue is similar to SPARK-3234
https://issues.apache.org/jira/browse/SPARK-3234, both
SPARK_HADOOP_VERSION and SPARK_HIVE are controlled by some deprecated
command line options.
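
In other words, until this is cleaned up, building a Hive-enabled distribution looks
something like this (the Maven flags are just an example):

  $ SPARK_HIVE=true ./make-distribution.sh --tgz -Phive -Phadoop-2.4 -Dhadoop.version=2.4.0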


On Fri, Aug 29, 2014 at 11:18 AM, Patrick Wendell pwend...@gmail.com
wrote:

 Oh darn - I missed this update. GRR, unfortunately I think this means
 I'll need to cut a new RC. Thanks for catching this Nick.

 On Fri, Aug 29, 2014 at 10:18 AM, Nicholas Chammas
 nicholas.cham...@gmail.com wrote:
  [Let me know if I should be posting these comments in a different
 thread.]
 
  Should the default Spark version in spark-ec2 be updated for this
 release?
 
  Nick
 
 
 
  On Fri, Aug 29, 2014 at 12:55 PM, Patrick Wendell pwend...@gmail.com
  wrote:
 
  Hey Nicholas,
 
  Thanks for this, we can merge in doc changes outside of the actual
  release timeline, so we'll make sure to loop those changes in before
  we publish the final 1.1 docs.
 
  - Patrick
 
  On Fri, Aug 29, 2014 at 9:24 AM, Nicholas Chammas
  nicholas.cham...@gmail.com wrote:
   There were several formatting and typographical errors in the SQL docs
   that
   I've fixed in this PR. Dunno if we want to roll that into the release.
  
  
   On Fri, Aug 29, 2014 at 12:17 PM, Patrick Wendell pwend...@gmail.com
 
   wrote:
  
   Okay I'll plan to add cdh4 binary as well for the final release!
  
   ---
   sent from my phone
   On Aug 29, 2014 8:26 AM, Ye Xianjin advance...@gmail.com wrote:
  
We just used CDH 4.7 for our production cluster. And I believe we
won't
use CDH 5 in the next year.
   
Sent from my iPhone
   
  On August 29, 2014, at 14:39, Matei Zaharia matei.zaha...@gmail.com
  wrote:

 Personally I'd actually consider putting CDH4 back if there are
 still
users on it. It's always better to be inclusive, and the
 convenience
of
a
one-click download is high. Do we have a sense on what % of CDH
 users
still
use CDH4?

 Matei

 On August 28, 2014 at 11:31:13 PM, Sean Owen (so...@cloudera.com
 )
 wrote:

 (Copying my reply since I don't know if it goes to the mailing
 list)

 Great, thanks for explaining the reasoning. You're saying these
 aren't
 going into the final release? I think that moots any issue
 surrounding
 distributing them then.

 This is all I know of from the ASF:
 https://community.apache.org/projectIndependence.html I don't
 read
 it
 as expressly forbidding this kind of thing although you can see
 how
 it
 bumps up against the spirit. There's not a bright line -- what
 about
 Tomcat providing binaries compiled for Windows for example? does
 that
 favor an OS vendor?

 From this technical ASF perspective only the releases matter --
 do
 what you want with snapshots and RCs. The only issue there is
 maybe
 releasing something different than was in the RC; is that at all
 confusing? Just needs a note.

 I think this theoretical issue doesn't exist if these binaries
 aren't
 released, so I see no reason to not proceed.

 The rest is a different question about whether you want to spend
 time
 maintaining this profile and candidate. The vendor already
 manages
 their build I think and -- and I don't know -- may even prefer
 not
 to
 have a different special build floating around. There's also the
 theoretical argument that this turns off other vendors from
 adopting
 Spark if it's perceived to be too connected to other vendors. I'd
 like
 to maximize Spark's distribution and there's some argument you do
 this
 by not making vendor profiles. But as I say a different question
 to
 just think about over time...

 (oh and PS for my part I think it's a good thing that CDH4
 binaries
 were removed. I wasn't arguing for resurrecting them)

 On Fri, Aug 29, 2014 at 7:26 AM, Patrick Wendell
 pwend...@gmail.com
wrote:
 Hey Sean,

 The reason there are no longer CDH-specific builds is that all
 newer
 versions of CDH and HDP work with builds for the upstream Hadoop
 projects. I dropped CDH4 in favor of a newer Hadoop version
 (2.4)
 and
 the Hadoop-without-Hive (also 2.4) build.

 For MapR - we can't officially post those artifacts on ASF web
 space
 when we make the final release, we can only link to them as
 being
 hosted by MapR specifically since they use non-compatible
 licenses.
 However, I felt that providing these during a testing period was
 alright, 

Re: about spark assembly jar

2014-09-02 Thread Cheng Lian
Yea, SSD + SPARK_PREPEND_CLASSES totally changed my life :)

Maybe we should add a developer notes page to document all this useful
black magic.
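
For anyone who hasn't tried it, the flow is roughly as follows (the exact targets are
just an example):

  $ sbt/sbt assembly                  # full build once
  $ export SPARK_PREPEND_CLASSES=true
  $ sbt/sbt compile                   # after code changes, recompile only
  $ ./bin/spark-shell                 # picks up the freshly compiled classes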


On Tue, Sep 2, 2014 at 10:54 AM, Reynold Xin r...@databricks.com wrote:

 Having an SSD helps tremendously with assembly time.

 Without that, you can do the following in order for Spark to pick up the
 compiled classes before assembly at runtime.

 export SPARK_PREPEND_CLASSES=true


 On Tue, Sep 2, 2014 at 9:10 AM, Sandy Ryza sandy.r...@cloudera.com
 wrote:

  This doesn't help for every dependency, but Spark provides an option to
  build the assembly jar without Hadoop and its dependencies.  We make use
 of
  this in CDH packaging.
 
  -Sandy
 
 
  On Tue, Sep 2, 2014 at 2:12 AM, scwf wangf...@huawei.com wrote:
 
    Hi Sean Owen,
    here are some problems I ran into when using the assembly jar:
    1. I put spark-assembly-*.jar into the lib directory of my application, and it
    throws a compile error
  
   Error:scalac: Error: class scala.reflect.BeanInfo not found.
   scala.tools.nsc.MissingRequirementError: class scala.reflect.BeanInfo
 not
   found.
  
   at scala.tools.nsc.symtab.Definitions$definitions$.
   getModuleOrClass(Definitions.scala:655)
  
   at scala.tools.nsc.symtab.Definitions$definitions$.
   getClass(Definitions.scala:608)
  
   at scala.tools.nsc.backend.jvm.GenJVM$BytecodeGenerator.
   init(GenJVM.scala:127)
  
   at scala.tools.nsc.backend.jvm.GenJVM$JvmPhase.run(GenJVM.
   scala:85)
  
   at scala.tools.nsc.Global$Run.compileSources(Global.scala:953)
  
   at scala.tools.nsc.Global$Run.compile(Global.scala:1041)
  
   at xsbt.CachedCompiler0.run(CompilerInterface.scala:126)
  
   at
  xsbt.CachedCompiler0.liftedTree1$1(CompilerInterface.scala:102)
  
   at xsbt.CachedCompiler0.run(CompilerInterface.scala:102)
  
   at xsbt.CompilerInterface.run(CompilerInterface.scala:27)
  
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  
   at sun.reflect.NativeMethodAccessorImpl.invoke(
   NativeMethodAccessorImpl.java:39)
  
   at sun.reflect.DelegatingMethodAccessorImpl.invoke(
   DelegatingMethodAccessorImpl.java:25)
  
   at java.lang.reflect.Method.invoke(Method.java:597)
  
   at sbt.compiler.AnalyzingCompiler.call(
   AnalyzingCompiler.scala:102)
  
   at sbt.compiler.AnalyzingCompiler.compile(
   AnalyzingCompiler.scala:48)
  
   at sbt.compiler.AnalyzingCompiler.compile(
   AnalyzingCompiler.scala:41)
  
   at org.jetbrains.jps.incremental.scala.local.
   IdeaIncrementalCompiler.compile(IdeaIncrementalCompiler.scala:28)
  
   at org.jetbrains.jps.incremental.scala.local.LocalServer.
   compile(LocalServer.scala:25)
  
   at org.jetbrains.jps.incremental.scala.remote.Main$.make(Main.
   scala:58)
  
   at org.jetbrains.jps.incremental.scala.remote.Main$.nailMain(
   Main.scala:21)
  
   at org.jetbrains.jps.incremental.scala.remote.Main.nailMain(
   Main.scala)
  
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  
   at sun.reflect.NativeMethodAccessorImpl.invoke(
   NativeMethodAccessorImpl.java:39)
  
   at sun.reflect.DelegatingMethodAccessorImpl.invoke(
   DelegatingMethodAccessorImpl.java:25)
  
   at java.lang.reflect.Method.invoke(Method.java:597)
  
   at
 com.martiansoftware.nailgun.NGSession.run(NGSession.java:319)
    2. I tested my branch, which updates the hive version to org.apache.hive 0.13.1;
    it runs successfully when using a bag of 3rd-party jars as dependencies, but
    throws an error when using the assembly jar. It seems the assembly jar leads to
    a conflict:
 ERROR DDLTask: java.lang.NoSuchFieldError: doubleTypeInfo
   at org.apache.hadoop.hive.ql.io.parquet.serde.
   ArrayWritableObjectInspector.getObjectInspector(
   ArrayWritableObjectInspector.java:66)
   at org.apache.hadoop.hive.ql.io.parquet.serde.
  
 ArrayWritableObjectInspector.init(ArrayWritableObjectInspector.java:59)
   at org.apache.hadoop.hive.ql.io.parquet.serde.
   ParquetHiveSerDe.initialize(ParquetHiveSerDe.java:113)
   at org.apache.hadoop.hive.metastore.MetaStoreUtils.
   getDeserializer(MetaStoreUtils.java:339)
   at org.apache.hadoop.hive.ql.metadata.Table.
   getDeserializerFromMetaStore(Table.java:283)
   at org.apache.hadoop.hive.ql.metadata.Table.checkValidity(
   Table.java:189)
   at org.apache.hadoop.hive.ql.metadata.Hive.createTable(
   Hive.java:597)
   at org.apache.hadoop.hive.ql.exec.DDLTask.createTable(
   DDLTask.java:4194)
   at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.
   java:281)
   at
 org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:153)
   at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(
   TaskRunner.java:85)
  
  
  
  
  
   On 2014/9/2 16:45, Sean Owen wrote:
  
   Hm, are you suggesting that the Spark 

Re: about spark assembly jar

2014-09-02 Thread Cheng Lian
Cool, didn't notice that, thanks Josh!


On Tue, Sep 2, 2014 at 11:55 AM, Josh Rosen rosenvi...@gmail.com wrote:

 SPARK_PREPEND_CLASSES is documented on the Spark Wiki (which could
 probably be easier to find):
 https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools


 On September 2, 2014 at 11:53:49 AM, Cheng Lian (lian.cs@gmail.com)
 wrote:

 Yea, SSD + SPARK_PREPEND_CLASSES totally changed my life :)

 Maybe we should add a developer notes page to document all these useful
 black magic.


 On Tue, Sep 2, 2014 at 10:54 AM, Reynold Xin r...@databricks.com wrote:

  Having a SSD help tremendously with assembly time.
 
  Without that, you can do the following in order for Spark to pick up the
  compiled classes before assembly at runtime.
 
  export SPARK_PREPEND_CLASSES=true
 
 
  On Tue, Sep 2, 2014 at 9:10 AM, Sandy Ryza sandy.r...@cloudera.com
  wrote:
 
   This doesn't help for every dependency, but Spark provides an option
 to
   build the assembly jar without Hadoop and its dependencies. We make
 use
  of
   this in CDH packaging.
  
   -Sandy
  
  
   On Tue, Sep 2, 2014 at 2:12 AM, scwf wangf...@huawei.com wrote:
  
Hi sean owen,
here are some problems when i used assembly jar
1 i put spark-assembly-*.jar to the lib directory of my application,
 it
throw compile error
   
Error:scalac: Error: class scala.reflect.BeanInfo not found.
scala.tools.nsc.MissingRequirementError: class
 scala.reflect.BeanInfo
  not
found.
   
at scala.tools.nsc.symtab.Definitions$definitions$.
getModuleOrClass(Definitions.scala:655)
   
at scala.tools.nsc.symtab.Definitions$definitions$.
getClass(Definitions.scala:608)
   
at scala.tools.nsc.backend.jvm.GenJVM$BytecodeGenerator.
init(GenJVM.scala:127)
   
at scala.tools.nsc.backend.jvm.GenJVM$JvmPhase.run(GenJVM.
scala:85)
   
at scala.tools.nsc.Global$Run.compileSources(Global.scala:953)
   
at scala.tools.nsc.Global$Run.compile(Global.scala:1041)
   
at xsbt.CachedCompiler0.run(CompilerInterface.scala:126)
   
at
   xsbt.CachedCompiler0.liftedTree1$1(CompilerInterface.scala:102)
   
at xsbt.CachedCompiler0.run(CompilerInterface.scala:102)
   
at xsbt.CompilerInterface.run(CompilerInterface.scala:27)
   
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   
at sun.reflect.NativeMethodAccessorImpl.invoke(
NativeMethodAccessorImpl.java:39)
   
at sun.reflect.DelegatingMethodAccessorImpl.invoke(
DelegatingMethodAccessorImpl.java:25)
   
at java.lang.reflect.Method.invoke(Method.java:597)
   
at sbt.compiler.AnalyzingCompiler.call(
AnalyzingCompiler.scala:102)
   
at sbt.compiler.AnalyzingCompiler.compile(
AnalyzingCompiler.scala:48)
   
at sbt.compiler.AnalyzingCompiler.compile(
AnalyzingCompiler.scala:41)
   
at org.jetbrains.jps.incremental.scala.local.
IdeaIncrementalCompiler.compile(IdeaIncrementalCompiler.scala:28)
   
at org.jetbrains.jps.incremental.scala.local.LocalServer.
compile(LocalServer.scala:25)
   
at org.jetbrains.jps.incremental.scala.remote.Main$.make(Main.
scala:58)
   
at org.jetbrains.jps.incremental.scala.remote.Main$.nailMain(
Main.scala:21)
   
at org.jetbrains.jps.incremental.scala.remote.Main.nailMain(
Main.scala)
   
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   
at sun.reflect.NativeMethodAccessorImpl.invoke(
NativeMethodAccessorImpl.java:39)
   
at sun.reflect.DelegatingMethodAccessorImpl.invoke(
DelegatingMethodAccessorImpl.java:25)
   
at java.lang.reflect.Method.invoke(Method.java:597)
   
at
  com.martiansoftware.nailgun.NGSession.run(NGSession.java:319)
2 i test my branch which updated hive version to org.apache.hive
 0.13.1
it run successfully when use a bag of 3rd jars as dependency but
  throw
error using assembly jar, it seems assembly jar lead to conflict
ERROR DDLTask: java.lang.NoSuchFieldError: doubleTypeInfo
at org.apache.hadoop.hive.ql.io.parquet.serde.
ArrayWritableObjectInspector.getObjectInspector(
ArrayWritableObjectInspector.java:66)
at org.apache.hadoop.hive.ql.io.parquet.serde.
   
 
 ArrayWritableObjectInspector.init(ArrayWritableObjectInspector.java:59)
at org.apache.hadoop.hive.ql.io.parquet.serde.
ParquetHiveSerDe.initialize(ParquetHiveSerDe.java:113)
at org.apache.hadoop.hive.metastore.MetaStoreUtils.
getDeserializer(MetaStoreUtils.java:339)
at org.apache.hadoop.hive.ql.metadata.Table.
getDeserializerFromMetaStore(Table.java:283)
at org.apache.hadoop.hive.ql.metadata.Table.checkValidity(
Table.java:189)
at org.apache.hadoop.hive.ql.metadata.Hive.createTable(
Hive.java:597)
at org.apache.hadoop.hive.ql.exec.DDLTask.createTable(
DDLTask.java:4194)
at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.
java:281)
at
  org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java

Re: hey spark developers! intro from shane knapp, devops engineer @ AMPLab

2014-09-02 Thread Cheng Lian
Welcome Shane! Glad to see that finally a hero has stepped up to tame Jenkins :)


On Tue, Sep 2, 2014 at 12:44 PM, Henry Saputra henry.sapu...@gmail.com
wrote:

 Welcome Shane =)


 - Henry

 On Tue, Sep 2, 2014 at 10:35 AM, shane knapp skn...@berkeley.edu wrote:
  so, i had a meeting w/the databricks guys on friday and they recommended
 i
  send an email out to the list to say 'hi' and give you guys a quick
 intro.
   :)
 
  hi!  i'm shane knapp, the new AMPLab devops engineer, and will be
 spending
  time getting the jenkins build infrastructure up to production quality.
   much of this will be 'under the covers' work, like better system level
  auth, backups, etc, but some will definitely be user facing:  timely
  jenkins updates, debugging broken build infrastructure and some plugin
  support.
 
  i've been working in the bay area now since 1997 at many different
  companies, and my last 10 years has been split between google and
 palantir.
   i'm a huge proponent of OSS, and am really happy to be able to help with
  the work you guys are doing!
 
  if anyone has any requests/questions/comments, feel free to drop me a
 line!
 
  shane

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: [VOTE] Release Apache Spark 1.1.0 (RC3)

2014-09-02 Thread Cheng Lian
+1

   - Tested Thrift server and SQL CLI locally on OSX 10.9.
   - Checked datanucleus dependencies in distribution tarball built by
   make-distribution.sh without SPARK_HIVE defined.



On Tue, Sep 2, 2014 at 2:30 PM, Will Benton wi...@redhat.com wrote:

 +1

 Tested Scala/MLlib apps on Fedora 20 (OpenJDK 7) and OS X 10.9 (Oracle JDK
 8).


 best,
 wb


 - Original Message -
  From: Patrick Wendell pwend...@gmail.com
  To: dev@spark.apache.org
  Sent: Saturday, August 30, 2014 5:07:52 PM
  Subject: [VOTE] Release Apache Spark 1.1.0 (RC3)
 
  Please vote on releasing the following candidate as Apache Spark version
  1.1.0!
 
  The tag to be voted on is v1.1.0-rc3 (commit b2d0493b):
 
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=b2d0493b223c5f98a593bb6d7372706cc02bebad
 
  The release files, including signatures, digests, etc. can be found at:
  http://people.apache.org/~pwendell/spark-1.1.0-rc3/
 
  Release artifacts are signed with the following key:
  https://people.apache.org/keys/committer/pwendell.asc
 
  The staging repository for this release can be found at:
  https://repository.apache.org/content/repositories/orgapachespark-1030/
 
  The documentation corresponding to this release can be found at:
  http://people.apache.org/~pwendell/spark-1.1.0-rc3-docs/
 
  Please vote on releasing this package as Apache Spark 1.1.0!
 
  The vote is open until Tuesday, September 02, at 23:07 UTC and passes if
  a majority of at least 3 +1 PMC votes are cast.
 
  [ ] +1 Release this package as Apache Spark 1.1.0
  [ ] -1 Do not release this package because ...
 
  To learn more about Apache Spark, please see
  http://spark.apache.org/
 
  == Regressions fixed since RC1 ==
  - Build issue for SQL support:
  https://issues.apache.org/jira/browse/SPARK-3234
  - EC2 script version bump to 1.1.0.
 
  == What justifies a -1 vote for this release? ==
  This vote is happening very late into the QA period compared with
  previous votes, so -1 votes should only occur for significant
  regressions from 1.0.2. Bugs already present in 1.0.X will not block
  this release.
 
  == What default changes should I be aware of? ==
  1. The default value of spark.io.compression.codec is now snappy
  -- Old behavior can be restored by switching to lzf
 
  2. PySpark now performs external spilling during aggregations.
  -- Old behavior can be restored by setting spark.shuffle.spill to
 false.
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
  For additional commands, e-mail: dev-h...@spark.apache.org
 
 

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: [VOTE] Release Apache Spark 1.1.0 (RC4)

2014-09-03 Thread Cheng Lian
+1.

Tested locally on OSX 10.9, built with Hadoop 2.4.1

- Checked Datanucleus jar files
- Tested Spark SQL Thrift server and CLI under local mode and standalone
cluster against MySQL backed metastore



On Wed, Sep 3, 2014 at 11:25 AM, Josh Rosen rosenvi...@gmail.com wrote:

 +1.  Tested on Windows and EC2.  Confirmed that the EC2 pvm-hvm switch
 fixed the SPARK-3358 regression.


 On September 3, 2014 at 10:33:45 AM, Marcelo Vanzin (van...@cloudera.com)
 wrote:

 +1 (non-binding)

 - checked checksums of a few packages
 - ran few jobs against yarn client/cluster using hadoop2.3 package
 - played with spark-shell in yarn-client mode

 On Wed, Sep 3, 2014 at 12:24 AM, Patrick Wendell pwend...@gmail.com
 wrote:
  Please vote on releasing the following candidate as Apache Spark version
 1.1.0!
 
  The tag to be voted on is v1.1.0-rc4 (commit 2f9b2bd):
 
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=2f9b2bd7844ee8393dc9c319f4fefedf95f5e460
 
  The release files, including signatures, digests, etc. can be found at:
  http://people.apache.org/~pwendell/spark-1.1.0-rc4/
 
  Release artifacts are signed with the following key:
  https://people.apache.org/keys/committer/pwendell.asc
 
  The staging repository for this release can be found at:
  https://repository.apache.org/content/repositories/orgapachespark-1031/
 
  The documentation corresponding to this release can be found at:
  http://people.apache.org/~pwendell/spark-1.1.0-rc4-docs/
 
  Please vote on releasing this package as Apache Spark 1.1.0!
 
  The vote is open until Saturday, September 06, at 08:30 UTC and passes if
  a majority of at least 3 +1 PMC votes are cast.
 
  [ ] +1 Release this package as Apache Spark 1.1.0
  [ ] -1 Do not release this package because ...
 
  To learn more about Apache Spark, please see
  http://spark.apache.org/
 
  == Regressions fixed since RC3 ==
  SPARK-3332 - Issue with tagging in EC2 scripts
  SPARK-3358 - Issue with regression for m3.XX instances
 
  == What justifies a -1 vote for this release? ==
  This vote is happening very late into the QA period compared with
  previous votes, so -1 votes should only occur for significant
  regressions from 1.0.2. Bugs already present in 1.0.X will not block
  this release.
 
  == What default changes should I be aware of? ==
  1. The default value of spark.io.compression.codec is now snappy
  -- Old behavior can be restored by switching to lzf
 
  2. PySpark now performs external spilling during aggregations.
  -- Old behavior can be restored by setting spark.shuffle.spill to
 false.
 
  3. PySpark uses a new heuristic for determining the parallelism of
  shuffle operations.
  -- Old behavior can be restored by setting
  spark.default.parallelism to the number of cores in the cluster.
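
  For anyone who wants the old behavior back, a rough sketch of setting these
  from an application via SparkConf (the app name and the parallelism value of
  16 are placeholders, not values from this thread; the same keys can also go
  into spark-defaults.conf or be passed with --conf):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("restore-1.0-defaults-example")  // placeholder app name
      .set("spark.io.compression.codec", "lzf")    // 1.1.0 default is snappy
      .set("spark.shuffle.spill", "false")         // disable external spilling
      .set("spark.default.parallelism", "16")      // e.g. total cores in the cluster
    val sc = new SparkContext(conf)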
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
  For additional commands, e-mail: dev-h...@spark.apache.org
 



 --
 Marcelo

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: Question about SparkSQL and Hive-on-Spark

2014-09-24 Thread Cheng Lian
I don’t think so. For example, we’ve already added extended syntax like CACHE
TABLE.
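
A minimal sketch of that extended syntax in use (the table name "src" is just a
placeholder, and sc is assumed to be an existing SparkContext):

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    hiveContext.sql("CACHE TABLE src")  // Spark SQL extension, not part of plain HiveQL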

On Wed, Sep 24, 2014 at 3:27 PM, Yi Tian tianyi.asiai...@gmail.com wrote:

 Hi Reynold!

 Will SparkSQL strictly obey the HQL syntax?

 For example, the cube function.

 In other words, should the hiveContext of SparkSQL only implement a
 subset of HQL features?


 Best Regards,

 Yi Tian
 tianyi.asiai...@gmail.com




 On Sep 23, 2014, at 15:49, Reynold Xin r...@databricks.com wrote:

 
  On Tue, Sep 23, 2014 at 12:47 AM, Yi Tian tianyi.asiai...@gmail.com
 wrote:
  Hi all,
 
  I have some questions about the SparkSQL and Hive-on-Spark
 
  Will SparkSQL support all the Hive features in the future? Or just treat
 Hive as a data source for Spark?
 
  Most likely not *ALL* Hive features, but almost all common features.
 
 
  From Spark 1.1.0 , we have thrift-server support running hql on spark.
 Will this feature be replaced by Hive on Spark?
 
  No.
 
 
  The reason for asking these questions is that we found some hive
 functions are not  running well on SparkSQL ( like window function, cube
 and rollup function)
 
  Is it worth making the effort to implement these functions in
 SparkSQL? Could you guys give some advice?
 
  Yes absolutely.
 
 
  thank you.
 
 
  Best Regards,
 
  Yi Tian
  tianyi.asiai...@gmail.com
 
 
 
 
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
  For additional commands, e-mail: dev-h...@spark.apache.org
 
 




Re: SparkSQL: map type MatchError when inserting into Hive table

2014-09-26 Thread Cheng Lian
Would you mind providing the DDL of this partitioned table together
with the query you tried? The stack trace suggests that the query was
trying to cast a map into something else, which is not supported in
Spark SQL. I also doubt whether Hive supports casting a complex type to
some other type.


On 9/27/14 7:48 AM, Du Li wrote:

Hi,

I was loading data into a partitioned table on Spark 1.1.0
beeline-thriftserver. The table has complex data types such as
map<string,string> and array<map<string,string>>. The query is like "insert
overwrite table a partition (...) select ..." and the select clause worked if run
separately. However, when running the insert query, there was an error as
follows.

The source code of Cast.scala seems to only handle the primitive data
types, which is perhaps why the MatchError was thrown.

I just wonder if this is still work in progress, or I should do it
differently.

Thanks,
Du



scala.MatchError: MapType(StringType,StringType,true) (of class
org.apache.spark.sql.catalyst.types.MapType)
 org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:247)
 org.apache.spark.sql.catalyst.expressions.Cast.cast(Cast.scala:247)
 org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:263)
 org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:84)
 org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:66)
 org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:50)
 scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:149)
 org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158)
 org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158)
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
 org.apache.spark.scheduler.Task.run(Task.scala:54)
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 java.lang.Thread.run(Thread.java:722)






-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org




-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Extending Scala style checks

2014-10-01 Thread Cheng Lian
Since we can easily obtain the list of all changed files in a PR, I think
we can start by adding the no-trailing-space check for newly changed
files only?
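
Just to illustrate the idea, here is a rough standalone sketch of such a check,
not the actual Scalastyle rule; it assumes a hypothetical changed-files.txt
listing the files touched by a PR:

    import scala.io.Source

    object TrailingSpaceCheck {
      def main(args: Array[String]): Unit = {
        // One path per line; the file name is a placeholder.
        val changedFiles = Source.fromFile("changed-files.txt").getLines().toSeq
        for {
          path <- changedFiles
          (line, idx) <- Source.fromFile(path).getLines().zipWithIndex
          if line.matches(".*\\s+$")   // line ends with whitespace
        } println(s"$path:${idx + 1}: trailing whitespace")
      }
    }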


On 10/2/14 9:24 AM, Nicholas Chammas wrote:

Yeah, I remember that hell when I added PEP 8 to the build checks and fixed
all the outstanding Python style issues. I had to keep rebasing and
resolving merge conflicts until the PR was merged.

It's a rough process, but thankfully it's also a one-time process. I might
be able to help with that in the next week or two if no-one else wants to
pick it up.

Nick

On Wed, Oct 1, 2014 at 9:20 PM, Michael Armbrust mich...@databricks.com
wrote:


The hard part here is updating the existing code base... which is going to
create merge conflicts with like all of the open PRs...

On Wed, Oct 1, 2014 at 6:13 PM, Nicholas Chammas 
nicholas.cham...@gmail.com wrote:


Ah, since there appears to be a built-in rule for end-of-line whitespace,
Michael and Cheng, y'all should be able to add this in pretty easily.

Nick

On Wed, Oct 1, 2014 at 6:37 PM, Patrick Wendell pwend...@gmail.com
wrote:


Hey Nick,

We can always take built-in rules. Back when we added this Prashant
Sharma actually did some great work that lets us write our own style
rules in cases where rules don't exist.

You can see some existing rules here:



https://github.com/apache/spark/tree/master/project/spark-style/src/main/scala/org/apache/spark/scalastyle

 Prashant has over time contributed a lot of our custom rules upstream
 to scalastyle, so now there are only a couple there.

- Patrick

On Wed, Oct 1, 2014 at 2:36 PM, Ted Yu yuzhih...@gmail.com wrote:

Please take a look at WhitespaceEndOfLineChecker under:
http://www.scalastyle.org/rules-0.1.0.html

Cheers

On Wed, Oct 1, 2014 at 2:01 PM, Nicholas Chammas 

nicholas.cham...@gmail.com

wrote:
As discussed here https://github.com/apache/spark/pull/2619, it would be
good to extend our Scala style checks to programmatically enforce as many
of our style rules as possible.

Does anyone know if it's relatively straightforward to enforce additional
rules like the no trailing spaces rule mentioned in the linked PR?

Nick






-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: something wrong with Jenkins or something untested merged?

2014-10-21 Thread Cheng Lian
Hm, seems that 7u71 comes back again. Observed similar Kinesis 
compilation error just now: 
https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/410/consoleFull


Checked Jenkins slave nodes, saw /usr/java/latest points to jdk1.7.0_71. 
However, /usr/bin/javac -version says:


   Eclipse Java Compiler 0.894_R34x, 3.4.2 release, Copyright IBM Corp
   2000, 2008. All rights reserved.

Which JDK is actually used by Jenkins?

Cheng

On 10/21/14 8:28 AM, shane knapp wrote:


ok, so earlier today i installed a 2nd JDK within jenkins (7u71), which
fixed the SparkR build but apparently made Spark itself quite unhappy.  i
removed that JDK, triggered a build (
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21943/console),
and it compiled kinesis w/o dying a fiery death.

apparently 7u71 is stricter when compiling.  sad times.

sorry about that!

shane


On Mon, Oct 20, 2014 at 5:16 PM, Patrick Wendell pwend...@gmail.com wrote:


 The failure is in the Kinesis component; can you reproduce this if you
 build with -Pkinesis-asl?

- Patrick

On Mon, Oct 20, 2014 at 5:08 PM, shane knapp skn...@berkeley.edu wrote:

hmm, strange.  i'll take a look.

On Mon, Oct 20, 2014 at 5:11 PM, Nan Zhu zhunanmcg...@gmail.com wrote:


yes, I can compile locally, too

but it seems that Jenkins is not happy now...
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/

All failed to compile

Best,

--
Nan Zhu


On Monday, October 20, 2014 at 7:56 PM, Ted Yu wrote:


I performed build on latest master branch but didn't get compilation

error.

FYI

On Mon, Oct 20, 2014 at 3:51 PM, Nan Zhu zhunanmcg...@gmail.com

(mailto:zhunanmcg...@gmail.com) wrote:

Hi,

I just submitted a patch

https://github.com/apache/spark/pull/2864/files

with one line change

but the Jenkins told me it's failed to compile on the unrelated

files?
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21935/console

Best,

Nan




Re: something wrong with Jenkins or something untested merged?

2014-10-21 Thread Cheng Lian
It's a new pull request builder written by Josh, integrated into our 
state-of-the-art PR dashboard :)


On 10/21/14 9:33 PM, Nan Zhu wrote:

just curious…what is this “NewSparkPullRequestBuilder”?

Best,

--
Nan Zhu

On Tuesday, October 21, 2014 at 8:30 AM, Cheng Lian wrote:

Hm, seems that 7u71 comes back again. Observed similar Kinesis 
compilation error just now: 
https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/410/consoleFull


Checked Jenkins slave nodes, saw /usr/java/latest points to 
jdk1.7.0_71. However, /usr/bin/javac -version says:


Eclipse Java Compiler 0.894_R34x, 3.4.2 release, Copyright IBM
Corp 2000, 2008. All rights reserved.

Which JDK is actually used by Jenkins?

Cheng

On 10/21/14 8:28 AM, shane knapp wrote:


ok, so earlier today i installed a 2nd JDK within jenkins (7u71), which
fixed the SparkR build but apparently made Spark itself quite unhappy.  i
removed that JDK, triggered a build (
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21943/console),
and it compiled kinesis w/o dying a fiery death.

apparently 7u71 is stricter when compiling.  sad times.

sorry about that!

shane


On Mon, Oct 20, 2014 at 5:16 PM, Patrick Wendellpwend...@gmail.com  
mailto:pwend...@gmail.com  wrote:


The failure is in the Kinesis compoent, can you reproduce this if you
build with -Pkinesis-asl?

- Patrick

On Mon, Oct 20, 2014 at 5:08 PM, shane knappskn...@berkeley.edu  
mailto:skn...@berkeley.edu  wrote:

hmm, strange.  i'll take a look.

On Mon, Oct 20, 2014 at 5:11 PM, Nan Zhuzhunanmcg...@gmail.com  
mailto:zhunanmcg...@gmail.com  wrote:


yes, I can compile locally, too

but it seems that Jenkins is not happy now...
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/

All failed to compile

Best,

--
Nan Zhu


On Monday, October 20, 2014 at 7:56 PM, Ted Yu wrote:


I performed build on latest master branch but didn't get compilation

error.

FYI

On Mon, Oct 20, 2014 at 3:51 PM, Nan Zhu zhunanmcg...@gmail.com  
mailto:zhunanmcg...@gmail.com

(mailto:zhunanmcg...@gmail.com) wrote:

Hi,

I just submitted a patch

https://github.com/apache/spark/pull/2864/files

with one line change

but the Jenkins told me it's failed to compile on the unrelated

files?
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21935/console

Best,

Nan







Re: HiveContext bug?

2014-10-28 Thread Cheng Lian
Hi Marcelo, yes, this is a known Spark SQL bug, and we've got PRs to fix it
(#2887 and #2967). They are not merged yet because the newly merged Hive 0.13.1
support causes some conflicts. Thanks for reporting this :)
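
Until one of those PRs is merged, here is a rough sketch of a workaround along
the lines Marcelo describes below. Note that it uses SessionState.start, which
also installs the thread-local session state, rather than
setCurrentSessionState, and the exact calls may differ across Hive versions:

    import org.apache.hadoop.hive.conf.HiveConf
    import org.apache.hadoop.hive.ql.session.SessionState

    val hiveConf = new HiveConf(classOf[SessionState])
    // Make sure the thread-local session state is set before Hive's
    // CommandProcessorFactory / Driver code paths are reached.
    SessionState.start(new SessionState(hiveConf))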

On Tue, Oct 28, 2014 at 6:41 AM, Marcelo Vanzin van...@cloudera.com wrote:

 Well, looks like a huge coincidence, but this was just sent to github:
 https://github.com/apache/spark/pull/2967

 On Mon, Oct 27, 2014 at 3:25 PM, Marcelo Vanzin van...@cloudera.com
 wrote:
  Hey guys,
 
  I've been using the HiveFromSpark example to test some changes and I
  ran into an issue that manifests itself as an NPE inside Hive code
  because some configuration object is null.
 
  Tracing back, it seems that `sessionState` being a lazy val in
  HiveContext is causing it. That variably is only evaluated in [1],
  while the call in [2] causes a Driver to be initialized by [3], which
  the tries to use the thread-local session state ([4]) which hasn't
  been set yet.
 
  This could be seen as a Hive bug ([3] should probably be calling the
  constructor that takes a conf object), but is there a reason why these
  fields are lazy in HiveContext? I explicitly called
  SessionState.setCurrentSessionState() before the
  CommandProcessorFactory call and that seems to fix the issue too.
 
  [1]
 https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala#L305
  [2]
 https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala#L289
  [3]
 https://github.com/apache/hive/blob/9c63b2fdc35387d735f4c9d08761203711d4974b/ql/src/java/org/apache/hadoop/hive/ql/processors/CommandProcessorFactory.java#L104
  [4]
 https://github.com/apache/hive/blob/9c63b2fdc35387d735f4c9d08761203711d4974b/ql/src/java/org/apache/hadoop/hive/ql/Driver.java#L286
 
  --
  Marcelo



 --
 Marcelo

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: best IDE for scala + spark development?

2014-10-28 Thread Cheng Lian
My two cents for Mac Vim/Emacs users: I fixed a Scala ctags Mac compatibility
bug a few months ago, so you may want to use the most recent version here:
https://github.com/scala/scala-dist/blob/master/tool-support/src/emacs/contrib/dot-ctags



On Tue, Oct 28, 2014 at 4:26 PM, Duy Huynh duy.huynh@gmail.com wrote:

 thanks everyone.  i've been using vim and sbt recently, and i really like
 it.  it's lightweight, fast.  plus, ack, ctrl-t, nerdtre, etc. in vim do
 all the good work.

 but, as i'm not familiar with scala/spark api yet, i really wish to have
 these two things in vim + sbt.

 1.  code completion as in intellij (typing long method / class name in
 scala/spark isn't that fun!)

 2.  scala doc on the fly in the text editor (just so i don't have to switch
 back and forth between the text editor and the scala doc)

 did anyone have experience with adding these 2 things to vim?

 thanks!






 On Mon, Oct 27, 2014 at 5:14 PM, Will Benton wi...@redhat.com wrote:

  I'll chime in as yet another user who is extremely happy with sbt and a
  text editor.  (In my experience, running ack from the command line is
  usually just as easy and fast as using an IDE's find-in-project
 facility.)
  You can, of course, extend editors with Scala-specific IDE-like
  functionality (in particular, I am aware of -- but have not used --
 ENSIME
  for emacs or TextMate).
 
  Since you're new to Scala, you may not know that you can run any sbt
  command preceded by a tilde, which will watch files in your project and
 run
  the command when anything changes.  Therefore, running ~compile from
 the
  sbt repl will get you most of the continuous syntax-checking
 functionality
  you can get from an IDE.
 
  best,
  wb
 
  - Original Message -
   From: ll duy.huynh@gmail.com
   To: d...@spark.incubator.apache.org
   Sent: Sunday, October 26, 2014 10:07:20 AM
   Subject: best IDE for scala + spark development?
  
   i'm new to both scala and spark.  what IDE / dev environment do you
 find
  most
   productive for writing code in scala with spark?  is it just vim + sbt?
  or
   does a full IDE like intellij works out better?  thanks!
  
  
  
   --
   View this message in context:
  
 
 http://apache-spark-developers-list.1001551.n3.nabble.com/best-IDE-for-scala-spark-development-tp8965.html
   Sent from the Apache Spark Developers List mailing list archive at
   Nabble.com.
  
   -
   To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
   For additional commands, e-mail: dev-h...@spark.apache.org
  
  
 



Re: HiveShim not found when building in Intellij

2014-10-28 Thread Cheng Lian

Yes, these two combinations work for me.

On 10/29/14 12:32 PM, Zhan Zhang wrote:

"-Phive" is to enable hive-0.13.1, and "-Phive -Phive-0.12.0" is to enable
hive-0.12.0. Note that the thrift-server is not supported yet in hive-0.13, but it is
expected to go upstream soon (SPARK-3720).

Thanks.

Zhan Zhang


  
On Oct 28, 2014, at 9:09 PM, Stephen Boesch java...@gmail.com wrote:



Thanks Patrick for the heads up.

I have not been successful to discover a combination of profiles (i.e.
enabling hive or hive-0.12.0 or hive-13.0) that works in Intellij with
maven. Anyone who knows how to handle this - a quick note here would be
appreciated.



2014-10-28 20:20 GMT-07:00 Patrick Wendell pwend...@gmail.com:


Hey Stephen,

In some cases in the maven build we now have pluggable source
directories based on profiles using the maven build helper plug-in.
This is necessary to support cross building against different Hive
versions, and there will be additional instances of this due to
supporting scala 2.11 and 2.10.

In these cases, you may need to add source locations explicitly to
intellij if you want the entire project to compile there.

Unfortunately as long as we support cross-building like this, it will
be an issue. Intellij's maven support does not correctly detect our
use of the maven-build-plugin to add source directories.

We should come up with a good set of instructions on how to import the
pom files + add the few extra source directories. Off hand I am not
sure exactly what the correct sequence is.

- Patrick

On Tue, Oct 28, 2014 at 7:57 PM, Stephen Boesch java...@gmail.com wrote:

Hi Matei,
  Until my latest pull from upstream/master it had not been necessary to
add the hive profile: is it now??

I am not using sbt gen-idea. The way to open in intellij has been to Open
the parent directory. IJ recognizes it as a maven project.

There are several steps to do surgery on the yarn-parent / yarn projects

,

then do a full rebuild.  That was working until one week ago.
Intellij/maven is presently broken in  two ways:  this hive shim (which

may

yet hopefully be a small/simple fix - let us see) and  (2) the
NoClassDefFoundError
on ThreadFactoryBuilder from my prior emails -and which is quite a

serious

problem .

2014-10-28 19:46 GMT-07:00 Matei Zaharia matei.zaha...@gmail.com:


Hi Stephen,

How did you generate your Maven workspace? You need to make sure the

Hive

profile is enabled for it. For example sbt/sbt -Phive gen-idea.

Matei


On Oct 28, 2014, at 7:42 PM, Stephen Boesch java...@gmail.com

wrote:

I have run on the command line via maven and it is fine:

mvn   -Dscalastyle.failOnViolation=false -DskipTests -Pyarn

-Phadoop-2.3

compile package install


But with the latest code Intellij builds do not work. Following is

one of

26 similar errors:


Error:(173, 38) not found: value HiveShim


Option(tableParameters.get(HiveShim.getStatsSetupConstTotalSize))

^







-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: HiveShim not found when building in Intellij

2014-10-28 Thread Cheng Lian
You may first open the root pom.xml file in IDEA, then go to the menu
View / Tool Windows / Maven Projects and choose the desired Maven profile
combination under the Profiles node (e.g. I usually use hadoop-2.4 +
hive + hive-0.12.0). IDEA will ask you to re-import the Maven projects;
confirm, and then it should be OK.


I can debug within IDEA with this approach. However, you have to clean
the whole project before debugging Spark within IDEA if you compiled the
project outside IDEA. I haven't had time to investigate this annoying issue.


Also, you can remove sub-projects unrelated to your task to speed up
compilation and/or avoid other IDEA build issues (e.g. the Avro-related
Spark Streaming build failure in IDEA).


On 10/29/14 12:42 PM, Stephen Boesch wrote:
I am interested specifically in how to build (and hopefully 
run/debug..) under Intellij.  Your posts sound like command line maven 
- which has always been working already.


Do you have instructions for building in IJ?

2014-10-28 21:38 GMT-07:00 Cheng Lian lian.cs@gmail.com 
mailto:lian.cs@gmail.com:


Yes, these two combinations work for me.


On 10/29/14 12:32 PM, Zhan Zhang wrote:

-Phive is to enable hive-0.13.1 and -Phive -Phive-0.12.0” is
to enable hive-0.12.0. Note that the thrift-server is not
supported yet in hive-0.13, but expected to go to upstream
soon (Spark-3720).

Thanks.

Zhan Zhang


  On Oct 28, 2014, at 9:09 PM, Stephen Boesch
java...@gmail.com mailto:java...@gmail.com wrote:

Thanks Patrick for the heads up.

I have not been successful to discover a combination of
profiles (i.e.
enabling hive or hive-0.12.0 or hive-13.0) that works in
Intellij with
maven. Anyone who knows how to handle this - a quick note
here would be
appreciated.



2014-10-28 20:20 GMT-07:00 Patrick Wendell
pwend...@gmail.com mailto:pwend...@gmail.com:

Hey Stephen,

In some cases in the maven build we now have pluggable
source
directories based on profiles using the maven build
helper plug-in.
This is necessary to support cross building against
different Hive
versions, and there will be additional instances of
this due to
supporting scala 2.11 and 2.10.

In these cases, you may need to add source locations
explicitly to
intellij if you want the entire project to compile there.

Unfortunately as long as we support cross-building
like this, it will
be an issue. Intellij's maven support does not
correctly detect our
use of the maven-build-plugin to add source directories.

We should come up with a good set of instructions on
how to import the
pom files + add the few extra source directories. Off
hand I am not
sure exactly what the correct sequence is.

- Patrick

On Tue, Oct 28, 2014 at 7:57 PM, Stephen Boesch
java...@gmail.com mailto:java...@gmail.com wrote:

Hi Matei,
  Until my latest pull from upstream/master it had
not been necessary to
add the hive profile: is it now??

I am not using sbt gen-idea. The way to open in
intellij has been to Open
the parent directory. IJ recognizes it as a maven
project.

There are several steps to do surgery on the
yarn-parent / yarn projects

,

then do a full rebuild.  That was working until
one week ago.
Intellij/maven is presently broken in  two ways: 
this hive shim (which


may

yet hopefully be a small/simple fix - let us see)
and  (2) the
NoClassDefFoundError
on ThreadFactoryBuilder from my prior emails -and
which is quite a

serious

problem .

2014-10-28 19:46 GMT-07:00 Matei Zaharia
matei.zaha...@gmail.com
mailto:matei.zaha...@gmail.com:

Hi Stephen,

How did you generate your Maven workspace? You
need to make sure the

Hive

profile is enabled for it. For example sbt/sbt
-Phive gen-idea.

Matei

On Oct 28, 2014, at 7:42 PM

Re: HiveShim not found when building in Intellij

2014-10-28 Thread Cheng Lian
Hao Cheng has just written such a from-scratch guide for building
Spark SQL in IDEA. Although it's written in Chinese, I think the
illustrations are descriptive enough on their own.


http://www.cnblogs.com//articles/4058371.html


On 10/29/14 12:45 PM, Patrick Wendell wrote:

Btw - we should have part of the official docs that describes a full
from scratch build in IntelliJ including any gotchas. Then we can
update it if there are build changes that alter it. I created this
JIRA for it:

https://issues.apache.org/jira/browse/SPARK-4128

On Tue, Oct 28, 2014 at 9:42 PM, Stephen Boesch java...@gmail.com wrote:

I am interested specifically in how to build (and hopefully run/debug..)
under Intellij.  Your posts sound like command line maven - which has always
been working already.

Do you have instructions for building in IJ?

2014-10-28 21:38 GMT-07:00 Cheng Lian lian.cs@gmail.com:


Yes, these two combinations work for me.


On 10/29/14 12:32 PM, Zhan Zhang wrote:

-Phive is to enable hive-0.13.1 and -Phive -Phive-0.12.0 is to enable
hive-0.12.0. Note that the thrift-server is not supported yet in hive-0.13,
but expected to go to upstream soon (Spark-3720).

Thanks.

Zhan Zhang


   On Oct 28, 2014, at 9:09 PM, Stephen Boesch java...@gmail.com wrote:


Thanks Patrick for the heads up.

I have not been successful to discover a combination of profiles (i.e.
enabling hive or hive-0.12.0 or hive-13.0) that works in Intellij with
maven. Anyone who knows how to handle this - a quick note here would be
appreciated.



2014-10-28 20:20 GMT-07:00 Patrick Wendell pwend...@gmail.com:


Hey Stephen,

In some cases in the maven build we now have pluggable source
directories based on profiles using the maven build helper plug-in.
This is necessary to support cross building against different Hive
versions, and there will be additional instances of this due to
supporting scala 2.11 and 2.10.

In these cases, you may need to add source locations explicitly to
intellij if you want the entire project to compile there.

Unfortunately as long as we support cross-building like this, it will
be an issue. Intellij's maven support does not correctly detect our
use of the maven-build-plugin to add source directories.

We should come up with a good set of instructions on how to import the
pom files + add the few extra source directories. Off hand I am not
sure exactly what the correct sequence is.

- Patrick

On Tue, Oct 28, 2014 at 7:57 PM, Stephen Boesch java...@gmail.com
wrote:

Hi Matei,
   Until my latest pull from upstream/master it had not been necessary
to
add the hive profile: is it now??

I am not using sbt gen-idea. The way to open in intellij has been to
Open
the parent directory. IJ recognizes it as a maven project.

There are several steps to do surgery on the yarn-parent / yarn
projects

,

then do a full rebuild.  That was working until one week ago.
Intellij/maven is presently broken in  two ways:  this hive shim
(which

may

yet hopefully be a small/simple fix - let us see) and  (2) the
NoClassDefFoundError
on ThreadFactoryBuilder from my prior emails -and which is quite a

serious

problem .

2014-10-28 19:46 GMT-07:00 Matei Zaharia matei.zaha...@gmail.com:


Hi Stephen,

How did you generate your Maven workspace? You need to make sure the

Hive

profile is enabled for it. For example sbt/sbt -Phive gen-idea.

Matei


On Oct 28, 2014, at 7:42 PM, Stephen Boesch java...@gmail.com

wrote:

I have run on the command line via maven and it is fine:

mvn   -Dscalastyle.failOnViolation=false -DskipTests -Pyarn

-Phadoop-2.3

compile package install


But with the latest code Intellij builds do not work. Following is

one of

26 similar errors:


Error:(173, 38) not found: value HiveShim


Option(tableParameters.get(HiveShim.getStatsSetupConstTotalSize))

 ^





-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: HiveShim not found when building in Intellij

2014-10-28 Thread Cheng Lian
Hm, the shim source folder used to be recognized automatically, although
at the wrong directory level (sql/hive/v0.12.0/src
instead of sql/hive/v0.12.0/src/main/scala); it still compiled, though.


I just tried against a fresh checkout, and indeed the shim source folder
needs to be added manually. Sorry for the confusion.


Cheng

On 10/29/14 1:05 PM, Patrick Wendell wrote:

Cheng - to make it recognize the new HiveShim for 0.12 I had to click
on spark-hive under packages in the left pane, then go to Open
Module Settings - then explicitly add the v0.12.0/src/main/scala
folder to the sources by navigating to it and then ctrl+click to add
it as a source. Did you have to do this?

On Tue, Oct 28, 2014 at 9:57 PM, Patrick Wendell pwend...@gmail.com wrote:

I just started a totally fresh IntelliJ project importing from our
root pom. I used all the default options and I added hadoop-2.4,
hive, hive-0.13.1 profiles. I was able to run spark core tests from
within IntelliJ. Didn't try anything beyond that, but FWIW this
worked.

- Patrick

On Tue, Oct 28, 2014 at 9:54 PM, Cheng Lian lian.cs@gmail.com wrote:

You may first open the root pom.xml file in IDEA, and then go for menu View
/ Tool Windows / Maven Projects, then choose desired Maven profile
combination under the Profiles node (e.g. I usually use hadoop-2.4 + hive
+ hive-0.12.0). IDEA will ask you to re-import the Maven projects, confirm,
then it should be OK.

I can debug within IDEA with this approach. However, you have to clean the
whole project before debugging Spark within IDEA if you compiled the project
outside IDEA. Haven't got time to investigate this annoying issue.

Also, you can remove sub projects unrelated to your tasks to accelerate
compilation and/or avoid other IDEA build issues (e.g. Avro related Spark
streaming build failure in IDEA).


On 10/29/14 12:42 PM, Stephen Boesch wrote:

I am interested specifically in how to build (and hopefully run/debug..)
under Intellij.  Your posts sound like command line maven - which has always
been working already.

Do you have instructions for building in IJ?

2014-10-28 21:38 GMT-07:00 Cheng Lian lian.cs@gmail.com:

Yes, these two combinations work for me.


On 10/29/14 12:32 PM, Zhan Zhang wrote:

-Phive is to enable hive-0.13.1 and -Phive -Phive-0.12.0 is to enable
hive-0.12.0. Note that the thrift-server is not supported yet in hive-0.13,
but expected to go to upstream soon (Spark-3720).

Thanks.

Zhan Zhang


   On Oct 28, 2014, at 9:09 PM, Stephen Boesch java...@gmail.com wrote:


Thanks Patrick for the heads up.

I have not been successful to discover a combination of profiles (i.e.
enabling hive or hive-0.12.0 or hive-13.0) that works in Intellij with
maven. Anyone who knows how to handle this - a quick note here would be
appreciated.



2014-10-28 20:20 GMT-07:00 Patrick Wendell pwend...@gmail.com:


Hey Stephen,

In some cases in the maven build we now have pluggable source
directories based on profiles using the maven build helper plug-in.
This is necessary to support cross building against different Hive
versions, and there will be additional instances of this due to
supporting scala 2.11 and 2.10.

In these cases, you may need to add source locations explicitly to
intellij if you want the entire project to compile there.

Unfortunately as long as we support cross-building like this, it will
be an issue. Intellij's maven support does not correctly detect our
use of the maven-build-plugin to add source directories.

We should come up with a good set of instructions on how to import the
pom files + add the few extra source directories. Off hand I am not
sure exactly what the correct sequence is.

- Patrick

On Tue, Oct 28, 2014 at 7:57 PM, Stephen Boesch java...@gmail.com
wrote:

Hi Matei,
   Until my latest pull from upstream/master it had not been necessary
to
add the hive profile: is it now??

I am not using sbt gen-idea. The way to open in intellij has been to
Open
the parent directory. IJ recognizes it as a maven project.

There are several steps to do surgery on the yarn-parent / yarn
projects

,

then do a full rebuild.  That was working until one week ago.
Intellij/maven is presently broken in  two ways:  this hive shim
(which

may

yet hopefully be a small/simple fix - let us see) and  (2) the
NoClassDefFoundError
on ThreadFactoryBuilder from my prior emails -and which is quite a

serious

problem .

2014-10-28 19:46 GMT-07:00 Matei Zaharia matei.zaha...@gmail.com:


Hi Stephen,

How did you generate your Maven workspace? You need to make sure the

Hive

profile is enabled for it. For example sbt/sbt -Phive gen-idea.

Matei


On Oct 28, 2014, at 7:42 PM, Stephen Boesch java...@gmail.com

wrote:

I have run on the command line via maven and it is fine:

mvn   -Dscalastyle.failOnViolation=false -DskipTests -Pyarn

-Phadoop-2.3

compile package install


But with the latest code Intellij builds do not work. Following is

one of

26 similar errors:


Error:(173, 38) not found: value HiveShim

Re: sbt scala compiler crashes on spark-sql

2014-11-02 Thread Cheng Lian
I often see this when I first build the whole Spark project with SBT, then
modify some code and try to build and debug within IDEA, or vice versa. A
clean rebuild always solves it for me.

On Mon, Nov 3, 2014 at 11:28 AM, Patrick Wendell pwend...@gmail.com wrote:

 Does this happen if you clean and recompile? I've seen failures on and
 off, but haven't been able to find one that I could reproduce from a
 clean build such that we could hand it to the scala team.

 - Patrick

 On Sun, Nov 2, 2014 at 7:25 PM, Imran Rashid im...@therashids.com wrote:
  I'm finding the scala compiler crashes when I compile the spark-sql
 project
  in sbt.  This happens in both the 1.1 branch and master (full error
  below).  The other projects build fine in sbt, and everything builds fine
  in maven.  is there some sbt option I'm forgetting?  Any one else
  experiencing this?
 
  Also, are there up-to-date instructions on how to do common dev tasks in
  both sbt and maven?  I have only found these instructions on building with
  maven:
 
  http://spark.apache.org/docs/latest/building-with-maven.html
 
  and some general info here:
 
  https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
 
  but I think this doesn't walk through a lot of the steps of a typical dev
  cycle, eg, continuous compilation, running one test, running one main
  class, etc.  (especially since it seems like people still favor sbt for
  dev.)  If it doesn't already exist somewhere, I could try to put
 together a
  brief doc for how to do the basics.  (I'm returning to spark dev after a
  little hiatus myself, and I'm hitting some stumbling blocks that are
  probably common knowledge to everyone still dealing with it all the
 time.)
 
  thanks,
  Imran
 
  --
  full crash info from sbt:
 
  project sql
  [info] Set current project to spark-sql (in build
  file:/Users/imran/spark/spark/)
  compile
  [info] Compiling 62 Scala sources to
  /Users/imran/spark/spark/sql/catalyst/target/scala-2.10/classes...
  [info] Compiling 45 Scala sources and 39 Java sources to
  /Users/imran/spark/spark/sql/core/target/scala-2.10/classes...
  [error]
  [error]  while compiling:
 
 /Users/imran/spark/spark/sql/core/src/main/scala/org/apache/spark/sql/types/util/DataTypeConversions.scala
  [error] during phase: jvm
  [error]  library version: version 2.10.4
  [error] compiler version: version 2.10.4
  [error]   reconstructed args: -classpath
 
 

Re: [VOTE] Designating maintainers for some Spark components

2014-11-05 Thread Cheng Lian
+1 since this is already the de facto model we are using.

On Thu, Nov 6, 2014 at 12:40 PM, Wangfei (X) wangf...@huawei.com wrote:

 +1

 Sent from my iPhone

  On Nov 5, 2014, at 20:06, Denny Lee denny.g@gmail.com wrote:
 
  +1 great idea.
  On Wed, Nov 5, 2014 at 20:04 Xiangrui Meng men...@gmail.com wrote:
 
  +1 (binding)
 
  On Wed, Nov 5, 2014 at 7:52 PM, Mark Hamstra m...@clearstorydata.com
  wrote:
  +1 (binding)
 
  On Wed, Nov 5, 2014 at 6:29 PM, Nicholas Chammas 
  nicholas.cham...@gmail.com
  wrote:
 
  +1 on this proposal.
 
  On Wed, Nov 5, 2014 at 8:55 PM, Nan Zhu zhunanmcg...@gmail.com
 wrote:
 
  Will these maintainers do a cleanup of those pending PRs once we start
  to apply this model?
 
 
  I second Nan's question. I would like to see this initiative drive a
  reduction in the number of stale PRs we have out there. We're
  approaching
  300 open PRs again.
 
  Nick
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
  For additional commands, e-mail: dev-h...@spark.apache.org
 
 

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: thrift jdbc server probably running queries as hive query

2014-11-10 Thread Cheng Lian

Hey Sadhan,

I really don't think this is a Spark log... Unlike Shark, Spark SQL
doesn't even provide a Hive mode to let you execute queries against
Hive. Would you please check whether there is an existing HiveServer2
running there? Spark SQL's HiveThriftServer2 is just a Spark port of
HiveServer2, and they share the same default listening port. I guess the
Thrift server didn't start successfully because HiveServer2 already occupied
the port, and your Beeline session was probably connected to HiveServer2 instead.


Cheng

On 11/11/14 8:29 AM, Sadhan Sood wrote:
I was testing out the spark thrift jdbc server by running a simple 
query in the beeline client. The spark itself is running on a yarn 
cluster.


However, when I run a query in beeline - I see no running jobs in the 
spark UI(completely empty) and the yarn UI seem to indicate that the 
submitted query is being run as a map reduce job. This is probably 
also being indicated from the spark logs but I am not completely sure:


2014-11-11 00:19:00,492 INFO  ql.Context 
(Context.java:getMRScratchDir(267)) - New scratch dir is 
hdfs://:9000/tmp/hive-ubuntu/hive_2014-11-11_00-19-00_367_3847629323646885865-1


2014-11-11 00:19:00,877 INFO  ql.Context 
(Context.java:getMRScratchDir(267)) - New scratch dir is 
hdfs://:9000/tmp/hive-ubuntu/hive_2014-11-11_00-19-00_367_3847629323646885865-2


2014-11-11 00:19:04,152 INFO  ql.Context 
(Context.java:getMRScratchDir(267)) - New scratch dir is 
hdfs://:9000/tmp/hive-ubuntu/hive_2014-11-11_00-19-00_367_3847629323646885865-2


2014-11-11 00:19:04,425 INFO Configuration.deprecation 
(Configuration.java:warnOnceIfDeprecated(1009)) - 
mapred.submit.replication is deprecated. Instead, use 
mapreduce.client.submit.file.replication


2014-11-11 00:19:04,516 INFO  client.RMProxy 
(RMProxy.java:createRMProxy(92)) - Connecting to ResourceManager 
at :8032


2014-11-11 00:19:04,607 INFO  client.RMProxy 
(RMProxy.java:createRMProxy(92)) - Connecting to ResourceManager 
at :8032


2014-11-11 00:19:04,639 WARN mapreduce.JobSubmitter 
(JobSubmitter.java:copyAndConfigureFiles(150)) - Hadoop command-line 
option parsing not performed. Implement the Tool interface and execute 
your application with ToolRunner to remedy this


2014-11-11 00:00:08,806 INFO  input.FileInputFormat 
(FileInputFormat.java:listStatus(287)) - Total input paths to process 
: 14912


2014-11-11 00:00:08,864 INFO  lzo.GPLNativeCodeLoader 
(GPLNativeCodeLoader.java:clinit(34)) - Loaded native gpl library


2014-11-11 00:00:08,866 INFO  lzo.LzoCodec 
(LzoCodec.java:clinit(76)) - Successfully loaded  initialized 
native-lzo library [hadoop-lzo rev 
8e266e052e423af592871e2dfe09d54c03f6a0e8]


2014-11-11 00:00:09,873 INFO  input.CombineFileInputFormat 
(CombineFileInputFormat.java:createSplits(413)) - DEBUG: Terminated 
node allocation with : CompletedNodes: 1, size left: 194541317


2014-11-11 00:00:10,017 INFO  mapreduce.JobSubmitter 
(JobSubmitter.java:submitJobInternal(396)) - number of splits:615


2014-11-11 00:00:10,095 INFO  mapreduce.JobSubmitter 
(JobSubmitter.java:printTokens(479)) - Submitting tokens for job: 
job_1414084656759_0115


2014-11-11 00:00:10,241 INFO  impl.YarnClientImpl 
(YarnClientImpl.java:submitApplication(167)) - Submitted application 
application_1414084656759_0115



It seems like the query is being run as a hive query instead of spark 
query. The same query works fine when run from spark-sql cli.






Re: thrift jdbc server probably running queries as hive query

2014-11-11 Thread Cheng Lian

Hey Sadhan,

Sorry for my previous abrupt reply. Submitting an MR job is definitely
wrong here; I'm investigating. Would you mind providing the
Spark/Hive/Hadoop versions you are using? If you're using the most recent
master branch, a concrete commit SHA-1 would be very helpful.


Thanks!
Cheng


On 11/12/14 12:34 AM, Sadhan Sood wrote:

Hi Cheng,

I made sure the only hive server running on the machine is 
hivethriftserver2.


/usr/lib/jvm/default-java/bin/java -cp 
/usr/lib/hadoop/lib/hadoop-lzo.jar::/mnt/sadhan/spark-3/sbin/../conf:/mnt/sadhan/spark-3/spark-assembly-1.2.0-SNAPSHOT-hadoop2.3.0-cdh5.0.2.jar:/etc/hadoop/conf 
-Xms512m -Xmx512m org.apache.spark.deploy.SparkSubmit --class 
org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 --master yarn 
--jars reporting.jar spark-internal


The query I am running is a simple count(*): select count(*) from Xyz 
where date_prefix=20141031 and pretty sure it's submitting a map 
reduce job based on the spark logs:


TakesRest=false

Total jobs = 1

Launching Job 1 out of 1

Number of reduce tasks determined at compile time: 1

In order to change the average load for a reducer (in bytes):

  set hive.exec.reducers.bytes.per.reducer=number

In order to limit the maximum number of reducers:

  set hive.exec.reducers.max=number

In order to set a constant number of reducers:

  set mapreduce.job.reduces=number

14/11/11 16:23:17 INFO ql.Context: New scratch dir is 
hdfs://fdsfdsfsdfsdf:9000/tmp/hive-ubuntu/hive_2014-11-11_16-23-17_333_5669798325805509526-2


Starting Job = job_1414084656759_0142, Tracking URL = 
http://xxx:8100/proxy/application_1414084656759_0142/ 
http://t.signauxdix.com/e1t/c/5/f18dQhb0S7lC8dDMPbW2n0x6l2B9nMJW7t5XYg2zGvG-W8rBGxP1p8d-TW64zBkx56dS1Dd58vwq02?t=http%3A%2F%2Fec2-54-83-34-89.compute-1.amazonaws.com%3A8100%2Fproxy%2Fapplication_1414084656759_0142%2Fsi=6222577584832512pi=626685a9-b628-43cc-91a1-93636171ce77


Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill 
job_1414084656759_0142



On Mon, Nov 10, 2014 at 9:59 PM, Cheng Lian lian.cs@gmail.com 
mailto:lian.cs@gmail.com wrote:


Hey Sadhan,

I really don't think this is Spark log... Unlike Shark, Spark SQL
doesn't even provide a Hive mode to let you execute queries
against Hive. Would you please check whether there is an existing
HiveServer2 running there? Spark SQL HiveThriftServer2 is just a
Spark port of HiveServer2, and they share the same default
listening port. I guess the Thrift server didn't start
successfully because the HiveServer2 occupied the port, and your
Beeline session was probably linked against HiveServer2.

Cheng


On 11/11/14 8:29 AM, Sadhan Sood wrote:

I was testing out the spark thrift jdbc server by running a
simple query in the beeline client. The spark itself is running
on a yarn cluster.

However, when I run a query in beeline - I see no running jobs
in the spark UI(completely empty) and the yarn UI seem to
indicate that the submitted query is being run as a map reduce
job. This is probably also being indicated from the spark logs
but I am not completely sure:

2014-11-11 00:19:00,492 INFO  ql.Context
(Context.java:getMRScratchDir(267)) - New scratch dir is

hdfs://:9000/tmp/hive-ubuntu/hive_2014-11-11_00-19-00_367_3847629323646885865-1

2014-11-11 00:19:00,877 INFO  ql.Context
(Context.java:getMRScratchDir(267)) - New scratch dir is

hdfs://:9000/tmp/hive-ubuntu/hive_2014-11-11_00-19-00_367_3847629323646885865-2

2014-11-11 00:19:04,152 INFO  ql.Context
(Context.java:getMRScratchDir(267)) - New scratch dir is

hdfs://:9000/tmp/hive-ubuntu/hive_2014-11-11_00-19-00_367_3847629323646885865-2

2014-11-11 00:19:04,425 INFO Configuration.deprecation
(Configuration.java:warnOnceIfDeprecated(1009)) -
mapred.submit.replication is deprecated. Instead, use
mapreduce.client.submit.file.replication

2014-11-11 00:19:04,516 INFO client.RMProxy
(RMProxy.java:createRMProxy(92)) - Connecting to ResourceManager
at :8032

2014-11-11 00:19:04,607 INFO client.RMProxy
(RMProxy.java:createRMProxy(92)) - Connecting to ResourceManager
at :8032

2014-11-11 00:19:04,639 WARN mapreduce.JobSubmitter
(JobSubmitter.java:copyAndConfigureFiles(150)) - Hadoop
command-line option parsing not performed. Implement the Tool
interface and execute your application with ToolRunner to remedy this

2014-11-11 00:00:08,806 INFO  input.FileInputFormat
(FileInputFormat.java:listStatus(287)) - Total input paths to
process : 14912

2014-11-11 00:00:08,864 INFO  lzo.GPLNativeCodeLoader
(GPLNativeCodeLoader.java:clinit(34)) - Loaded native gpl library

2014-11-11 00:00:08,866 INFO  lzo.LzoCodec
(LzoCodec.java:clinit(76)) - Successfully loaded  initialized
native-lzo library [hadoop-lzo rev
8e266e052e423af592871e2dfe09d54c03f6a0e8]

2014-11-11

Re: Cache sparkSql data without uncompressing it in memory

2014-11-13 Thread Cheng Lian
No, the columnar buffer is built up in small batches; the batch
size is controlled by the |spark.sql.inMemoryColumnarStorage.batchSize|
property. The default value for this in master and branch-1.2 is 10,000
rows per batch.
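
For example, a minimal sketch against a HiveContext (the table name "logs" is a
placeholder, and sc is assumed to be an existing SparkContext):

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    // Columnar compression is on by default in master/branch-1.2; batchSize tunes
    // how many rows go into each compressed column batch.
    hiveContext.setConf("spark.sql.inMemoryColumnarStorage.compressed", "true")
    hiveContext.setConf("spark.sql.inMemoryColumnarStorage.batchSize", "10000")

    hiveContext.cacheTable("logs")                          // lazy
    hiveContext.sql("SELECT COUNT(*) FROM logs").collect()  // first action builds the columnar cache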


On 11/14/14 1:27 AM, Sadhan Sood wrote:

Thanks Cheng. Just one more question: does that mean that we still
need enough memory in the cluster to uncompress the data before it can
be compressed again, or does it just read the raw data as is?


On Wed, Nov 12, 2014 at 10:05 PM, Cheng Lian lian.cs@gmail.com 
mailto:lian.cs@gmail.com wrote:


Currently there’s no way to cache the compressed sequence file
directly. Spark SQL uses in-memory columnar format while caching
table rows, so we must read all the raw data and convert them into
columnar format. However, you can enable in-memory columnar
compression by setting
|spark.sql.inMemoryColumnarStorage.compressed| to |true|. This
property is already set to true by default in master branch and
branch-1.2.

On 11/13/14 7:16 AM, Sadhan Sood wrote:


We noticed while caching data from our hive tables which contain
data in compressed sequence file format that it gets uncompressed
in memory when getting cached. Is there a way to turn this off
and cache the compressed data as is ?






Re: [VOTE] Release Apache Spark 1.1.1 (RC1)

2014-11-14 Thread Cheng Lian

+1

Tested HiveThriftServer2 against Hive 0.12.0 on Mac OS X. Known issues 
are fixed. Hive version inspection works as expected.


On 11/15/14 8:25 AM, Zach Fry wrote:

+0

I expect to start testing on Monday but won't have enough results to change
my vote from +0
until Monday night or Tuesday morning.

Thanks,
Zach



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-1-1-RC1-tp9311p9370.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org





-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: How spark and hive integrate in long term?

2014-11-22 Thread Cheng Lian

Hey Zhan,

This is a great question. We are also seeking a stable API/protocol
that works with multiple Hive versions (esp. 0.12+). SPARK-4114
https://issues.apache.org/jira/browse/SPARK-4114 was opened for this.
I did some research into HCatalog recently, but I must confess that I’m
not an expert on HCatalog; I actually spent only one day exploring it. So
please don’t hesitate to correct me if I am wrong about the conclusions
I made below.


First, although HCatalog API is more pleasant to work with, it’s 
unfortunately feature incomplete. It only provides a subset of most 
commonly used operations. For example, |HCatCreateTableDesc| maps only a 
subset of |CreateTableDesc|, properties like |storeAsSubDirectories|, 
|skewedColNames| and |skewedColValues| are missing. It’s also impossible 
to alter table properties via HCatalog API (Spark SQL uses this to 
implement the |ANALYZE| command). The |hcat| CLI tool provides all those 
features missing in HCatalog API via raw Metastore API, and is 
structurally similar to the old Hive CLI.


Second, HCatalog API itself doesn’t ensure compatibility, it’s the 
Thrift protocol that matters. HCatalog is directly built upon raw 
Metastore API, and talks the same Metastore Thrift protocol. The problem 
we encountered in Spark SQL is that, usually we deploy Spark SQL Hive 
support with embedded mode (for testing) or local mode Metastore, and 
this makes us suffer from things like Metastore database schema changes. 
If Hive Metastore Thrift protocol is guaranteed to be downward 
compatible, then hopefully we can resort to remote mode Metastore and 
always depend on most recent Hive APIs. I had a glance of Thrift 
protocol version handling code in Hive, it seems that downward 
compatibility is not an issue. However I didn’t find any official 
documents about Thrift protocol compatibility.


That said, in the future, hopefully we can depend only on the most recent
Hive dependencies and remove the Hive shim layer introduced in branch
1.2. Users who run exactly the same version of Hive as Spark SQL
can use either a remote or a local/embedded Metastore, while users
who want to interact with existing legacy Hive clusters have to
set up a remote Metastore and let the Thrift protocol handle
compatibility.


— Cheng

On 11/22/14 6:51 AM, Zhan Zhang wrote:


Now Spark and hive integration is a very nice feature. But I am wondering
what the long term roadmap is for spark integration with hive. Both of these
two projects are undergoing fast improvement and changes. Currently, my
understanding is that spark hive sql part relies on hive meta store and
basic parser to operate, and the thrift-server intercept hive query and
replace it with its own engine.

With every release of hive, there need a significant effort on spark part to
support it.

For the metastore part, we may possibly replace it with hcatalog. But given
the dependency of other parts on hive, e.g., metastore, thriftserver,
hcatlog may not be able to help much.

Does anyone have any insight or idea in mind?

Thanks.

Zhan Zhang



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/How-spark-and-hive-integrate-in-long-term-tp9482.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org

.




Re: How spark and hive integrate in long term?

2014-11-22 Thread Cheng Lian
I should emphasize that this is still a quick and rough conclusion; I will
investigate this in more detail after the 1.2.0 release. Anyway, we would really
like to make Hive support in Spark SQL as smooth and clean as
possible for both developers and end users.


On 11/22/14 11:05 PM, Cheng Lian wrote:


Hey Zhan,

This is a great question. We are also seeking for a stable 
API/protocol that works with multiple Hive versions (esp. 0.12+). 
SPARK-4114 https://issues.apache.org/jira/browse/SPARK-4114 was 
opened for this. Did some research into HCatalog recently, but I must 
confess that I’m not an expert on HCatalog, actually spent only 1 day 
on exploring it. So please don’t hesitate to correct me if I was wrong 
about the conclusions I made below.


First, although HCatalog API is more pleasant to work with, it’s 
unfortunately feature incomplete. It only provides a subset of most 
commonly used operations. For example, |HCatCreateTableDesc| maps only 
a subset of |CreateTableDesc|, properties like 
|storeAsSubDirectories|, |skewedColNames| and |skewedColValues| are 
missing. It’s also impossible to alter table properties via HCatalog 
API (Spark SQL uses this to implement the |ANALYZE| command). The 
|hcat| CLI tool provides all those features missing in HCatalog API 
via raw Metastore API, and is structurally similar to the old Hive CLI.


Second, HCatalog API itself doesn’t ensure compatibility, it’s the 
Thrift protocol that matters. HCatalog is directly built upon raw 
Metastore API, and talks the same Metastore Thrift protocol. The 
problem we encountered in Spark SQL is that, usually we deploy Spark 
SQL Hive support with embedded mode (for testing) or local mode 
Metastore, and this makes us suffer from things like Metastore 
database schema changes. If Hive Metastore Thrift protocol is 
guaranteed to be downward compatible, then hopefully we can resort to 
remote mode Metastore and always depend on most recent Hive APIs. I 
had a glance of Thrift protocol version handling code in Hive, it 
seems that downward compatibility is not an issue. However I didn’t 
find any official documents about Thrift protocol compatibility.


That said, in the future, hopefully we can only depend on most recent 
Hive dependencies and remove the Hive shim layer introduced in branch 
1.2. For users who use exactly the same version of Hive as Spark SQL, 
they can use either remote or local/embedded Metastore; while for 
users who want to interact with existing legacy Hive clusters, they 
have to setup a remote Metastore and let the Thrift protocol to handle 
compatibility.


— Cheng

On 11/22/14 6:51 AM, Zhan Zhang wrote:


Now Spark and hive integration is a very nice feature. But I am wondering
what the long term roadmap is for spark integration with hive. Both of these
two projects are undergoing fast improvement and changes. Currently, my
understanding is that spark hive sql part relies on hive meta store and
basic parser to operate, and the thrift-server intercept hive query and
replace it with its own engine.

With every release of Hive, a significant effort is needed on the Spark side to
support it.

For the metastore part, we may possibly replace it with HCatalog. But given
the dependency of other parts on Hive, e.g., metastore, thriftserver,
HCatalog may not be able to help much.

Does anyone have any insight or idea in mind?

Thanks.

Zhan Zhang



--
View this message in 
context:http://apache-spark-developers-list.1001551.n3.nabble.com/How-spark-and-hive-integrate-in-long-term-tp9482.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail:dev-unsubscr...@spark.apache.org
For additional commands, e-mail:dev-h...@spark.apache.org





Re: Get size of rdd in memory

2015-02-02 Thread Cheng Lian
It's already fixed in the master branch. Sorry that we forgot to update 
this before releasing 1.2.0 and caused you trouble...


Cheng

On 2/2/15 2:03 PM, ankits wrote:

Great, thank you very much. I was confused because this is in the docs:

https://spark.apache.org/docs/1.2.0/sql-programming-guide.html, and on the
branch-1.2 branch,
https://github.com/apache/spark/blob/branch-1.2/docs/sql-programming-guide.md

"Note that if you call schemaRDD.cache() rather than
sqlContext.cacheTable(...), tables will not be cached using the in-memory
columnar format, and therefore sqlContext.cacheTable(...) is strongly
recommended for this use case."

If this is no longer accurate, i could make a PR to remove it.



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Get-size-of-rdd-in-memory-tp10366p10392.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org





-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Get size of rdd in memory

2015-02-02 Thread Cheng Lian
Actually |SchemaRDD.cache()| behaves exactly the same as |cacheTable| 
since Spark 1.2.0. The reason why your web UI didn’t show you the cached 
table is that both |cacheTable| and |sql("SELECT ...")| are lazy :-) 
Simply add a |.collect()| after the |sql(...)| call.
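
For example, a minimal sketch using the sqc and "test" names from the 
snippet quoted below:

sqc.cacheTable("test")
// Both cacheTable and sql(...) are lazy; the first action materializes the
// in-memory columnar buffers and makes the table show up in the storage tab
sqc.sql("SELECT COUNT(*) FROM test").collect()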


Cheng

On 2/2/15 12:23 PM, ankits wrote:


Thanks for your response. So AFAICT

calling parallelize(1 to 1024).map(i => KV(i,
i.toString)).toSchemaRDD.cache().count() will allow me to see the size of
the schemardd in memory

and parallelize(1 to 1024).map(i => KV(i, i.toString)).cache().count() will
show me the size of a regular rdd.

But this will not show us the size when using cacheTable() right? Like if i
call

parallelize(1 to 1024).map(i => KV(i,
i.toString)).toSchemaRDD.registerTempTable("test")
sqc.cacheTable("test")
sqc.sql("SELECT COUNT(*) FROM test")

the web UI does not show us the size of the cached table.





--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Get-size-of-rdd-in-memory-tp10366p10388.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org





Re: Is there any way to support multiple users executing SQL on thrift server?

2015-01-20 Thread Cheng Lian

Hey Yi,

I'm quite unfamiliar with Hadoop/HDFS auth mechanisms for now, but would 
like to investigate this issue later. Would you please open a JIRA for 
it? Thanks!


Cheng

On 1/19/15 1:00 AM, Yi Tian wrote:


Is there any way to support multiple users executing SQL on one thrift 
server?


I think there are some problems for spark 1.2.0, for example:

 1. Start thrift server with user A
 2. Connect to thrift server via beeline with user B
 3. Execute “insert into table dest select … from table src”

then we found these items on hdfs:

|drwxr-xr-x   - B supergroup  0 2015-01-16 16:42 
/tmp/hadoop/hive_2015-01-16_16-42-48_923_1860943684064616152-3/-ext-1
drwxr-xr-x   - B supergroup  0 2015-01-16 16:42 
/tmp/hadoop/hive_2015-01-16_16-42-48_923_1860943684064616152-3/-ext-1/_temporary
drwxr-xr-x   - B supergroup  0 2015-01-16 16:42 
/tmp/hadoop/hive_2015-01-16_16-42-48_923_1860943684064616152-3/-ext-1/_temporary/0
drwxr-xr-x   - A supergroup  0 2015-01-16 16:42 
/tmp/hadoop/hive_2015-01-16_16-42-48_923_1860943684064616152-3/-ext-1/_temporary/0/_temporary
drwxr-xr-x   - A supergroup  0 2015-01-16 16:42 
/tmp/hadoop/hive_2015-01-16_16-42-48_923_1860943684064616152-3/-ext-1/_temporary/0/task_201501161642_0022_m_00
-rw-r--r--   3 A supergroup   2671 2015-01-16 16:42 
/tmp/hadoop/hive_2015-01-16_16-42-48_923_1860943684064616152-3/-ext-1/_temporary/0/task_201501161642_0022_m_00/part-0
|

You can see that all the temporary paths created on the driver side (thrift 
server side) are owned by user B (which is what we expected).


But all the output data created on the executor side is owned by user A 
(which is NOT what we expected).
The wrong owner of the output data causes an 
|org.apache.hadoop.security.AccessControlException| when the driver 
side moves the output data into the |dest| table.


Does anyone know how to resolve this problem?





Re: Spark SQL, Hive Parquet data types

2015-02-20 Thread Cheng Lian
For the second question, we do plan to support Hive 0.14, possibly in 
Spark 1.4.0.


For the first question:

1. In Spark 1.2.0, the Parquet support code doesn’t support timestamp
   type, so you can’t.
2. In Spark 1.3.0, timestamp support was added, also Spark SQL uses its
   own Parquet support to handle both read path and write path when
   dealing with Parquet tables declared in Hive metastore, as long as
   you’re not writing to a partitioned table. So yes, you can.

The Parquet version bundled with Spark 1.3.0 is 1.6.0rc3, which supports 
timestamp type natively. However, the Parquet versions bundled with Hive 
0.13.1 and Hive 0.14.0 are 1.3.2 and 1.5.0 respectively, and neither of them 
supports timestamp type. Hive 0.14.0 “supports” reading/writing timestamps 
from/to Parquet by converting them from/to Parquet binaries. 
Similarly, Impala converts timestamps into Parquet int96. This can be 
annoying for Spark SQL, because we must interpret Parquet files in 
different ways according to the original writer of the file. As Parquet 
matures, recent Parquet versions support more and more standard data 
types. Mappings from complex nested types to Parquet types are also 
being standardized (see 
https://github.com/apache/incubator-parquet-mr/pull/83).
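
For example, a rough sketch of point 2 above, using Spark 1.3.0's own 
Parquet support (the path and the Event case class are made up for 
illustration only):

import java.sql.Timestamp
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// TimestampType should round-trip through Spark SQL's native Parquet support
case class Event(id: Int, ts: Timestamp)
val events = sc.parallelize(Seq(Event(1, new Timestamp(System.currentTimeMillis())))).toDF()
events.saveAsParquetFile("hdfs:///tmp/events.parquet")

val loaded = sqlContext.parquetFile("hdfs:///tmp/events.parquet")
loaded.printSchema()  // ts should come back as a timestamp column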


On 2/20/15 6:50 AM, The Watcher wrote:


Still trying to get my head around Spark SQL  Hive.

1) Let's assume I *only* use Spark SQL to create and insert data into HIVE
tables, declared in a Hive meta-store.

Does it matter at all if Hive supports the data types I need with Parquet,
or is all that matters what Catalyst  spark's parquet relation support ?

Case in point : timestamps  Parquet
* Parquet now supports them as per
https://github.com/Parquet/parquet-mr/issues/218
* Hive only supports them in 0.14
So would I be able to read/write timestamps natively in Spark 1.2 ? Spark
1.3 ?

I have found this thread
http://apache-spark-user-list.1001560.n3.nabble.com/timestamp-not-implemented-yet-td15414.html
which seems to indicate that the data types supported by Hive would matter
to Spark SQL.
If so, why is that ? Doesn't the read path go through Spark SQL to read the
parquet file ?

2) Is there planned support for Hive 0.14 ?

Thanks




Re: Get size of rdd in memory

2015-01-30 Thread Cheng Lian
Here is a toy |spark-shell| session snippet that can show the memory 
consumption difference:


import org.apache.spark.sql.SQLContext
import sc._

val sqlContext = new SQLContext(sc)
import sqlContext._

setConf("spark.sql.shuffle.partitions", "1")

case class KV(key: Int, value: String)

parallelize(1 to 1024).map(i => KV(i, i.toString)).toSchemaRDD.cache().count()
parallelize(1 to 1024).map(i => KV(i, i.toString)).cache().count()

You may see the result from the storage page of the web UI. It suggests 
the in-memory columnar version uses 11.6KB while the raw RDD version 
uses 76.6KB on my machine.


Not quite sure how to do the comparison programmatically. You can track 
the data source of the “Size in Memory” field shown in the web UI 
storage tab.
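
One possible programmatic route is the developer API on SparkContext; a 
small sketch (no stability guarantees, since it's a developer API):

// After running the two cache().count() lines above,
// sc.getRDDStorageInfo returns one RDDInfo per persisted RDD
sc.getRDDStorageInfo.foreach { info =>
  println(s"${info.name}: ${info.memSize} bytes in memory, ${info.numCachedPartitions} cached partitions")
}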


Cheng

On 1/30/15 6:15 PM, ankits wrote:


Hi,

I want to benchmark the memory savings by using the in-memory columnar
storage for schemardds (using cacheTable) vs caching the SchemaRDD directly.
It would be really helpful to be able to query this from the spark-shell or
jobs directly. Could a dev point me to the way to do this? From what I
understand i will need a reference to the block manager, or something like
RDDInfo.fromRdd(rdd).memSize.

I could use reflection or whatever to override the private access modifiers.



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Get-size-of-rdd-in-memory-tp10366.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org





Re: Data source API | Support for dynamic schema

2015-01-28 Thread Cheng Lian

Hi Aniket,

In general, the schema of all rows in a single table must be the same. This 
is a basic assumption made by Spark SQL. Schema union does make sense, 
and we're planning to support this for Parquet. But as you've mentioned, 
it doesn't help if the types of different versions of a column differ from 
each other. Also, you need to reload the data source table after schema 
changes happen.


Cheng

On 1/28/15 2:12 AM, Aniket Bhatnagar wrote:

I saw the talk on Spark data sources and looking at the interfaces, it
seems that the schema needs to be provided upfront. This works for many
data sources but I have a situation in which I would need to integrate a
system that supports schema evolutions by allowing users to change schema
without affecting existing rows. Basically, each row contains a schema hint
(id and version) and this allows developers to evolve schema over time and
perform migration at will. Since the schema needs to be specified upfront
in the data source API, one possible way would be to build a union of all
schema versions and handle populating row values appropriately. This works
in case columns have been added or deleted in the schema but doesn't work
if types have changed. I was wondering if it is possible to change the API
  to provide schema for each row instead of expecting data source to provide
schema upfront?

Thanks,
Aniket




-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [SPARK-5100][SQL] Spark Thrift server monitor page

2015-01-06 Thread Cheng Lian
Talked with Yi offline. Personally I think this feature is pretty 
useful, the design makes sense, and he's already got a running 
prototype.


Yi, would you mind opening a PR for this? Thanks!

Cheng

On 1/6/15 5:25 PM, Yi Tian wrote:

Hi, all

I have created a JIRA ticket about adding a monitor page for the Thrift 
server:


https://issues.apache.org/jira/browse/SPARK-5100

Could anyone review the design doc and give some advice?

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org





-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: SparkSQL 1.3.0 cannot read parquet files from different file system

2015-03-16 Thread Cheng Lian
Oh sorry, I misread your question. I thought you were trying something 
like |parquetFile(“s3n://file1,hdfs://file2”)|. Yeah, it’s a valid bug. 
Thanks for opening the JIRA ticket and the PR!



Cheng

On 3/16/15 6:39 PM, Cheng Lian wrote:


Hi Pei-Lun,

We intentionally disallowed passing multiple comma-separated paths in 
1.3.0. One of the reasons is that users reported that this fails when a 
file path contains an actual comma in it. In your case, you may do 
something like this:


val s3nDF = parquetFile("s3n://...")
val hdfsDF = parquetFile("hdfs://...")
val finalDF = s3nDF.unionAll(hdfsDF)

Cheng

On 3/16/15 4:03 PM, Pei-Lun Lee wrote:


Hi,

I am using Spark 1.3.0, where I cannot load parquet files from more than
one file system, say one s3n://... and another hdfs://..., which worked in
older version, or if I set spark.sql.parquet.useDataSourceApi=false in 1.3.

One way to fix this is instead of get a single FileSystem from default
configuration in ParquetRelation2, call Path.getFileSystem for each path.

Here's the JIRA link and pull request:
https://issues.apache.org/jira/browse/SPARK-6351
https://github.com/apache/spark/pull/5039

Thanks,
--
Pei-Lun




Re: SparkSQL 1.3.0 cannot read parquet files from different file system

2015-03-16 Thread Cheng Lian

Hi Pei-Lun,

We intentionally disallowed passing multiple comma-separated paths in 
1.3.0. One of the reasons is that users reported that this fails when a 
file path contains an actual comma in it. In your case, you may do 
something like this:


val s3nDF = parquetFile("s3n://...")
val hdfsDF = parquetFile("hdfs://...")
val finalDF = s3nDF.unionAll(hdfsDF)

Cheng

On 3/16/15 4:03 PM, Pei-Lun Lee wrote:


Hi,

I am using Spark 1.3.0, where I cannot load parquet files from more than
one file system, say one s3n://... and another hdfs://..., which worked in
older version, or if I set spark.sql.parquet.useDataSourceApi=false in 1.3.

One way to fix this is instead of get a single FileSystem from default
configuration in ParquetRelation2, call Path.getFileSystem for each path.

Here's the JIRA link and pull request:
https://issues.apache.org/jira/browse/SPARK-6351
https://github.com/apache/spark/pull/5039

Thanks,
--
Pei-Lun




Wrong version on the Spark documentation page

2015-03-15 Thread Cheng Lian

It's still marked as 1.2.1 here: http://spark.apache.org/docs/latest/

But this page is updated (1.3.0): 
http://spark.apache.org/docs/latest/index.html


Cheng

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Understanding shuffle file name conflicts

2015-03-25 Thread Cheng Lian
Ah, I see where I'm wrong here. What are reused here are the shuffle map 
output files themselves, rather than the file paths. No new shuffle map 
output files are generated for the 2nd job. Thanks! Really need to walk 
through Spark core code again :)


Cheng

On 3/25/15 9:31 PM, Shao, Saisai wrote:

Hi Cheng,

I think your scenario is acceptable for Spark's shuffle mechanism and will not 
cause shuffle file name conflicts.

From my understanding, the code snippet you mentioned is the same RDD 
graph just run twice. These two jobs will generate 3 stages: a map stage and 
a collect stage for the first job, and only a collect stage for the second job (the map 
stage is the same as in the previous job). So these two jobs will only generate one 
copy of shuffle files, in the first job, and fetch the shuffle data twice, once for 
each job. So name conflicts will not occur, since these two jobs rely on 
the same ShuffledRDD.

I think only shuffle writes, which generate shuffle files, have a chance of 
name conflicts; multiple shuffle reads are acceptable, as the code 
snippet shows.

Thanks
Jerry



-Original Message-
From: Cheng Lian [mailto:lian.cs@gmail.com]
Sent: Wednesday, March 25, 2015 7:40 PM
To: Saisai Shao; Kannan Rajah
Cc: dev@spark.apache.org
Subject: Re: Understanding shuffle file name conflicts

Hi Jerry  Josh

It has been a while since the last time I looked into Spark core shuffle code, 
maybe I’m wrong here. But the shuffle ID is created along with 
ShuffleDependency, which is part of the RDD DAG. So if we submit multiple jobs 
over the same RDD DAG, I think the shuffle IDs in these jobs should duplicate. 
For example:

val dag = sc.parallelize(Array(1, 2, 3)).map(i => i -> i).reduceByKey(_ + _)
dag.collect()
dag.collect()

  From the debug log output, I did see duplicated shuffle IDs in both jobs. 
Something like this:

|# Job 1
15/03/25 19:26:34 DEBUG BlockStoreShuffleFetcher: Fetching outputs for shuffle 
0, reduce 2

# Job 2
15/03/25 19:26:36 DEBUG BlockStoreShuffleFetcher: Fetching outputs for shuffle 
0, reduce 5
|

So it’s also possible that some shuffle output files get reused in different 
jobs. But Kannan, did you submit separate jobs over the same RDD DAG as I did 
above? If not, I’d agree with Jerry and Josh.

(Did I miss something here?)

Cheng

On 3/25/15 10:35 AM, Saisai Shao wrote:


Hi Kannan,

As far as I know, the shuffle ID in ShuffleDependency is increased, so
even if you run the same job twice, the shuffle dependency as well as the
shuffle ID is different, and the shuffle file name, which is composed of
(shuffleId + mapId + reduceId), will be different, so there's no name
conflict even in the same directory as far as I know.

Thanks
Jerry


2015-03-25 1:56 GMT+08:00 Kannan Rajah kra...@maprtech.com:


I am working on SPARK-1529. I ran into an issue with my change, where
the same shuffle file was being reused across 2 jobs. Please note
this only happens when I use a hard coded location to use for shuffle
files, say /tmp. It does not happen with normal code path that uses
DiskBlockManager to pick different directories for each run. So I
want to understand how DiskBlockManager guarantees that such a conflict will 
never happen.

Let's say the shuffle block id has a value of shuffle_0_0_0. So the
data file name is shuffle_0_0_0.data and index file name is shuffle_0_0_0.index.
If I run a spark job twice, one after another, these files get
created under different directories because of the hashing logic in
DiskBlockManager. But the hash is based off the file name, so how are
we sure that there won't be a conflict ever?

--
Kannan





-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Spark SQL - Long running job

2015-02-23 Thread Cheng Lian
I meant using |saveAsParquetFile|. As for the partition number, you can 
always control it with the |spark.sql.shuffle.partitions| property.
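
Roughly something like this (the paths, the table names and the query are 
only placeholders, just a sketch):

val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// Control the number of shuffle (and thus output) partitions
sqlContext.setConf("spark.sql.shuffle.partitions", "200")

// Persist the expensive intermediate result once
val processed = sqlContext.sql("SELECT key, COUNT(*) AS cnt FROM raw_events GROUP BY key")
processed.saveAsParquetFile("hdfs:///warehouse/processed.parquet")

// After a driver restart, reload and cache instead of recomputing
val reloaded = sqlContext.parquetFile("hdfs:///warehouse/processed.parquet")
reloaded.registerTempTable("processed")
sqlContext.cacheTable("processed")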


Cheng

On 2/23/15 1:38 PM, nitin wrote:


I believe calling processedSchemaRdd.persist(DISK) and
processedSchemaRdd.checkpoint() only persists the data, and I will lose all the
RDD metadata, so when I restart my driver, that data is kind of useless for
me (correct me if I am wrong).

I thought of doing processedSchemaRdd.saveAsParquetFile (hdfs file system),
but I fear that in case my HDFS block size < partition file size, I will
get more partitions when reading instead of the original schemaRdd.



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-SQL-Long-running-job-tp10717p10727.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org





Re: Spark SQL, Hive Parquet data types

2015-02-23 Thread Cheng Lian
Yes, recently we improved ParquetRelation2 quite a bit. Spark SQL uses 
its own Parquet support to read partitioned Parquet tables declared in 
Hive metastore. Only writing to partitioned tables is not covered yet. 
These improvements will be included in Spark 1.3.0.


Just created SPARK-5948 to track writing to partitioned Parquet tables.

Cheng

On 2/20/15 10:58 PM, The Watcher wrote:


1. In Spark 1.3.0, timestamp support was added, also Spark SQL uses
its own Parquet support to handle both read path and write path when
dealing with Parquet tables declared in Hive metastore, as long as you’re
not writing to a partitioned table. So yes, you can.

Ah, I had missed the part about being partitioned or not. Is this related

to the work being done on ParquetRelation2 ?

We will indeed write to a partitioned table : do neither the read nor the
write path go through Spark SQL's parquet support in that case ? Is there a
JIRA/PR I can monitor to see when this would change ?

Thanks




-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Spark SQL - Long running job

2015-02-22 Thread Cheng Lian
How about persisting the computed result table first before caching it? 
So that you only need to cache the result table after restarting your 
service without recomputing it. Somewhat like checkpointing.


Cheng

On 2/22/15 12:55 AM, nitin wrote:

Hi All,

I intend to build a long running spark application which fetches data/tuples
from parquet, does some processing(time consuming) and then cache the
processed table (InMemoryColumnarTableScan). My use case is good retrieval
time for SQL query(benefits of Spark SQL optimizer) and data
compression(in-built in in-memory caching). Now the problem is that if my
driver goes down, I will have to fetch the data again for all the tables and
compute it and cache which is time consuming.

Is it possible to persist processed/cached RDDs on disk such that my system
up time is less when restarted after failure/going down?

On a side note, the data processing contains a shuffle step which creates
huge temporary shuffle files on local disk in temp folder and as per current
logic, shuffle files don't get deleted for running executors. This is
leading to my local disk getting filled up quickly and going out of space as
its a long running spark job. (running spark in yarn-client mode btw).

Thanks
-Nitin



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-SQL-Long-running-job-tp10717.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org





-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Spark SQL, Hive Parquet data types

2015-02-23 Thread Cheng Lian

Ah, sorry for not being clear enough.

So now in Spark 1.3.0, we have two Parquet support implementations, the 
old one is tightly coupled with the Spark SQL framework, while the new 
one is based on data sources API. In both versions, we try to intercept 
operations over Parquet tables registered in metastore when possible for 
better performance (mainly filter push-down optimization and extra 
metadata for more accurate schema inference). The distinctions are:


1.

   For old version (set |spark.sql.parquet.useDataSourceApi| to |false|)

   When |spark.sql.hive.convertMetastoreParquet| is set to |true|, we
   “hijack” the read path. Namely whenever you query a Parquet table
   registered in metastore, we’re using our own Parquet implementation.

   For write path, we fallback to default Hive SerDe implementation
   (namely Spark SQL’s |InsertIntoHiveTable| operator).

2.

   For new data source version (set
   |spark.sql.parquet.useDataSourceApi| to |true|, which is the default
   value in master and branch-1.3)

   When |spark.sql.hive.convertMetastoreParquet| is set to |true|, we
   “hijack” both read and write path, but if you’re writing to a
   partitioned table, we still fallback to default Hive SerDe
   implementation.

For Spark 1.2.0, only 1 applies. Spark 1.2.0 also has a Parquet data 
source, but it’s not enabled if you’re not using data sources API 
specific DDL (|CREATE TEMPORARY TABLE table-name USING data-source|).
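
For reference, the relevant knobs and the data source specific DDL look 
roughly like this (the table name and path are just placeholders, and 
sqlContext is assumed to be a HiveContext when the Hive-related flag is used):

// Toggle the two code paths discussed above
sqlContext.setConf("spark.sql.parquet.useDataSourceApi", "true")
sqlContext.setConf("spark.sql.hive.convertMetastoreParquet", "true")

// Data sources API specific DDL -- the only way the Parquet data source
// kicks in on Spark 1.2.0
sqlContext.sql("""
  CREATE TEMPORARY TABLE parquet_events
  USING org.apache.spark.sql.parquet
  OPTIONS (path 'hdfs:///data/events.parquet')
""")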


Cheng

On 2/23/15 10:05 PM, The Watcher wrote:


Yes, recently we improved ParquetRelation2 quite a bit. Spark SQL uses its
own Parquet support to read partitioned Parquet tables declared in Hive
metastore. Only writing to partitioned tables is not covered yet. These
improvements will be included in Spark 1.3.0.

Just created SPARK-5948 to track writing to partitioned Parquet tables.


Ok, this is still a little confusing.

Since I am able in 1.2.0 to write to a partitioned Hive table by registering my
SchemaRDD and calling INSERT INTO the Hive partitioned table SELECT from the
registered one, what is the write path in this case? Full Hive with a
SparkSQL-Hive bridge?
If that were the case, why wouldn't SKEWED ON be honored (see another
thread I opened)?

Thanks




Re: [VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-23 Thread Cheng Lian
My bad, I had once fixed all Hive 12 test failures in PR #4107, but didn't 
get time to get it merged.


Considering the release is close, I can cherry-pick those Hive 12 fixes 
from #4107 and open a more surgical PR soon.


Cheng

On 2/24/15 4:18 AM, Michael Armbrust wrote:

On Sun, Feb 22, 2015 at 11:20 PM, Mark Hamstra m...@clearstorydata.com
wrote:


So what are we expecting of Hive 0.12.0 builds with this RC?  I know not
every combination of Hadoop and Hive versions, etc., can be supported, but
even an example build from the Building Spark page isn't looking too good
to me.


I would definitely expect this to build and we do actually test that for
each PR.  We don't yet run the tests for both versions of Hive and thus
unfortunately these do get out of sync.  Usually these are just problems
diff-ing golden output or cases where we have added a test that uses a
feature not available in hive 12.

Have you seen problems with using Hive 12 outside of these test failures?




-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: number of partitions for hive schemaRDD

2015-02-26 Thread Cheng Lian

Hi Masaki,

I guess what you saw is the partition number of the last stage, which 
must be 1 to perform the global phase of LIMIT. To tune partition number 
of normal shuffles like joins, you may resort to 
spark.sql.shuffle.partitions.


Cheng

On 2/26/15 5:31 PM, masaki rikitoku wrote:

Hi all

now, I'm trying the SparkSQL with hivecontext.

when I execute the hql like the following.

---

val ctx = new org.apache.spark.sql.hive.HiveContext(sc)
import ctx._

val queries = ctx.hql("select keyword from queries where dt = '2015-02-01' limit 1000")

---

It seems that the number of partitions of the query is set to 1.

Is this the specified behavior for SchemaRDD, SparkSQL, HiveContext?

Are there any means to set the number of partitions to an arbitrary value,
other than an explicit repartition?


Masaki Rikitoku

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org





-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: renaming SchemaRDD - DataFrame

2015-01-29 Thread Cheng Lian
Yes, when a DataFrame is cached in memory, it's stored in an efficient 
columnar format. And you can also easily persist it on disk using 
Parquet, which is also columnar.


Cheng

On 1/29/15 1:24 PM, Koert Kuipers wrote:

to me the word DataFrame does come with certain expectations. one of them
is that the data is stored columnar. in R data.frame internally uses a list
of sequences i think, but since lists can have labels its more like a
SortedMap[String, Array[_]]. this makes certain operations very cheap (such
as adding a column).

in Spark the closest thing would be a data structure where per Partition
the data is also stored columnar. does spark SQL already use something like
that? Evan mentioned Spark SQL columnar compression, which sounds like
it. where can i find that?

thanks

On Thu, Jan 29, 2015 at 2:32 PM, Evan Chan velvia.git...@gmail.com wrote:


+1 having proper NA support is much cleaner than using null, at
least the Java null.

On Wed, Jan 28, 2015 at 6:10 PM, Evan R. Sparks evan.spa...@gmail.com
wrote:

You've got to be a little bit careful here. NA in systems like R or

pandas

may have special meaning that is distinct from null.

See, e.g. http://www.r-bloggers.com/r-na-vs-null/



On Wed, Jan 28, 2015 at 4:42 PM, Reynold Xin r...@databricks.com

wrote:

Isn't that just null in SQL?

On Wed, Jan 28, 2015 at 4:41 PM, Evan Chan velvia.git...@gmail.com
wrote:


I believe that most DataFrame implementations out there, like Pandas,
supports the idea of missing values / NA, and some support the idea of
Not Meaningful as well.

Does Row support anything like that?  That is important for certain
applications.  I thought that Row worked by being a mutable object,
but haven't looked into the details in a while.

-Evan

On Wed, Jan 28, 2015 at 4:23 PM, Reynold Xin r...@databricks.com
wrote:

It shouldn't change the data source api at all because data sources

create

RDD[Row], and that gets converted into a DataFrame automatically

(previously

to SchemaRDD).





https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala

One thing that will break the data source API in 1.3 is the location
of
types. Types were previously defined in sql.catalyst.types, and now

moved to

sql.types. After 1.3, sql.catalyst is hidden from users, and all
public

APIs

have first class classes/objects defined in sql directly.



On Wed, Jan 28, 2015 at 4:20 PM, Evan Chan velvia.git...@gmail.com

wrote:

Hey guys,

How does this impact the data sources API?  I was planning on using
this for a project.

+1 that many things from spark-sql / DataFrame is universally
desirable and useful.

By the way, one thing that prevents the columnar compression stuff

in

Spark SQL from being more useful is, at least from previous talks
with
Reynold and Michael et al., that the format was not designed for
persistence.

I have a new project that aims to change that.  It is a
zero-serialisation, high performance binary vector library,

designed

from the outset to be a persistent storage friendly.  May be one

day

it can replace the Spark SQL columnar compression.

Michael told me this would be a lot of work, and recreates parts of
Parquet, but I think it's worth it.  LMK if you'd like more

details.

-Evan

On Tue, Jan 27, 2015 at 4:35 PM, Reynold Xin r...@databricks.com

wrote:

Alright I have merged the patch (
https://github.com/apache/spark/pull/4173
) since I don't see any strong opinions against it (as a matter

of

fact

most were for it). We can still change it if somebody lays out a

strong

argument.

On Tue, Jan 27, 2015 at 12:25 PM, Matei Zaharia
matei.zaha...@gmail.com
wrote:


The type alias means your methods can specify either type and

they

will

work. It's just another name for the same type. But Scaladocs

and

such

will
show DataFrame as the type.

Matei


On Jan 27, 2015, at 12:10 PM, Dirceu Semighini Filho 

dirceu.semigh...@gmail.com wrote:

Reynold,
But with type alias we will have the same problem, right?
If the methods doesn't receive schemardd anymore, we will have
to
change
our code to migrade from schema to dataframe. Unless we have

an

implicit
conversion between DataFrame and SchemaRDD



2015-01-27 17:18 GMT-02:00 Reynold Xin r...@databricks.com:


Dirceu,

That is not possible because one cannot overload return

types.

SQLContext.parquetFile (and many other methods) needs to

return

some

type,

and that type cannot be both SchemaRDD and DataFrame.

In 1.3, we will create a type alias for DataFrame called
SchemaRDD
to

not

break source compatibility for Scala.


On Tue, Jan 27, 2015 at 6:28 AM, Dirceu Semighini Filho 
dirceu.semigh...@gmail.com wrote:


Can't the SchemaRDD remain the same, but deprecated, and be

removed

in

the

release 1.5(+/- 1)  for example, and the new code been added
to

DataFrame?

With this, we don't impact in existing code for the next few
releases.



2015-01-27 0:02 GMT-02:00 Kushal Datta

Re: renaming SchemaRDD - DataFrame

2015-01-29 Thread Cheng Lian
Forgot to mention that you can find it here 
https://github.com/apache/spark/blob/f9e569452e2f0ae69037644170d8aa79ac6b4ccf/sql/core/src/main/scala/org/apache/spark/sql/columnar/InMemoryColumnarTableScan.scala.


On 1/29/15 1:59 PM, Cheng Lian wrote:

Yes, when a DataFrame is cached in memory, it's stored in an efficient 
columnar format. And you can also easily persist it on disk using 
Parquet, which is also columnar.


Cheng


Re: Parquet File Binary column statistics error when reuse byte[] among rows

2015-04-12 Thread Cheng Lian
Thanks for reporting this! Would you mind opening JIRA tickets for both 
Spark and Parquet?


I'm not sure whether Parquet declares anywhere that users mustn't reuse 
byte arrays when using the binary type. If it does, then it's a Spark bug. 
Anyway, this should be fixed.
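
To make the reported pattern concrete, here is a small sketch of the unsafe 
reuse versus a defensive copy (the names are purely illustrative):

import org.apache.spark.sql.Row

// Unsafe: every Row shares the same mutable buffer, so anything that keeps a
// reference to it (like the Parquet min/max statistics described below) only
// ever sees the last content written into the buffer
val buffer = new Array[Byte](8)
val unsafeRows = (1 to 1000).map { i =>
  buffer(0) = i.toByte   // overwrite the shared buffer in place
  Row(i, buffer)
}

// Safer: give each row its own copy of the bytes
val safeRows = (1 to 1000).map { i =>
  buffer(0) = i.toByte
  Row(i, buffer.clone())
}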


Cheng

On 4/12/15 1:50 PM, Yijie Shen wrote:

Hi,

Suppose I create a dataRDD which extends RDD[Row], and each row is
GenericMutableRow(Array(Int, Array[Byte])). The same Array[Byte] object is
reused among rows but has different content each time. When I convert it to
a DataFrame and save it as a Parquet file, the file's row group statistics (max
& min) of the Binary column would be wrong.



Here is the reason: In Parquet, BinaryStatistic just keeps max & min as
parquet.io.api.Binary references; Spark SQL would generate a new Binary
backed by the same Array[Byte] passed from the row.
  reference chain of max: Binary --> ByteArrayBackedBinary --> Array[Byte]

Therefore, each time Parquet updates the row group's statistics, max & min
would always refer to the same Array[Byte], which has new content each
time. When Parquet decides to save it into the file, the last row's content
would be saved as both max & min.



It seems it is a Parquet bug because it's Parquet's responsibility to
update statistics correctly.
But I'm not quite sure. Should I report it as a bug in the Parquet JIRA?


The Spark JIRA is https://issues.apache.org/jira/browse/SPARK-6859




-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: IntelliJ Runtime error

2015-04-04 Thread Cheng Lian
I found in general it's a pain to build/run Spark inside IntelliJ IDEA. 
I guess most people resort to this approach so that they can leverage 
the integrated debugger to debug and/or learn Spark internals. A more 
convenient way I've been using recently is the remote debugging 
feature. In this way, by adding driver/executor Java options, you may 
build and start the Spark applications/tests/daemons in the normal way 
and attach the debugger to them. I was using this to debug 
HiveThriftServer2, and it worked perfectly.


Steps to enable remote debugging:

1. Menu Run / Edit configurations...
2. Click the + button, choose Remote
3. Choose Attach or Listen in Debugger mode according to your 
actual needs
4. Copy, edit, and add the Java options suggested in the dialog to 
`--driver-java-options` or `--executor-java-options` (see the example after this list)
5. If you're using attaching mode, first start your Spark program, then 
start remote debugging in IDEA
6. If you're using listening mode, first start remote debugging in IDEA, 
and then start your Spark program.
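
For example, with the attach mode, the driver side could be started with 
something like this (the port number is arbitrary; just a sketch):

./bin/spark-submit \
  --driver-java-options "-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005" \
  ...  # the rest of your usual arguments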


Hope this can be helpful.

Cheng

On 4/4/15 12:54 AM, sara mustafa wrote:

Thank you, it works with me when I changed the dependencies from provided to
compile.



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/IntelliJ-Runtime-error-tp11383p11385.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org





-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: About akka used in spark

2015-06-10 Thread Cheng Lian
We only shaded protobuf dependencies because of compatibility issues. 
The source code is not modified.


On 6/10/15 1:55 PM, wangtao (A) wrote:


Hi guys,

I see group id of akka used in spark is “org.spark-project.akka”. What 
is its difference with the typesafe one? What is its version? And 
where can we get the source code?


Regards.





Re: How to support dependency jars and files on HDFS in standalone cluster mode?

2015-06-12 Thread Cheng Lian

Would you mind to file a JIRA for this? Thanks!

Cheng

On 6/11/15 2:40 PM, Dong Lei wrote:


I think in standalone cluster mode, spark is supposed to do:

1. Download jars and files to the driver

2. Set the driver’s class path

3. The driver sets up an HTTP file server to distribute these files

4. Workers download from the driver and set up the classpath

Right?

But somehow, the first step fails.

Even if I can make the first step work (using option 1), it seems that 
the classpath on the driver is not correctly set.


Thanks

Dong Lei

*From:*Cheng Lian [mailto:lian.cs@gmail.com]
*Sent:* Thursday, June 11, 2015 2:32 PM
*To:* Dong Lei
*Cc:* Dianfei (Keith) Han; dev@spark.apache.org
*Subject:* Re: How to support dependency jars and files on HDFS in 
standalone cluster mode?


Oh sorry, I mistook --jars for --files. Yeah, for jars we need to add 
them to classpath, which is different from regular files.


Cheng

On 6/11/15 2:18 PM, Dong Lei wrote:

Thanks Cheng,

If I do not use --jars how can I tell spark to search the jars(and
files) on HDFS?

Do you mean the driver will not need to setup a HTTP file server
for this scenario and the worker will fetch the jars and files
from HDFS?

Thanks

Dong Lei

*From:*Cheng Lian [mailto:lian.cs@gmail.com]
*Sent:* Thursday, June 11, 2015 12:50 PM
*To:* Dong Lei; dev@spark.apache.org mailto:dev@spark.apache.org
*Cc:* Dianfei (Keith) Han
*Subject:* Re: How to support dependency jars and files on HDFS in
standalone cluster mode?

Since the jars are already on HDFS, you can access them directly
in your Spark application without using --jars

Cheng

On 6/11/15 11:04 AM, Dong Lei wrote:

Hi spark-dev:

I can not use a hdfs location for the “--jars” or “--files”
option when doing a spark-submit in a standalone cluster mode.
For example:

Spark-submit  …  --jars hdfs://ip/1.jar  ….
 hdfs://ip/app.jar (standalone cluster mode)

will not download 1.jar to driver’s http file server(but the
app.jar will be downloaded to the driver’s dir).

I figure out the reason spark not downloading the jars is that
when doing sc.addJar to http file server, the function called
is Files.copy which does not support a remote location.

And I think if spark can download the jars and add them to
http file server, the classpath is not correctly set, because
the classpath contains remote location.

So I’m trying to make it work and come up with two options,
but neither of them seem to be elegant, and I want to hear
your advices:

Option 1:

Modify HTTPFileServer.addFileToDir, let it recognize a “hdfs”
prefix.

This is not good because I think it breaks the scope of http
file server.

Option 2:

Modify DriverRunner.downloadUserJar, let it download all the
“--jars” and “--files” with the application jar.

This sounds more reasonable than option 1 for downloading
files. But this way I need to read the “spark.jars” and
“spark.files” on downloadUserJar or DriverRunnder.start and
replace it with a local path. How can I do that?

Do you have a more elegant solution, or do we have a plan to
support it in the future?

Thanks

Dong Lei





Re: How to support dependency jars and files on HDFS in standalone cluster mode?

2015-06-10 Thread Cheng Lian
Since the jars are already on HDFS, you can access them directly in your 
Spark application without using --jars


Cheng

On 6/11/15 11:04 AM, Dong Lei wrote:


Hi spark-dev:

I can not use a hdfs location for the “--jars” or “--files” option 
when doing a spark-submit in a standalone cluster mode. For example:


Spark-submit  …   --jars hdfs://ip/1.jar  …. 
 hdfs://ip/app.jar (standalone cluster mode)


will not download 1.jar to driver’s http file server(but the app.jar 
will be downloaded to the driver’s dir).


I figure out the reason spark not downloading the jars is that when 
doing sc.addJar to http file server, the function called is Files.copy 
which does not support a remote location.


And I think if spark can download the jars and add them to http file 
server, the classpath is not correctly set, because the classpath 
contains remote location.


So I’m trying to make it work and come up with two options, but 
neither of them seem to be elegant, and I want to hear your advices:


Option 1:

Modify HTTPFileServer.addFileToDir, let it recognize a “hdfs” prefix.

This is not good because I think it breaks the scope of http file server.

Option 2:

Modify DriverRunner.downloadUserJar, let it download all the “--jars” 
and “--files” with the application jar.


This sounds more reasonable than option 1 for downloading files. But 
this way I need to read the “spark.jars” and “spark.files” on 
downloadUserJar or DriverRunnder.start and replace it with a local 
path. How can I do that?


Do you have a more elegant solution, or do we have a plan to support 
it in the future?


Thanks

Dong Lei





Re: How to support dependency jars and files on HDFS in standalone cluster mode?

2015-06-11 Thread Cheng Lian
Oh sorry, I mistook --jars for --files. Yeah, for jars we need to add 
them to classpath, which is different from regular files.


Cheng

On 6/11/15 2:18 PM, Dong Lei wrote:


Thanks Cheng,

If I do not use --jars how can I tell spark to search the jars(and 
files) on HDFS?


Do you mean the driver will not need to setup a HTTP file server for 
this scenario and the worker will fetch the jars and files from HDFS?


Thanks

Dong Lei

*From:*Cheng Lian [mailto:lian.cs@gmail.com]
*Sent:* Thursday, June 11, 2015 12:50 PM
*To:* Dong Lei; dev@spark.apache.org
*Cc:* Dianfei (Keith) Han
*Subject:* Re: How to support dependency jars and files on HDFS in 
standalone cluster mode?


Since the jars are already on HDFS, you can access them directly in 
your Spark application without using --jars


Cheng

On 6/11/15 11:04 AM, Dong Lei wrote:

Hi spark-dev:

I can not use a hdfs location for the “--jars” or “--files” option
when doing a spark-submit in a standalone cluster mode. For example:

Spark-submit  …   --jars hdfs://ip/1.jar  ….
 hdfs://ip/app.jar (standalone cluster mode)

will not download 1.jar to driver’s http file server(but the
app.jar will be downloaded to the driver’s dir).

I figure out the reason spark not downloading the jars is that
when doing sc.addJar to http file server, the function called is
Files.copy which does not support a remote location.

And I think if spark can download the jars and add them to http
file server, the classpath is not correctly set, because the
classpath contains remote location.

So I’m trying to make it work and come up with two options, but
neither of them seem to be elegant, and I want to hear your advices:

Option 1:

Modify HTTPFileServer.addFileToDir, let it recognize a “hdfs” prefix.

This is not good because I think it breaks the scope of http file
server.

Option 2:

Modify DriverRunner.downloadUserJar, let it download all the
“--jars” and “--files” with the application jar.

This sounds more reasonable than option 1 for downloading files.
But this way I need to read the “spark.jars” and “spark.files” on
downloadUserJar or DriverRunnder.start and replace it with a local
path. How can I do that?

Do you have a more elegant solution, or do we have a plan to
support it in the future?

Thanks

Dong Lei





Re: possible issues with listing objects in the HadoopFSrelation

2015-08-12 Thread Cheng Lian

Hi Gil,

Sorry for the late reply and thanks for raising this question. The file 
listing logic in HadoopFsRelation is intentionally made different from 
Hadoop FileInputFormat. Here are the reasons:


1. Efficiency: when computing RDD partitions, 
FileInputFormat.listStatus() is called on the driver side in a 
sequential manner, and can be slow for S3 directories with lots of 
sub-directories, e.g. partitioned tables with thousands or even more 
partitions. This is partly because file metadata operation can be very 
slow on S3. HadoopFsRelation relies on this file listing action to do 
partition discovery, and we've made a distributed parallel version in 
Spark 1.5: we first list input paths on driver side in a sequential 
breadth-first manner, and once we find the number of directories to be 
listed exceeds a threshold (32 by default), we launch a Spark job to do 
file listing. With this mechanism, we've observed 2 orders of magnitude 
performance boost when reading partitioned table with thousands of 
distinct partitions located on S3.


2. Semantics difference: the default hiddenFileFilter doesn't apply in 
every cases. For example, Parquet summary files _metadata and 
_common_metadata plays crucial roles in schema discovery and schema 
merging, and we don't want to exclude them when listing the files. But 
they are removed when reading the actual data. However, we probably 
should allow users to pass in user defined path filters.


Cheng

On 8/10/15 7:55 PM, Gil Vernik wrote:

Just some thoughts, hope i didn't missed something obvious.

HadoopFSRelation calls the FileSystem class directly to list files in the 
path.
It looks like it implements basically the same logic as the 
FileInputFormat.listStatus method (located in 
hadoop-mapreduce-client-core).


The point is that HadoopRDD (or similar ) calls getSplits method that 
calls FileInputFormat.listStatus, while HadoopFSRelation calls 
FileSystem directly and both of them try to achieve listing of objects.


There might be various issues with this. For example, 
https://issues.apache.org/jira/browse/SPARK-7868 makes sure that 
_temporary is not returned in the result, but the listing in 
FileInputFormat contains more logic; it uses a hidden PathFilter like this:


private static final PathFilter hiddenFileFilter = new PathFilter() {
  public boolean accept(Path p) {
    String name = p.getName();
    return !name.startsWith("_") && !name.startsWith(".");
  }
};

In addition, a custom FileOutputCommitter may use a name other than 
_temporary.


All this may lead to HadoopFSRelation and HadoopRDD providing 
different lists for the same data source.


My question is: what is the roadmap for this listing in HadoopFSRelation? 
Will it implement exactly the same logic as in 
FileInputFormat.listStatus, or maybe one day HadoopFSRelation will 
call FileInputFormat.listStatus and provide a custom PathFilter or 
MultiPathFilter? That way there will be a single piece of code that lists objects.


Thanks,
Gil.






Deleted unreleased version 1.6.0 from JIRA by mistake

2015-07-22 Thread Cheng Lian

Hi all,

The unreleased version 1.6.0 was removed from JIRA due to my 
mistake. I've added it back, but JIRA tickets that once targeted 
1.6.0 now have an empty target version/s field. If you find tickets that should 
have targeted 1.6.0, please help mark the target version/s field 
back to 1.6.0.


Thanks in advance and sorry for all the trouble!

Best,
Cheng

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Filter applied on merged Parquet shemsa with new column fails.

2015-10-28 Thread Cheng Lian

Hey Hyukjin,

Sorry that I missed the JIRA ticket. Thanks for bringing this issue up 
here, and for your detailed investigation.


From my side, I think this is a bug in Parquet. Parquet was designed to 
support schema evolution. When scanning a Parquet file, if a column exists in 
the requested schema but is missing from the file schema, that column is 
filled with nulls. This should also hold for pushed-down predicate 
filters. For example, if the filter "a = 1" is pushed down but column "a" 
doesn't exist in the Parquet file being scanned, it's safe to assume "a" 
is null in all records and drop all of them. On the contrary, if "a IS 
NULL" is pushed down, all records should be preserved.


Apparently, before this issue is properly fixed on the Parquet side, we need 
to work around it on the Spark side. Please see my comments on all 
3 of your solutions inlined below. In short, I'd like to have approach 1 
for branch-1.5 and approach 2 for master.


Cheng

On 10/28/15 10:11 AM, Hyukjin Kwon wrote:
When schema merging and predicate filter push-down are both enabled, this fails 
since Parquet filters are pushed down regardless of the schema of each 
split (or rather, each file).


Dominic Ricard reported this 
issue (https://issues.apache.org/jira/browse/SPARK-11103)


Even though this would work okay by setting 
spark.sql.parquet.filterPushdown to false, the default value of this 
is true. So this looks like an issue.


My questions are:
is this clearly an issue?
And if so, which way should it be handled?


I think this is an issue, and I made three rough patches for it and 
tested them; they look fine.


The first approach looks simpler and appropriate, as I presume from 
previous fixes such as 
https://issues.apache.org/jira/browse/SPARK-11153.
However, in terms of safety and performance, I also want to make sure 
which one is the proper approach before trying to open a PR.


1. Simply set spark.sql.parquet.filterPushdown to false when using 
mergeSchema
This one is pretty simple and safe. I'd like to have this for 1.5.2, or 
1.5.3 if we can't make it for 1.5.2.


2. If spark.sql.parquet.filterPushdown is true, retrieve the 
schemas of all part-files (and also the merged one) and check if each can 
accept the given schema, and then apply the filter only when they all 
can accept it, which I think is a bit over-implemented.
Actually we only need to calculate the intersection of all file 
schemata. We can make ParquetRelation.mergeSchemaInParallel return two 
StructTypes: the first one is the original merged schema, the other is 
the intersection of all file schemata, which only contains fields that 
exist in all file schemata. Then we decide which filters to push down 
according to the second StructType.
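
A rough sketch of that intersection idea (names are illustrative; nested 
fields and other corner cases are ignored):

import org.apache.spark.sql.types.{StructField, StructType}

// Keep only top-level fields that appear with the same name and data type in
// every part-file schema; filters touching any other column would not be safe
// to push down.
def intersectSchemata(schemata: Seq[StructType]): StructType =
  schemata.reduce { (left, right) =>
    StructType(left.filter { f =>
      right.exists(r => r.name == f.name && r.dataType == f.dataType)
    })
  }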


3. If spark.sql.parquet.filterPushdown is true, retrieve the 
schemas of all part-files (and also the merged one) and apply the filter 
to each split (rather, file) that can accept the filter, which (I think 
it's hacky) ends up with different configurations for each task in a job.
The idea I came up with at first was similar to this one. Instead of 
pulling all file schemata to the driver side, we can push filter push-down 
to the executor side. Namely, pass candidate filters to the executor side, 
and compute the Parquet predicate filter according to each file schema. 
I haven't looked into this direction in depth, but we can probably put 
this part into CatalystReadSupport, which is now initialized on the executor 
side.


However, correctness of this approach can only be guaranteed by the 
defensive filtering we do in Spark SQL (i.e. applying all the filters no 
matter whether they are pushed down or not), but we are considering removing it 
because it imposes an unnecessary performance cost. This makes me hesitant 
to go down this way.


Re: [ compress in-memory column storage used in sparksql cache table ]

2015-09-02 Thread Cheng Lian
Yeah, two of the reasons why the built-in in-memory columnar storage 
doesn't achieve a compression ratio comparable to Parquet are:


1. The in-memory columnar representation doesn't handle nested types. So 
array/map/struct values are not compressed.
2. Parquet may use more than one kind of compression method to compress 
a single column. For example, dictionary encoding + RLE.
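
For completeness, compression of the built-in in-memory columnar cache is 
toggled by a conf flag. A tiny sketch ("some_table" is just a placeholder):

// true by default; controls compression of the in-memory columnar buffers
sqlContext.setConf("spark.sql.inMemoryColumnarStorage.compressed", "true")
sqlContext.cacheTable("some_table")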


Cheng

On 9/2/15 3:58 PM, Nitin Goyal wrote:

I think spark sql's in-memory columnar cache already does compression. Check
out classes in following path :-

https://github.com/apache/spark/tree/master/sql/core/src/main/scala/org/apache/spark/sql/columnar/compression

Although compression ratio is not as good as Parquet.

Thanks
-Nitin



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/compress-in-memory-column-storage-used-in-sparksql-cache-table-tp13932p13937.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org





-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [build system] jenkins downtime, thursday 12/10/15 7am PDT

2015-12-10 Thread Cheng Lian
Hi Shane,

I found that Jenkins has been in the status of "Jenkins is going to shut
down" for at least 4 hours (from ~23:30 Dec 9 to 3:45 Dec 10, PDT). Not
sure whether this is part of the schedule or related?

Cheng

On Thu, Dec 10, 2015 at 3:56 AM, shane knapp  wrote:

> here's the security advisory for the update:
>
> https://wiki.jenkins-ci.org/display/SECURITY/Jenkins+Security+Advisory+2015-12-09
>
> On Wed, Dec 9, 2015 at 9:55 AM, shane knapp  wrote:
> > reminder!  this is happening tomorrow morning.
> >
> > On Wed, Dec 2, 2015 at 7:20 PM, shane knapp  wrote:
> >> there's Yet Another Jenkins Security Advisory[tm], and a big release
> >> to patch it all coming out next wednesday.
> >>
> >> to that end i will be performing a jenkins update, as well as
> >> performing the work to resolve the following jira issue:
> >> https://issues.apache.org/jira/browse/SPARK-11255
> >>
> >> i will put jenkins in to quiet mode around 6am, start work around 7am
> >> and expect everything to be back up and building before 9am.  i'll
> >> post updates as things progress.
> >>
> >> please let me know ASAP if there's any problem with this schedule.
> >>
> >> shane
>
>
>


Re: [VOTE] Release Apache Spark 1.6.0 (RC4)

2015-12-26 Thread Cheng Lian

+1

On 12/23/15 12:39 PM, Yin Huai wrote:

+1

On Tue, Dec 22, 2015 at 8:10 PM, Denny Lee wrote:

+1

On Tue, Dec 22, 2015 at 7:05 PM Aaron Davidson wrote:

+1

On Tue, Dec 22, 2015 at 7:01 PM, Josh Rosen wrote:

+1

On Tue, Dec 22, 2015 at 7:00 PM, Jeff Zhang wrote:

+1

On Wed, Dec 23, 2015 at 7:36 AM, Mark Hamstra wrote:

+1

On Tue, Dec 22, 2015 at 12:10 PM, Michael Armbrust wrote:

Please vote on releasing the following
candidate as Apache Spark version 1.6.0!

The vote is open until Friday, December 25,
2015 at 18:00 UTC and passes if a majority of
at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.6.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see
http://spark.apache.org/

The tag to be voted on is v1.6.0-rc4
(4062cda3087ae42c6c3cb24508fc1d3a931accdf)

The release files, including signatures,
digests, etc. can be found at:

http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc4-bin/



Release artifacts are signed with the
following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be
found at:

https://repository.apache.org/content/repositories/orgapachespark-1176/

The test repository (versioned as v1.6.0-rc4)
for this release can be found at:

https://repository.apache.org/content/repositories/orgapachespark-1175/

The documentation corresponding to this
release can be found at:

http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc4-docs/



===
== How can I help test this release? ==
===
If you are a Spark user, you can help us test
this release by taking an existing Spark
workload and running on this release
candidate, then reporting any regressions.


== What justifies a -1 vote for this release? ==

This vote is happening towards the end of the
1.6 QA period, so -1 votes should only occur
for significant regressions from 1.5. Bugs
already present in 1.5, minor regressions, or
bugs related to new features will not block
this release.


===
== What should happen to JIRA tickets still
targeting 1.6.0? ==

===
1. It is OK for documentation patches to
target 1.6.0 and still go into branch-1.6,
since documentations will be published
separately from the release.
2. New features for non-alpha-modules should
target 1.7+.
3. Non-blocker bug fixes should target 1.6.1
or 1.7.0, or drop the target version.


==
== Major changes to help you focus your testing ==

Re: Spark 2.0 Dataset Documentation

2016-06-17 Thread Cheng Lian

Hey Pedro,

SQL programming guide is being updated. Here's the PR, but not merged 
yet: https://github.com/apache/spark/pull/13592


Cheng

On 6/17/16 9:13 PM, Pedro Rodriguez wrote:

Hi All,

At my workplace we are starting to use Datasets in 1.6.1 and even more 
with Spark 2.0 in place of Dataframes. I looked at the 1.6.1 
documentation then the 2.0 documentation and it looks like not much 
time has been spent writing a Dataset guide/tutorial.


Preview Docs: 
https://home.apache.org/~pwendell/spark-releases/spark-2.0.0-preview-docs/sql-programming-guide.html#creating-datasets 

Spark master docs: 
https://github.com/apache/spark/blob/master/docs/sql-programming-guide.md


I would like to spend the time to contribute an improvement to those
docs with more in-depth examples of creating and using Datasets (e.g.
using $ to select columns). Is this of value, and if so, what should my
next step be to get this going (create a JIRA, etc.)?
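For reference, a minimal sketch of the kind of Dataset example being
discussed (Spark 2.0 APIs; the case class and data are illustrative):

import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Long)

val spark = SparkSession.builder().master("local[*]").appName("ds-example").getOrCreate()
import spark.implicits._   // enables .toDS() and the $"column" syntax

val people = Seq(Person("Alice", 29), Person("Bob", 35)).toDS()
people.filter($"age" >= 30).select($"name").show()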


--
Pedro Rodriguez
PhD Student in Distributed Machine Learning | CU Boulder
R Data Science Intern at Oracle Data Cloud
UC Berkeley AMPLab Alumni

ski.rodrig...@gmail.com  | 
pedrorodriguez.io  | 909-353-4423
Github: github.com/EntilZha  | LinkedIn: 
https://www.linkedin.com/in/pedrorodriguezscience






Re: Spark 2.0 Dataset Documentation

2016-06-17 Thread Cheng Lian
As mentioned in the PR description, this is just an initial PR to bring 
existing contents up to date, so that people can add more contents 
incrementally.


We should definitely cover more about Dataset.


Cheng


On 6/17/16 10:28 PM, Pedro Rodriguez wrote:

The updates look great!

Looks like many places are updated to the new APIs, but there still 
isn't a section for working with Datasets (most of the docs work with 
Dataframes). Are you planning on adding more? I am thinking something 
that would address common questions like the one I posted on the user 
email list earlier today.


Should I take discussion to your PR?

Pedro

On Fri, Jun 17, 2016 at 11:12 PM, Cheng Lian <lian.cs@gmail.com> wrote:


Hey Pedro,

SQL programming guide is being updated. Here's the PR, but not
merged yet: https://github.com/apache/spark/pull/13592

Cheng

On 6/17/16 9:13 PM, Pedro Rodriguez wrote:

Hi All,

At my workplace we are starting to use Datasets in 1.6.1 and even
more with Spark 2.0 in place of Dataframes. I looked at the 1.6.1
documentation then the 2.0 documentation and it looks like not
much time has been spent writing a Dataset guide/tutorial.

Preview Docs:

https://home.apache.org/~pwendell/spark-releases/spark-2.0.0-preview-docs/sql-programming-guide.html#creating-datasets
Spark master docs:
https://github.com/apache/spark/blob/master/docs/sql-programming-guide.md


I would like to spend the time to contribute an improvement to
those docs with a more in depth examples of creating and using
Datasets (eg using $ to select columns). Is this of value, and if
so what should my next step be to get this going (create JIRA etc)?

-- 
Pedro Rodriguez

PhD Student in Distributed Machine Learning | CU Boulder
R Data Science Intern at Oracle Data Cloud
UC Berkeley AMPLab Alumni

ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423

Github: github.com/EntilZha | LinkedIn: https://www.linkedin.com/in/pedrorodriguezscience






--
Pedro Rodriguez
PhD Student in Distributed Machine Learning | CU Boulder
UC Berkeley AMPLab Alumni

ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
Github: github.com/EntilZha | LinkedIn: https://www.linkedin.com/in/pedrorodriguezscience






Re: Welcoming two new committers

2016-02-17 Thread Cheng Lian

Awesome! Congrats and welcome!!

On 2/9/16 2:55 AM, Shixiong(Ryan) Zhu wrote:

Congrats!!! Herman and Wenchen!!!

On Mon, Feb 8, 2016 at 10:44 AM, Luciano Resende wrote:

On Mon, Feb 8, 2016 at 9:15 AM, Matei Zaharia wrote:

Hi all,

The PMC has recently added two new Spark committers -- Herman
van Hovell and Wenchen Fan. Both have been heavily involved in
Spark SQL and Tungsten, adding new features, optimizations and
APIs. Please join me in welcoming Herman and Wenchen.

Matei


Congratulations !!!

-- 
Luciano Resende

http://people.apache.org/~lresende

http://twitter.com/lresende1975
http://lresende.blogspot.com/






Re: Welcoming two new committers

2016-02-17 Thread Cheng Lian
Awesome! Congrats and welcome!!

Cheng

On Tue, Feb 9, 2016 at 2:55 AM, Shixiong(Ryan) Zhu 
wrote:

> Congrats!!! Herman and Wenchen!!!
>
>
> On Mon, Feb 8, 2016 at 10:44 AM, Luciano Resende 
> wrote:
>
>>
>>
>> On Mon, Feb 8, 2016 at 9:15 AM, Matei Zaharia 
>> wrote:
>>
>>> Hi all,
>>>
>>> The PMC has recently added two new Spark committers -- Herman van Hovell
>>> and Wenchen Fan. Both have been heavily involved in Spark SQL and Tungsten,
>>> adding new features, optimizations and APIs. Please join me in welcoming
>>> Herman and Wenchen.
>>>
>>> Matei
>>>
>>
>> Congratulations !!!
>>
>> --
>> Luciano Resende
>> http://people.apache.org/~lresende
>> http://twitter.com/lresende1975
>> http://lresende.blogspot.com/
>>
>
>


Re: [VOTE] Release Apache Parquet 1.8.2 RC1

2017-01-23 Thread Cheng Lian
Sorry for being late, I'm building a Spark branch based on the most 
recent master to test out 1.8.2-rc1, will post my result here ASAP.


Cheng


On 1/23/17 11:43 AM, Julien Le Dem wrote:

Hi Spark dev,
Here is the voting thread for parquet 1.8.2 release.
Cheng, or someone else from the Spark side, we would appreciate it if you
could verify it as well and reply to the thread.


On Mon, Jan 23, 2017 at 11:40 AM, Julien Le Dem wrote:


+1
Followed:
https://cwiki.apache.org/confluence/display/PARQUET/How+To+Verify+A+Release


checked sums, ran the build and tests.
We would appreciate someone from the Spark project (Cheng?) to
verify the release as well.
CC'ing spark


On Mon, Jan 23, 2017 at 10:15 AM, Ryan Blue wrote:

+1

On Mon, Jan 23, 2017 at 10:15 AM, Daniel Weeks

wrote:

> +1 checked sums, built, tested
>
> On Mon, Jan 23, 2017 at 9:58 AM, Ryan Blue

> wrote:
>
> > Gabor, that md5 matches what I get. Are you sure you used
the right file?
> > It isn’t the same format that md5sum produces, but if you
check the
> octets
> > the hash matches..
> >
> > [blue@work Downloads]$ md5sum apache-parquet-1.8.2.tar.gz
> > b3743995bee616118c28f324598684ba apache-parquet-1.8.2.tar.gz
> >
> > rb
> > ​
> >
> > On Thu, Jan 19, 2017 at 8:06 AM, Gabor Szadovszky <
> > gabor.szadovs...@cloudera.com
> wrote:
> >
> > > Hi Ryan,
> > >
> > > I’ve downloaded the tar and checked the signature and
the checksums.
> SHA
> > > and ASC are fine. MD5 is not and the content does not
seem to be a
> common
> > > MD5 either:
> > > apache-parquet-1.8.2.tar.gz: B3 74 39 95 BE E6 16 11  8C
28 F3 24 59 86
> > 84
> > > BA
> > >
> > > The artifacts on Nexus are good with all the related
signatures and
> > > checksums. The source zip properly contains the files
from the repo
> with
> > > the tag apache-parquet-1.8.2.
> > >
> > > Regards,
> > > Gabor
> > >
> > > > On 19 Jan 2017, at 04:09, Ryan Blue > wrote:
> > > >
> > > > Hi everyone,
> > > >
> > > > I propose the following RC to be released as official
Apache Parquet
> > > 1.8.2
> > > > release.
> > > >
> > > > The commit id is c6522788629e590a53eb79874b95f6c3ff11f16c
> > > > * This corresponds to the tag: apache-parquet-1.8.2
> > > > * https://github.com/apache/parquet-mr/tree/c6522788

> > > > *
> > > >
https://git-wip-us.apache.org/repos/asf/projects/repo?p=

> > > parquet-mr.git=commit=c6522788
> > > >
> > > > The release tarball, signature, and checksums are here:
> > > > *
https://dist.apache.org/repos/dist/dev/parquet/apache-

> > > parquet-1.8.2-rc1
> > > >
> > > > You can find the KEYS file here:
> > > > * https://dist.apache.org/repos/dist/dev/parquet/KEYS

> > > >
> > > > Binary artifacts are staged in Nexus here:
> > > > *
> > > >
https://repository.apache.org/content/groups/staging/org/

> > > apache/parquet/parquet/1.8.2/
> > > >
> > > > This is a patch release with backports from the master
branch. For a
> > > > detailed summary, see the spreadsheet here:
> > > >
> > > > *
> > > >
https://docs.google.com/spreadsheets/d/1NAuY3c77Egs6REu-

> > > UVkQqPswpVYVgZTTnY3bM0SPVRs/edit#gid=0
> > > >
> > > > Please download, verify, and test.
> > > >
> > > > Please vote by the end of Monday, 18 January.
> > > >
> > > > [ ] +1 Release this as Apache Parquet 1.8.2
> > > > [ ] +0
> > > > [ ] -1 Do not release this because...
> > > >
> > > >
> > > >
> > > > --
> > > > Ryan Blue
> > >
> > >
> >
> >
> > --
> > Ryan Blue
> > 

Re: The driver hangs at DataFrame.rdd in Spark 2.1.0

2017-02-23 Thread Cheng Lian

This one seems to be relevant, but it's already fixed in 2.1.0.

One way to debug is to turn on trace-level logging and check how the
analyzer/optimizer behaves.
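A minimal sketch of doing that programmatically (assuming Spark 2.1,
which still ships log4j 1.x; the logger granularity is an assumption,
and the equivalent log4j.properties entries work too):

import org.apache.log4j.{Level, Logger}

// Log analyzer/optimizer rule applications on the driver at TRACE level
Logger.getLogger("org.apache.spark.sql.catalyst").setLevel(Level.TRACE)
Logger.getLogger("org.apache.spark.sql.execution").setLevel(Level.TRACE)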



On 2/22/17 11:11 PM, StanZhai wrote:
Could this be related to 
https://issues.apache.org/jira/browse/SPARK-17733 ?



-- Original --
*From: * "Cheng Lian-3 [via Apache Spark Developers List]";<[hidden 
email] >;

*Send time:* Thursday, Feb 23, 2017 9:43 AM
*To:* "Stan Zhai"<[hidden email] 
>;

*Subject: * Re: The driver hangs at DataFrame.rdd in Spark 2.1.0

Just from the thread dump you provided, it seems that this particular
query plan jams our optimizer. However, it's also possible that the
driver just happened to be running optimizer rules at that particular
point in time.


Since query planning doesn't touch any actual data, could you please
try to minimize this query by replacing the actual relations with
temporary views derived from local Scala collections? In this way, it
would be much easier for others to reproduce the issue.
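A minimal sketch of what I mean (assuming an existing SparkSession named
spark; table and column names are illustrative):

import spark.implicits._

Seq((1, "a"), (2, "b")).toDF("id", "name").createOrReplaceTempView("dim")
Seq((1, 10.0), (2, 20.0)).toDF("id", "amount").createOrReplaceTempView("fact")

// Re-run the problematic SQL against these views instead of the real tables
spark.sql("SELECT d.name, sum(f.amount) FROM fact f JOIN dim d ON f.id = d.id GROUP BY d.name").rdd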


Cheng


On 2/22/17 5:16 PM, Stan Zhai wrote:

Thanks for lian's reply.

Here is the QueryPlan generated by Spark 1.6.2(I can't get it in 
Spark 2.1.0):

|...|
||

-- Original --
*Subject: * Re: The driver hangs at DataFrame.rdd in Spark 2.1.0

What is the query plan? We once observed query plans that grow
exponentially in iterative ML workloads, where the query planner hangs
forever. For example, each iteration combines 4 plan trees of the
previous iteration into a larger plan tree, so the size of the plan
tree can easily reach billions of nodes after 15 iterations.
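A minimal sketch (purely illustrative, not the original workload) of how
such a blow-up can happen:

var df = spark.range(10).toDF("id")
for (_ <- 1 to 15) {
  // Each iteration references the previous plan four times, so the
  // logical plan tree roughly quadruples per iteration (4^15 is ~1 billion nodes)
  df = df.union(df).union(df).union(df)
}
// Triggering planning here (e.g. df.rdd) can hang the driver.
// Truncating the lineage, e.g. with df.checkpoint(), keeps the plan small.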



On 2/22/17 9:29 AM, Stan Zhai wrote:

Hi all,

The driver hangs at DataFrame.rdd in Spark 2.1.0 when the
DataFrame (SQL) is complex. The following is the thread dump of my driver:

...















Re: welcoming Xiao Li as a committer

2016-10-04 Thread Cheng Lian
Congratulations!!!

Cheng

On Tue, Oct 4, 2016 at 1:46 PM, Reynold Xin  wrote:

> Hi all,
>
> Xiao Li, aka gatorsmile, has recently been elected as an Apache Spark
> committer. Xiao has been a super active contributor to Spark SQL. Congrats
> and welcome, Xiao!
>
> - Reynold
>
>


Re: Is `randomized aggregation test` testsuite stable?

2016-11-10 Thread Cheng Lian

JIRA: https://issues.apache.org/jira/browse/SPARK-18403

PR: https://github.com/apache/spark/pull/15845

Will merge it as soon as Jenkins passes.

Cheng

On 11/10/16 11:30 AM, Dongjoon Hyun wrote:

Great! Thank you so much, Cheng!

Bests,
Dongjoon.

On 2016-11-10 11:21 (-0800), Cheng Lian <lian.cs@gmail.com> wrote:

Hey Dongjoon,

Thanks for reporting. I'm looking into these OOM errors. Already
reproduced them locally but haven't figured out the root cause yet.
Gonna disable them temporarily for now.

Sorry for the inconvenience!

Cheng


On 11/10/16 8:48 AM, Dongjoon Hyun wrote:

Hi, All.

Recently, I observed frequent failures of `randomized aggregation test` of 
ObjectHashAggregateSuite in SparkPullRequestBuilder.

SPARK-17982   https://github.com/apache/spark/pull/15546 (Today)
SPARK-18123   https://github.com/apache/spark/pull/15664 (Today)
SPARK-18169   https://github.com/apache/spark/pull/15682 (Today)
SPARK-18292   https://github.com/apache/spark/pull/15789 (4 days ago. It's gone 
after `retest`)

I'm wondering if anyone else has met those failures? Should I file a JIRA issue for
this?

Bests,
Dongjoon.






-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Is `randomized aggregation test` testsuite stable?

2016-11-10 Thread Cheng Lian

Hey Dongjoon,

Thanks for reporting. I'm looking into these OOM errors. Already 
reproduced them locally but haven't figured out the root cause yet. 
Gonna disable them temporarily for now.


Sorry for the inconvenience!

Cheng


On 11/10/16 8:48 AM, Dongjoon Hyun wrote:

Hi, All.

Recently, I observed frequent failures of `randomized aggregation test` of 
ObjectHashAggregateSuite in SparkPullRequestBuilder.

SPARK-17982   https://github.com/apache/spark/pull/15546 (Today)
SPARK-18123   https://github.com/apache/spark/pull/15664 (Today)
SPARK-18169   https://github.com/apache/spark/pull/15682 (Today)
SPARK-18292   https://github.com/apache/spark/pull/15789 (4 days ago. It's gone 
after `retest`)

I'm wondering if anyone else has met those failures? Should I file a JIRA issue for
this?

Bests,
Dongjoon.






-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Parquet patch release

2017-01-09 Thread Cheng Lian
Finished reviewing the list and it LGTM now (left comments in the 
spreadsheet and Ryan already made corresponding changes).


Ryan - Thanks a lot for pushing this and making it happen!

Cheng


On 1/6/17 3:46 PM, Ryan Blue wrote:
Last month, there was interest in a Parquet patch release on PR #16281 
. I went ahead and 
reviewed commits that should go into a Parquet patch release and 
started a 1.8.2 discussion 
 
on the Parquet dev list. If you're interested in reviewing what goes 
into 1.8.2 or have suggestions, please follow that thread on the 
Parquet list.


Thanks!

rb

--
Ryan Blue
Software Engineer
Netflix




Re: [VOTE][SPIP] SPARK-22026 data source v2 write path

2017-10-16 Thread Cheng Lian

+1


On 10/12/17 20:10, Liwei Lin wrote:

+1 !

Cheers,
Liwei

On Thu, Oct 12, 2017 at 7:11 PM, vaquar khan wrote:


+1

Regards,
Vaquar khan

On Oct 11, 2017 10:14 PM, "Weichen Xu" wrote:

+1

On Thu, Oct 12, 2017 at 10:36 AM, Xiao Li wrote:

+1

Xiao

On Mon, 9 Oct 2017 at 7:31 PM Reynold Xin wrote:

+1

One thing with MetadataSupport - It's a bad idea to
call it that unless adding new functions in that trait
wouldn't break source/binary compatibility in the future.


On Mon, Oct 9, 2017 at 6:07 PM, Wenchen Fan wrote:

I'm adding my own +1 (binding).

On Tue, Oct 10, 2017 at 9:07 AM, Wenchen Fan wrote:

I'm going to update the proposal: for the last
point, although the user-facing API
(`df.write.format(...).option(...).mode(...).save()`)
mixes data and metadata operations, we are
still able to separate them in the data source
write API. We can have a mix-in trait
`MetadataSupport` which has a method
`create(options)`, so that data sources can
mix in this trait and provide metadata
creation support. Spark will call this
`create` method inside `DataFrameWriter.save`
if the specified data source has it.

Note that file format data sources can ignore
this new trait and still write data without
metadata (it doesn't have metadata anyway).
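For illustration, a minimal sketch of the mix-in idea described above
(the trait name follows the proposal, but the signature and surrounding
types are assumptions, not the final API):

// Hypothetical v2 data source that opts in to metadata creation;
// Spark would call create() from DataFrameWriter.save before writing data.
trait MetadataSupport {
  def create(options: Map[String, String]): Unit
}

class MyDataSource extends MetadataSupport {
  override def create(options: Map[String, String]): Unit = {
    // e.g. issue a CREATE TABLE against the external system
    println(s"creating metadata with options: $options")
  }
}

// File-format-style sources simply do not mix in MetadataSupport and
// keep writing data without a metadata step.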

With this updated proposal, I'm calling a new
vote for the data source v2 write path.

The vote will be up for the next 72 hours.
Please reply with your vote:

+1: Yeah, let's go forward and implement the SPIP.
+0: Don't really care.
-1: I don't think this is a good idea because
of the following technical reasons.

Thanks!

On Tue, Oct 3, 2017 at 12:03 AM, Wenchen Fan wrote:

Hi all,

After we merge the infrastructure of data
source v2 read path, and have some
discussion for the write path, now I'm
sending this email to call a vote for Data
Source v2 write path.

The full document of the Data Source API
V2 is:

https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit



The ready-for-review PR that implements
the basic infrastructure for the write path:
https://github.com/apache/spark/pull/19269



The Data Source V1 write path asks
implementations to write a DataFrame
directly, which is painful:
1. Exposing upper-level API like DataFrame
to Data Source API is not good for
maintenance.
2. Data sources may need to preprocess the
input data before writing, like
cluster/sort the input by some columns.
It's better to do the preprocessing in
Spark instead of in the data source.
3. Data sources need to take care of
transaction themselves, which is hard. And
different data sources may come up with a
very similar approach for the transaction,
which leads to many duplicated codes.

To solve these pain points, 

Re: [VOTE] Spark 2.3.0 (RC5)

2018-02-23 Thread Cheng Lian

+1 (binding)

Passed all the tests, looks good.

Cheng


On 2/23/18 15:00, Holden Karau wrote:

+1 (binding)
PySpark artifacts install in a fresh Py3 virtual env

On Feb 23, 2018 7:55 AM, "Denny Lee" wrote:


+1 (non-binding)

On Fri, Feb 23, 2018 at 07:08 Josh Goldsborough wrote:

New to testing out Spark RCs for the community but I was able
to run some of the basic unit tests without error so for what
it's worth, I'm a +1.

On Thu, Feb 22, 2018 at 4:23 PM, Sameer Agarwal wrote:

Please vote on releasing the following candidate as Apache
Spark version 2.3.0. The vote is open until Tuesday
February 27, 2018 at 8:00:00 am UTC and passes if a
majority of at least 3 PMC +1 votes are cast.


[ ] +1 Release this package as Apache Spark 2.3.0

[ ] -1 Do not release this package because ...


To learn more about Apache Spark, please see
https://spark.apache.org/

The tag to be voted on is v2.3.0-rc5:
https://github.com/apache/spark/tree/v2.3.0-rc5

(992447fb30ee9ebb3cf794f2d06f4d63a2d792db)

List of JIRA tickets resolved in this release can be found
here:
https://issues.apache.org/jira/projects/SPARK/versions/12339551


The release files, including signatures, digests, etc. can
be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc5-bin/


Release artifacts are signed with the following key:
https://dist.apache.org/repos/dist/dev/spark/KEYS


The staging repository for this release can be found at:

https://repository.apache.org/content/repositories/orgapachespark-1266/



The documentation corresponding to this release can be
found at:

https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc5-docs/_site/index.html




FAQ

===
What are the unresolved issues targeted for 2.3.0?
===

Please see https://s.apache.org/oXKi. At the time of
writing, there are currently no known release blockers.

=
How can I help test this release?
=

If you are a Spark user, you can help us test this release
by taking an existing Spark workload and running on this
release candidate, then reporting any regressions.

If you're working in PySpark you can set up a virtual env
and install the current RC and see if anything important
breaks. In Java/Scala you can add the staging
repository to your project's resolvers and test with the RC
(make sure to clean up the artifact cache before/after so
you don't end up building with an out-of-date RC going
forward).
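A minimal build.sbt sketch for that (the staging URL is the one listed
above; the artifact coordinates are just an example):

resolvers += "Apache Spark 2.3.0 RC5 staging" at
  "https://repository.apache.org/content/repositories/orgapachespark-1266/"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.0"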

===
What should happen to JIRA tickets still targeting 2.3.0?
===

Committers should look at those and triage. Extremely
important bug fixes, documentation, and API tweaks that
impact compatibility should be worked on immediately.
Everything else please retarget to 2.3.1 or 2.4.0 as
appropriate.

===
Why is my bug not fixed?
===

In order to make timely releases, we will typically not
hold the release unless the bug in question is a
regression from 2.2.0. That being said, if there is
something which is a regression from 2.2.0 and has not
been correctly targeted please ping me or a committer to
help target the issue (you can see the open issues listed
as impacting Spark 2.3.0 at https://s.apache.org/WmoI).






Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-16 Thread Cheng Lian
Dongjoon, I didn't follow the original Hive 2.3 discussion closely. I
thought the original proposal was to replace Hive 1.2 with Hive 2.3, which
seemed risky, and therefore we only introduced Hive 2.3 under the
hadoop-3.2 profile without removing Hive 1.2. But maybe I'm totally wrong
here...

Sean, Yuming's PR https://github.com/apache/spark/pull/26533 showed that
Hadoop 2 + Hive 2 + JDK 11 looks promising. My major motivation is not
about demand, but risk control: coupling Hive 2.3, Hadoop 3.2, and JDK 11
upgrade together looks too risky.

On Sat, Nov 16, 2019 at 4:03 AM Sean Owen  wrote:

> I'd prefer simply not making Hadoop 3 the default until 3.1+, rather
> than introduce yet another build combination. Does Hadoop 2 + Hive 2
> work and is there demand for it?
>
> On Sat, Nov 16, 2019 at 3:52 AM Wenchen Fan  wrote:
> >
> > Do we have a limitation on the number of pre-built distributions? Seems
> this time we need
> > 1. hadoop 2.7 + hive 1.2
> > 2. hadoop 2.7 + hive 2.3
> > 3. hadoop 3 + hive 2.3
> >
> > AFAIK we always built with JDK 8 (but make it JDK 11 compatible), so
> don't need to add JDK version to the combination.
> >
> > On Sat, Nov 16, 2019 at 4:05 PM Dongjoon Hyun 
> wrote:
> >>
> >> Thank you for suggestion.
> >>
> >> Having `hive-2.3` profile sounds good to me because it's orthogonal to
> Hadoop 3.
> >> IIRC, originally, it was proposed in that way, but we put it under
> `hadoop-3.2` to avoid adding new profiles at that time.
> >>
> >> And, I'm wondering if you are considering additional pre-built
> distribution and Jenkins jobs.
> >>
> >> Bests,
> >> Dongjoon.
> >>
>


Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-15 Thread Cheng Lian
Similar to Xiao, my major concern about making Hadoop 3.2 the default
Hadoop version is quality control. The current hadoop-3.2 profile covers
too many major component upgrades, i.e.:

   - Hadoop 3.2
   - Hive 2.3
   - JDK 11

We have already found and fixed some feature and performance regressions
related to these upgrades. Empirically, I’m not surprised at all if more
regressions are lurking somewhere. On the other hand, we do want help from
the community to help us to evaluate and stabilize these new changes.
Following that, I’d like to propose:

   1.

   Introduce a new profile hive-2.3 to enable (hopefully) less risky
   Hadoop/Hive/JDK version combinations.

   This new profile allows us to decouple Hive 2.3 from the hadoop-3.2
   profile, so that users may try out some less risky Hadoop/Hive/JDK
   combinations: if you only want Hive 2.3 and/or JDK 11, you don’t need to
   face potential regressions introduced by the Hadoop 3.2 upgrade.

   Yuming Wang has already sent out PR #26533
    to exercise the Hadoop 2.7
   + Hive 2.3 + JDK 11 combination (this PR does not have the hive-2.3
   profile yet), and the result looks promising: the Kafka streaming and Arrow
   related test failures should be irrelevant to the topic discussed here.

   After decoupling Hive 2.3 and Hadoop 3.2, I don’t think it makes a lot
   of difference between having Hadoop 2.7 or Hadoop 3.2 as the default Hadoop
   version. For users who are still using Hadoop 2.x in production, they will
   have to use a hadoop-provided prebuilt package or build Spark 3.0
   against their own 2.x version anyway. It does make a difference for cloud
   users who don’t use Hadoop at all, though. And this probably also helps to
   stabilize the Hadoop 3.2 code path faster since our PR builder will
   exercise it regularly.
   2.

   Defer Hadoop 2.x upgrade to Spark 3.1+

   I personally do want to bump our Hadoop 2.x version to 2.9 or even 2.10.
   Steve has already stated the benefits very well. My worry here is still
   quality control: Spark 3.0 has already had tons of changes and major
   component version upgrades that are subject to all kinds of known and
   hidden regressions. Having Hadoop 2.7 there provides us a safety net, since
   it’s proven to be stable. To me, it’s much less risky to upgrade Hadoop 2.7
   to 2.9/2.10 after we stabilize the Hadoop 3.2/Hive 2.3 combinations in the
   next 1 or 2 Spark 3.x releases.

Cheng

On Mon, Nov 4, 2019 at 11:24 AM Koert Kuipers  wrote:

> i get that cdh and hdp backport a lot and in that way left 2.7 behind. but
> they kept the public apis stable at the 2.7 level, because thats kind of
> the point. arent those the hadoop apis spark uses?
>
> On Mon, Nov 4, 2019 at 10:07 AM Steve Loughran 
> wrote:
>
>>
>>
>> On Mon, Nov 4, 2019 at 12:39 AM Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> On Fri, Nov 1, 2019 at 8:41 AM Steve Loughran
>>>  wrote:
>>>
 It would be really good if the spark distributions shipped with later
 versions of the hadoop artifacts.

>>>
>>> I second this. If we need to keep a Hadoop 2.x profile around, why not
>>> make it Hadoop 2.8 or something newer?
>>>
>>
>> go for 2.9
>>
>>>
>>> Koert Kuipers  wrote:
>>>
 given that latest hdp 2.x is still hadoop 2.7 bumping hadoop 2 profile
 to latest would probably be an issue for us.
>>>
>>>
>>> When was the last time HDP 2.x bumped their minor version of Hadoop? Do
>>> we want to wait for them to bump to Hadoop 2.8 before we do the same?
>>>
>>
>> The internal builds of CDH and HDP are not those of ASF 2.7.x. A really
>> large proportion of the later branch-2 patches are backported. 2,7 was left
>> behind a long time ago
>>
>>
>>
>>
>


Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-15 Thread Cheng Lian
Cc Yuming, Steve, and Dongjoon

On Fri, Nov 15, 2019 at 10:37 AM Cheng Lian  wrote:

> Similar to Xiao, my major concern about making Hadoop 3.2 the default
> Hadoop version is quality control. The current hadoop-3.2 profile covers
> too many major component upgrades, i.e.:
>
>- Hadoop 3.2
>- Hive 2.3
>- JDK 11
>
> We have already found and fixed some feature and performance regressions
> related to these upgrades. Empirically, I’m not surprised at all if more
> regressions are lurking somewhere. On the other hand, we do want help from
> the community to help us to evaluate and stabilize these new changes.
> Following that, I’d like to propose:
>
>1.
>
>Introduce a new profile hive-2.3 to enable (hopefully) less risky
>Hadoop/Hive/JDK version combinations.
>
>This new profile allows us to decouple Hive 2.3 from the hadoop-3.2
>profile, so that users may try out some less risky Hadoop/Hive/JDK
>combinations: if you only want Hive 2.3 and/or JDK 11, you don’t need to
>face potential regressions introduced by the Hadoop 3.2 upgrade.
>
>Yuming Wang has already sent out PR #26533
><https://github.com/apache/spark/pull/26533> to exercise the Hadoop
>2.7 + Hive 2.3 + JDK 11 combination (this PR does not have the hive-2.3
>profile yet), and the result looks promising: the Kafka streaming and Arrow
>related test failures should be irrelevant to the topic discussed here.
>
>After decoupling Hive 2.3 and Hadoop 3.2, I don’t think it makes a lot
>of difference between having Hadoop 2.7 or Hadoop 3.2 as the default Hadoop
>version. For users who are still using Hadoop 2.x in production, they will
>have to use a hadoop-provided prebuilt package or build Spark 3.0
>against their own 2.x version anyway. It does make a difference for cloud
>users who don’t use Hadoop at all, though. And this probably also helps to
>stabilize the Hadoop 3.2 code path faster since our PR builder will
>exercise it regularly.
>2.
>
>Defer Hadoop 2.x upgrade to Spark 3.1+
>
>I personally do want to bump our Hadoop 2.x version to 2.9 or even
>2.10. Steve has already stated the benefits very well. My worry here is
>still quality control: Spark 3.0 has already had tons of changes and major
>component version upgrades that are subject to all kinds of known and
>hidden regressions. Having Hadoop 2.7 there provides us a safety net, since
>it’s proven to be stable. To me, it’s much less risky to upgrade Hadoop 2.7
>to 2.9/2.10 after we stabilize the Hadoop 3.2/Hive 2.3 combinations in the
>next 1 or 2 Spark 3.x releases.
>
> Cheng
>
> On Mon, Nov 4, 2019 at 11:24 AM Koert Kuipers  wrote:
>
>> i get that cdh and hdp backport a lot and in that way left 2.7 behind.
>> but they kept the public apis stable at the 2.7 level, because thats kind
>> of the point. arent those the hadoop apis spark uses?
>>
>> On Mon, Nov 4, 2019 at 10:07 AM Steve Loughran
>>  wrote:
>>
>>>
>>>
>>> On Mon, Nov 4, 2019 at 12:39 AM Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
>>>> On Fri, Nov 1, 2019 at 8:41 AM Steve Loughran
>>>>  wrote:
>>>>
>>>>> It would be really good if the spark distributions shipped with later
>>>>> versions of the hadoop artifacts.
>>>>>
>>>>
>>>> I second this. If we need to keep a Hadoop 2.x profile around, why not
>>>> make it Hadoop 2.8 or something newer?
>>>>
>>>
>>> go for 2.9
>>>
>>>>
>>>> Koert Kuipers  wrote:
>>>>
>>>>> given that latest hdp 2.x is still hadoop 2.7 bumping hadoop 2 profile
>>>>> to latest would probably be an issue for us.
>>>>
>>>>
>>>> When was the last time HDP 2.x bumped their minor version of Hadoop? Do
>>>> we want to wait for them to bump to Hadoop 2.8 before we do the same?
>>>>
>>>
>>> The internal builds of CDH and HDP are not those of ASF 2.7.x. A really
>>> large proportion of the later branch-2 patches are backported. 2,7 was left
>>> behind a long time ago
>>>
>>>
>>>
>>>
>>


Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

2019-11-19 Thread Cheng Lian
Thanks for taking care of this, Dongjoon!

We can target SPARK-20202 to 3.1.0, but I don't think we should do it
immediately after cutting the branch-3.0. The Hive 1.2 code paths can only
be removed once the Hive 2.3 code paths are proven to be stable. If it
turned out to be buggy in Spark 3.1, we may want to further postpone
SPARK-20202 to 3.2.0 by then.

On Tue, Nov 19, 2019 at 2:53 PM Dongjoon Hyun 
wrote:

> Yes. It does. I meant SPARK-20202.
>
> Thanks. I understand that it can be considered like Scala version issue.
> So, that's the reason why I put this as a `policy` issue from the
> beginning.
>
> > First of all, I want to put this as a policy issue instead of a
> technical issue.
>
> In the policy perspective, we should remove this immediately if we have a
> solution to fix this.
> For now, I set `Target Versions` of SPARK-20202 to `3.1.0` according to
> the current discussion status.
>
> https://issues.apache.org/jira/browse/SPARK-20202
>
> And, if there is no other issues, I'll create a PR to remove it from
> `master` branch when we cut `branch-3.0`.
>
> For additional `hadoop-2.7 with Hive 2.3` pre-built distribution, how do
> you think about this, Sean?
> The preparation is already started in another email thread and I believe
> that is a keystone to prove `Hive 2.3` version stability
> (which Cheng/Hyukjin/you asked).
>
> Bests,
> Dongjoon.
>
>
> On Tue, Nov 19, 2019 at 2:09 PM Cheng Lian  wrote:
>
>> It's kinda like Scala version upgrade. Historically, we only remove the
>> support of an older Scala version when the newer version is proven to be
>> stable after one or more Spark minor versions.
>>
>> On Tue, Nov 19, 2019 at 2:07 PM Cheng Lian  wrote:
>>
>>> Hmm, what exactly did you mean by "remove the usage of forked `hive` in
>>> Apache Spark 3.0 completely officially"? I thought you wanted to remove the
>>> forked Hive 1.2 dependencies completely, no? As long as we still keep the
>>> Hive 1.2 in Spark 3.0, I'm fine with that. I personally don't have a
>>> particular preference between using Hive 1.2 or 2.3 as the default Hive
>>> version. After all, for end-users and providers who need a particular
>>> version combination, they can always build Spark with proper profiles
>>> themselves.
>>>
>>> And thanks for clarifying the Hive 2.3.5 issue. I didn't notice that
>>> it's due to the folder name.
>>>
>>> On Tue, Nov 19, 2019 at 11:15 AM Dongjoon Hyun 
>>> wrote:
>>>
>>>> BTW, `hive.version.short` is a directory name. We are using 2.3.6 only.
>>>>
>>>> For directory name, we use '1.2.1' and '2.3.5' because we just delayed
>>>> the renaming the directories until 3.0.0 deadline to minimize the diff.
>>>>
>>>> We can replace it immediately if we want right now.
>>>>
>>>>
>>>>
>>>> On Tue, Nov 19, 2019 at 11:11 AM Dongjoon Hyun 
>>>> wrote:
>>>>
>>>>> Hi, Cheng.
>>>>>
>>>>> This is irrelevant to JDK11 and Hadoop 3. I'm talking about JDK8 world.
>>>>> If we consider them, it could be the followings.
>>>>>
>>>>> +--+-++
>>>>> |  | Hive 1.2.1 fork |  Apache Hive 2.3.6 |
>>>>> +-+
>>>>> |Legitimate|X| O  |
>>>>> |JDK11 |X| O  |
>>>>> |Hadoop3   |X| O  |
>>>>> |Hadoop2   |O| O  |
>>>>> |Functions | Baseline|   More |
>>>>> |Bug fixes | Baseline|   More |
>>>>> +-+
>>>>>
>>>>> To stabilize Spark's Hive 2.3 usage, we should use it by ourselves
>>>>> (including Jenkins/GitHubAction/AppVeyor).
>>>>>
>>>>> For me, AS-IS 3.0 is not enough for that. According to your advices,
>>>>> to give more visibility to the whole community,
>>>>>
>>>>> 1. We need to give additional `hadoop-2.7 with Hive 2.3` pre-built
>>>>> distribution
>>>>> 2. We need to switch our default Hive usage to 2.3 in `master` for 3.1
>>>>> after `branch-3.0` branch cut.
>>>>>
>>>>> I know that we have been reluctant to (1) and (2) due to its burden.
>>>>> But, it's time to prepare. 

Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

2019-11-19 Thread Cheng Lian
Dongjoon, I'm with Hyukjin. There should be at least one Spark 3.x minor
release to stabilize Hive 2.3 code paths before retiring the Hive 1.2
fork. Even today, the Hive 2.3.6 version bundled in Spark 3.0 is still
buggy in terms of JDK 11 support. (BTW, I just found that our root POM is
referring to both Hive 2.3.6 and 2.3.5 at the moment, see here and here.)

Again, I'm happy to get rid of ancient legacy dependencies like Hadoop 2.7
and the Hive 1.2 fork, but I do believe that we need a safety net for Spark
3.0. For preview releases, I'm afraid that their visibility is not good
enough for covering such major upgrades.

On Tue, Nov 19, 2019 at 8:39 AM Dongjoon Hyun 
wrote:

> Thank you for feedback, Hyujkjin and Sean.
>
> I proposed `preview-2` for that purpose but I'm also +1 for do that at 3.1
> if we can make a decision to eliminate the illegitimate Hive fork reference
> immediately after `branch-3.0` cut.
>
> Sean, I'm referencing Cheng Lian's email for the status of `hadoop-2.7`.
>
> -
> https://lists.apache.org/thread.html/623dd9a6d4e951daeec985feffede12c7b419e03c2965018de7a72f1@%3Cdev.spark.apache.org%3E
>
> The way I see this is that it's not a user problem. Apache Spark community
> didn't try to drop the illegitimate Hive fork yet.
> We need to drop it by ourselves because we created it and it's our bad.
>
> Bests,
> Dongjoon.
>
>
>
> On Tue, Nov 19, 2019 at 5:06 AM Sean Owen  wrote:
>
>> Just to clarify, as even I have lost the details over time: hadoop-2.7
>> works with hive-2.3? it isn't tied to hadoop-3.2?
>> Roughly how much risk is there in using the Hive 1.x fork over Hive
>> 2.x, for end users using Hive via Spark?
>> I don't have a strong opinion, other than sharing the view that we
>> have to dump the Hive 1.x fork at the first opportunity.
>> Question is simply how much risk that entails. Keeping in mind that
>> Spark 3.0 is already something that people understand works
>> differently. We can accept some behavior changes.
>>
>> On Mon, Nov 18, 2019 at 11:11 PM Dongjoon Hyun 
>> wrote:
>> >
>> > Hi, All.
>> >
>> > First of all, I want to put this as a policy issue instead of a
>> technical issue.
>> > Also, this is orthogonal from `hadoop` version discussion.
>> >
>> > Apache Spark community kept (not maintained) the forked Apache Hive
>> > 1.2.1 because there has been no other options before. As we see at
>> > SPARK-20202, it's not a desirable situation among the Apache projects.
>> >
>> > https://issues.apache.org/jira/browse/SPARK-20202
>> >
>> > Also, please note that we `kept`, not `maintained`, because we know
>> it's not good.
>> > There are several attempt to update that forked repository
>> > for several reasons (Hadoop 3 support is one of the example),
>> > but those attempts are also turned down.
>> >
>> > From Apache Spark 3.0, it seems that we have a new feasible option
>> > `hive-2.3` profile. What about moving forward in this direction further?
>> >
>> > For example, can we remove the usage of forked `hive` in Apache Spark
>> 3.0
>> > completely officially? If someone still needs to use the forked `hive`,
>> we can
>> > have a profile `hive-1.2`. Of course, it should not be a default
>> profile in the community.
>> >
>> > I want to say this is a goal we should achieve someday.
>> > If we don't do anything, nothing happen. At least we need to prepare
>> this.
>> > Without any preparation, Spark 3.1+ will be the same.
>> >
>> > Shall we focus on what are our problems with Hive 2.3.6?
>> > If the only reason is that we didn't use it before, we can release
>> another
>> > `3.0.0-preview` for that.
>> >
>> > Bests,
>> > Dongjoon.
>>
>


Re: The Myth: the forked Hive 1.2.1 is stabler than XXX

2019-11-20 Thread Cheng Lian
Just to summarize my points:

   1. Let's still keep the Hive 1.2 dependency in Spark 3.0, but it is
   optional. End-users may choose between Hive 1.2/2.3 via a new profile
   (either adding a hive-1.2 profile or adding a hive-2.3 profile works for
   me, depending on which Hive version we pick as the default version).
   2. Decouple Hive version upgrade and Hadoop version upgrade, so that
   people may have more choices in production, and makes Spark 3.0 migration
   easier (e.g., you don't have to switch to Hadoop 3 in order to pick Hive
   2.3 and/or JDK 11.).
   3. For default Hadoop/Hive versions in Spark 3.0, I personally do not
   have a preference as long as the above two are met.


On Wed, Nov 20, 2019 at 3:22 PM Cheng Lian  wrote:

> Dongjoon, I don't think we have any conflicts here. As stated in other
> threads multiple times, as long as Hive 2.3 and Hadoop 3.2 version upgrades
> can be decoupled, I have no preference over picking which Hive/Hadoop
> version as the default version. So the following two plans both work for me:
>
>1. Keep Hive 1.2 as default Spark 3.0 execution Hive version, and have
>an extra hive-2.3 profile.
>2. Choose Hive 2.3 as default Spark 3.0 execution Hive version, and
>have an extra hive-1.2 profile.
>
> BTW, I was also discussing Hive dependency issues with other people
> offline, and I realized that the Hive isolated client loader is not well
> known, and caused unnecessary confusion/worry. So I would like to provide
> some background context for readers who are not familiar with Spark Hive
> integration here. *Building Spark 3.0 with Hive 1.2.1 does NOT mean that
> you can only interact with Hive 1.2.1.*
>
> Spark does work with different versions of Hive metastore via an isolated
> classloading mechanism. *Even if Spark itself is built with the Hive
> 1.2.1 fork, you can still interact with a Hive 2.3 metastore, and this has
> been true ever since Spark 1.x.* In order to do this, just set the
> following two options according to instructions in our official doc page
> <http://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html#interacting-with-different-versions-of-hive-metastore>
> :
>
>- spark.sql.hive.metastore.version
>- spark.sql.hive.metastore.jars
>
> Say you set "spark.sql.hive.metastore.version" to "2.3.6", and
> "spark.sql.hive.metastore.jars" to "maven", Spark will pull Hive 2.3.6
> dependencies from Maven at runtime when initializing the Hive metastore
> client. And those dependencies will NOT conflict with the built-in Hive
> 1.2.1 jars, because the downloaded jars are loaded using an isolated
> classloader (see here
> <https://github.com/apache/spark/blob/1febd373ea806326d269a60048ee52543a76c918/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala>).
> Historically, we call these two sets of Hive dependencies "execution Hive"
> and "metastore Hive". The former is mostly used for features like SerDe,
> while the latter is used to interact with Hive metastore. And the Hive
> version upgrade we are discussing here is about the execution Hive.
>
> Cheng
>
> On Wed, Nov 20, 2019 at 2:38 PM Dongjoon Hyun 
> wrote:
>
>> Nice. That's a progress.
>>
>> Let's narrow down to the path. We need to clarify what is the criteria we
>> can agree.
>>
>> 1. What does `battle-tested for years` mean exactly?
>> How and when can we start the `battle-tested` stage for Hive 2.3?
>>
>> 2. What is the new "Hive integration in Spark"?
>> During introducing Hive 2.3, we fixed the compatibility stuff as you
>> said.
>> Most of code is shared for Hive 1.2 and Hive 2.3.
>> That means if there is a bug inside this shared code, both of them
>> will be affected.
>> Of course, we can fix this because it's Spark code. We will learn and
>> fix it as you said.
>>
>> >  Yes, there are issues, but people have learned how to get along
>> with these issues.
>>
>> The only non-shared code are the following.
>> Do you have a concern on the following directories?
>> If there is no bugs on the following codebase, can we switch?
>>
>> $ find . -name v2.3.5
>> ./sql/core/v2.3.5
>> ./sql/hive-thriftserver/v2.3.5
>>
>> 3. We know that we can keep both code bases, but the community should
>> choose Hive 2.3 officially.
>> That's the right choice in the Apache project policy perspective. At
>> least, Sean and I prefer that.
>> If someone really want to stick to Hive 1.2 fork, they can use it at
>> their own risks.
>>
>> > for Spark 3.0 end-

Re: The Myth: the forked Hive 1.2.1 is stabler than XXX

2019-11-20 Thread Cheng Lian
Oh, actually, in order to decouple Hadoop 3.2 and Hive 2.3 upgrades, we
will need a hive-2.3 profile anyway, no matter having the hive-1.2 profile
or not.

On Wed, Nov 20, 2019 at 3:33 PM Cheng Lian  wrote:

> Just to summarize my points:
>
>1. Let's still keep the Hive 1.2 dependency in Spark 3.0, but it is
>optional. End-users may choose between Hive 1.2/2.3 via a new profile
>(either adding a hive-1.2 profile or adding a hive-2.3 profile works for
>me, depending on which Hive version we pick as the default version).
>2. Decouple Hive version upgrade and Hadoop version upgrade, so that
>people may have more choices in production, and makes Spark 3.0 migration
>easier (e.g., you don't have to switch to Hadoop 3 in order to pick Hive
>2.3 and/or JDK 11.).
>3. For default Hadoop/Hive versions in Spark 3.0, I personally do not
>have a preference as long as the above two are met.
>
>
> On Wed, Nov 20, 2019 at 3:22 PM Cheng Lian  wrote:
>
>> Dongjoon, I don't think we have any conflicts here. As stated in other
>> threads multiple times, as long as Hive 2.3 and Hadoop 3.2 version upgrades
>> can be decoupled, I have no preference over picking which Hive/Hadoop
>> version as the default version. So the following two plans both work for me:
>>
>>1. Keep Hive 1.2 as default Spark 3.0 execution Hive version, and
>>have an extra hive-2.3 profile.
>>2. Choose Hive 2.3 as default Spark 3.0 execution Hive version, and
>>have an extra hive-1.2 profile.
>>
>> BTW, I was also discussing Hive dependency issues with other people
>> offline, and I realized that the Hive isolated client loader is not well
>> known, and caused unnecessary confusion/worry. So I would like to provide
>> some background context for readers who are not familiar with Spark Hive
>> integration here. *Building Spark 3.0 with Hive 1.2.1 does NOT mean that
>> you can only interact with Hive 1.2.1.*
>>
>> Spark does work with different versions of Hive metastore via an isolated
>> classloading mechanism. *Even if Spark itself is built with the Hive
>> 1.2.1 fork, you can still interact with a Hive 2.3 metastore, and this has
>> been true ever since Spark 1.x.* In order to do this, just set the
>> following two options according to instructions in our official doc page
>> <http://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html#interacting-with-different-versions-of-hive-metastore>
>> :
>>
>>- spark.sql.hive.metastore.version
>>- spark.sql.hive.metastore.jars
>>
>> Say you set "spark.sql.hive.metastore.version" to "2.3.6", and
>> "spark.sql.hive.metastore.jars" to "maven", Spark will pull Hive 2.3.6
>> dependencies from Maven at runtime when initializing the Hive metastore
>> client. And those dependencies will NOT conflict with the built-in Hive
>> 1.2.1 jars, because the downloaded jars are loaded using an isolated
>> classloader (see here
>> <https://github.com/apache/spark/blob/1febd373ea806326d269a60048ee52543a76c918/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala>).
>> Historically, we call these two sets of Hive dependencies "execution Hive"
>> and "metastore Hive". The former is mostly used for features like SerDe,
>> while the latter is used to interact with Hive metastore. And the Hive
>> version upgrade we are discussing here is about the execution Hive.
>>
>> Cheng
>>
>> On Wed, Nov 20, 2019 at 2:38 PM Dongjoon Hyun 
>> wrote:
>>
>>> Nice. That's a progress.
>>>
>>> Let's narrow down to the path. We need to clarify what is the criteria
>>> we can agree.
>>>
>>> 1. What does `battle-tested for years` mean exactly?
>>> How and when can we start the `battle-tested` stage for Hive 2.3?
>>>
>>> 2. What is the new "Hive integration in Spark"?
>>> During introducing Hive 2.3, we fixed the compatibility stuff as you
>>> said.
>>> Most of code is shared for Hive 1.2 and Hive 2.3.
>>> That means if there is a bug inside this shared code, both of them
>>> will be affected.
>>> Of course, we can fix this because it's Spark code. We will learn
>>> and fix it as you said.
>>>
>>> >  Yes, there are issues, but people have learned how to get along
>>> with these issues.
>>>
>>> The only non-shared code are the following.
>>> Do you have a concern on the following directories?
>>> If there is no bugs

Re: The Myth: the forked Hive 1.2.1 is stabler than XXX

2019-11-20 Thread Cheng Lian
Dongjoon, I don't think we have any conflicts here. As stated in other
threads multiple times, as long as Hive 2.3 and Hadoop 3.2 version upgrades
can be decoupled, I have no preference over picking which Hive/Hadoop
version as the default version. So the following two plans both work for me:

   1. Keep Hive 1.2 as default Spark 3.0 execution Hive version, and have
   an extra hive-2.3 profile.
   2. Choose Hive 2.3 as default Spark 3.0 execution Hive version, and have
   an extra hive-1.2 profile.

BTW, I was also discussing Hive dependency issues with other people
offline, and I realized that the Hive isolated client loader is not well
known, and caused unnecessary confusion/worry. So I would like to provide
some background context for readers who are not familiar with Spark Hive
integration here. *Building Spark 3.0 with Hive 1.2.1 does NOT mean that
you can only interact with Hive 1.2.1.*

Spark does work with different versions of Hive metastore via an isolated
classloading mechanism. *Even if Spark itself is built with the Hive 1.2.1
fork, you can still interact with a Hive 2.3 metastore, and this has been
true ever since Spark 1.x.* In order to do this, just set the following two
options according to instructions in our official doc page
<http://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html#interacting-with-different-versions-of-hive-metastore>
:

   - spark.sql.hive.metastore.version
   - spark.sql.hive.metastore.jars

Say you set "spark.sql.hive.metastore.version" to "2.3.6", and
"spark.sql.hive.metastore.jars" to "maven", Spark will pull Hive 2.3.6
dependencies from Maven at runtime when initializing the Hive metastore
client. And those dependencies will NOT conflict with the built-in Hive
1.2.1 jars, because the downloaded jars are loaded using an isolated
classloader (see here
<https://github.com/apache/spark/blob/1febd373ea806326d269a60048ee52543a76c918/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala>).
Historically, we call these two sets of Hive dependencies "execution Hive"
and "metastore Hive". The former is mostly used for features like SerDe,
while the latter is used to interact with Hive metastore. And the Hive
version upgrade we are discussing here is about the execution Hive.
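A minimal sketch of that setup (the version string is illustrative; both
configuration keys are existing Spark SQL options):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-metastore-2.3")
  .enableHiveSupport()
  .config("spark.sql.hive.metastore.version", "2.3.6")
  .config("spark.sql.hive.metastore.jars", "maven")  // fetch matching metastore jars at runtime
  .getOrCreate()

spark.sql("SHOW DATABASES").show()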

Cheng

On Wed, Nov 20, 2019 at 2:38 PM Dongjoon Hyun 
wrote:

> Nice. That's a progress.
>
> Let's narrow down to the path. We need to clarify what is the criteria we
> can agree.
>
> 1. What does `battle-tested for years` mean exactly?
> How and when can we start the `battle-tested` stage for Hive 2.3?
>
> 2. What is the new "Hive integration in Spark"?
> During introducing Hive 2.3, we fixed the compatibility stuff as you
> said.
> Most of code is shared for Hive 1.2 and Hive 2.3.
> That means if there is a bug inside this shared code, both of them
> will be affected.
> Of course, we can fix this because it's Spark code. We will learn and
> fix it as you said.
>
> >  Yes, there are issues, but people have learned how to get along
> with these issues.
>
> The only non-shared code are the following.
> Do you have a concern on the following directories?
> If there is no bugs on the following codebase, can we switch?
>
> $ find . -name v2.3.5
> ./sql/core/v2.3.5
> ./sql/hive-thriftserver/v2.3.5
>
> 3. We know that we can keep both code bases, but the community should
> choose Hive 2.3 officially.
> That's the right choice in the Apache project policy perspective. At
> least, Sean and I prefer that.
> If someone really want to stick to Hive 1.2 fork, they can use it at
> their own risks.
>
> > for Spark 3.0 end-users who really don't want to interact with this
> Hive 1.2 fork, they can always use Hive 2.3 at their own risks.
>
> Specifically, what about having a profile `hive-1.2` at `3.0.0` with the
> default Hive 2.3 pom at least?
> How do you think about that way, Cheng?
>
> Bests,
> Dongjoon.
>
>
> On Wed, Nov 20, 2019 at 12:59 PM Cheng Lian  wrote:
>
>> Hey Dongjoon and Felix,
>>
>> I totally agree that Hive 2.3 is more stable than Hive 1.2. Otherwise, we
>> wouldn't even consider integrating with Hive 2.3 in Spark 3.0.
>>
>> However, *"Hive" and "Hive integration in Spark" are two quite different
>> things*, and I don't think anybody has ever mentioned "the forked Hive
>> 1.2.1 is stable" in any recent Hadoop/Hive version discussions (at least I
>> double-checked all my replies).
>>
>> What I really care about is the stability and quality of "Hive
>> integration in Spark", which have gone through some major updates due to
>> the recent Hive 2.3 upgrade 

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-20 Thread Cheng Lian
Hey Nicholas,

Thanks for pointing this out. I just realized that I misread the
spark-hadoop-cloud POM. Previously, in Spark 2.4, two profiles,
"hadoop-2.7" and "hadoop-3.1", were referenced in the spark-hadoop-cloud
POM (here
<https://github.com/apache/spark/blob/v2.4.4/hadoop-cloud/pom.xml#L174> and
here <https://github.com/apache/spark/blob/v2.4.4/hadoop-cloud/pom.xml#L213>).
But in the current master (3.0.0-SNAPSHOT), only the "hadoop-3.2" profile
is mentioned. And I came to the wrong conclusion that spark-hadoop-cloud in
Spark 3.0.0 is only available with the "hadoop-3.2" profile. Apologies for
the misleading information.

Cheng



On Tue, Nov 19, 2019 at 8:57 PM Nicholas Chammas 
wrote:

> > I don't think the default Hadoop version matters except for the
> spark-hadoop-cloud module, which is only meaningful under the hadoop-3.2
> profile.
>
> What do you mean by "only meaningful under the hadoop-3.2 profile"?
>
> On Tue, Nov 19, 2019 at 5:40 PM Cheng Lian  wrote:
>
>> Hey Steve,
>>
>> In terms of Maven artifact, I don't think the default Hadoop version
>> matters except for the spark-hadoop-cloud module, which is only meaningful
>> under the hadoop-3.2 profile. All  the other spark-* artifacts published to
>> Maven central are Hadoop-version-neutral.
>>
>> Another issue about switching the default Hadoop version to 3.2 is
>> PySpark distribution. Right now, we only publish PySpark artifacts prebuilt
>> with Hadoop 2.x to PyPI. I'm not sure whether bumping the Hadoop dependency
>> to 3.2 is feasible for PySpark users. Or maybe we should publish PySpark
>> prebuilt with both Hadoop 2.x and 3.x. I'm open to suggestions on this one.
>>
>> Again, as long as Hive 2.3 and Hadoop 3.2 upgrade can be decoupled via
>> the proposed hive-2.3 profile, I personally don't have a preference over
>> having Hadoop 2.7 or 3.2 as the default Hadoop version. But just for
>> minimizing the release management work, in case we decided to publish other
>> spark-* Maven artifacts from a Hadoop 2.7 build, we can still special case
>> spark-hadoop-cloud and publish it using a hadoop-3.2 build.
>>
>> On Mon, Nov 18, 2019 at 8:39 PM Dongjoon Hyun 
>> wrote:
>>
>>> I also agree with Steve and Felix.
>>>
>>> Let's have another thread to discuss Hive issue
>>>
>>> because this thread was originally for `hadoop` version.
>>>
>>> And, now we can have `hive-2.3` profile for both `hadoop-2.7` and
>>> `hadoop-3.0` versions.
>>>
>>> We don't need to mix both.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>> On Mon, Nov 18, 2019 at 8:19 PM Felix Cheung 
>>> wrote:
>>>
>>>> 1000% with Steve, the org.spark-project hive 1.2 will need a solution.
>>>> It is old and rather buggy; and it’s been *years*
>>>>
>>>> I think we should decouple hive change from everything else if people
>>>> are concerned?
>>>>
>>>> --
>>>> *From:* Steve Loughran 
>>>> *Sent:* Sunday, November 17, 2019 9:22:09 AM
>>>> *To:* Cheng Lian 
>>>> *Cc:* Sean Owen ; Wenchen Fan ;
>>>> Dongjoon Hyun ; dev ;
>>>> Yuming Wang 
>>>> *Subject:* Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?
>>>>
>>>> Can I take this moment to remind everyone that the version of hive
>>>> which spark has historically bundled (the org.spark-project one) is an
>>>> orphan project put together to deal with Hive's shading issues and a source
>>>> of unhappiness in the Hive project. Whatever gets shipped should do its
>>>> best to avoid including that file.
>>>>
>>>> Postponing a switch to hadoop 3.x after spark 3.0 is probably the
>>>> safest move from a risk minimisation perspective. If something has broken
>>>> then you can start with the assumption that it is in the o.a.s
>>>> packages without having to debug o.a.hadoop and o.a.hive first. There is a
>>>> cost: if there are problems with the hadoop / hive dependencies those teams
>>>> will inevitably ignore filed bug reports, for the same reason the Spark team
>>>> will probably close 1.6-related JIRAs as WONTFIX. WONTFIX responses for
>>>> the Hadoop 2.x line include any compatibility issues with Java 9+. Do bear
>>>> that in mind. It's not been tested, it has dependencies on artifacts we
>>>> know are incompatible, and as far as the Hadoop project is concern

Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

2019-11-20 Thread Cheng Lian
Sean, thanks for the corner cases you listed. They make a lot of sense. Now
I'm inclined to have Hive 2.3 as the default version.

Dongjoon, apologies if I didn't make it clear before. What made me
concerned initially was only the following part:

> can we remove the usage of forked `hive` in Apache Spark 3.0 completely
officially?

So having Hive 2.3 as the default Hive version and adding a `hive-1.2`
profile to keep the Hive 1.2.1 fork looks like a feasible approach to me.
Thanks for starting the discussion!
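For illustration, a rough sketch of what that could look like on the build
side (the `hive-1.2` profile is only being proposed here, so the name and
wiring are assumptions rather than an existing Spark 3.0 profile):

$ # Default build would pick up Hive 2.3:
$ ./build/mvn -Phive -Phive-thriftserver -DskipTests clean package
$ # Opting back into the forked Hive 1.2.1 would require the explicit profile:
$ ./build/mvn -Phive-1.2 -Phive -Phive-thriftserver -DskipTests clean package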

On Wed, Nov 20, 2019 at 9:46 AM Dongjoon Hyun 
wrote:

> Yes. Right. That's the situation we are hitting and the result I expected.
> We need to change our default to Hive 2 in the POM.
>
> Dongjoon.
>
>
> On Wed, Nov 20, 2019 at 5:20 AM Sean Owen  wrote:
>
>> Yes, good point. A user would get whatever the POM says without
>> profiles enabled so it matters.
>>
>> Playing it out, an app _should_ compile with the Spark dependency
>> marked 'provided'. In that case the app that is spark-submit-ted is
>> agnostic to the Hive dependency as the only one that matters is what's
>> on the cluster. Right? we don't leak through the Hive API in the Spark
>> API. And yes it's then up to the cluster to provide whatever version
>> it wants. Vendors will have made a specific version choice when
>> building their distro one way or the other.
>>
>> If you run a Spark cluster yourself, you're using the binary distro,
>> and we're already talking about also publishing a binary distro with
>> this variation, so that's not the issue.
>>
>> The corner cases where it might matter are:
>>
>> - I unintentionally package Spark in the app and by default pull in
>> Hive 2 when I will deploy against Hive 1. But that's user error, and
>> causes other problems
>> - I run tests locally in my project, which will pull in a default
>> version of Hive defined by the POM
>>
>> Double-checking, is that right? if so it kind of implies it doesn't
>> matter. Which is an argument either way about what's the default. I
>> too would then prefer defaulting to Hive 2 in the POM. Am I missing
>> something about the implication?
>>
>> (That fork will stay published forever anyway, that's not an issue per
>> se.)
>>
>> On Wed, Nov 20, 2019 at 1:40 AM Dongjoon Hyun 
>> wrote:
>> > Sean, our published POM is pointing and advertising the illegitimate
>> Hive 1.2 fork as a compile dependency.
>> > Yes. It can be overridden. So, why does Apache Spark need to publish
>> like that?
>> > If someone want to use that illegitimate Hive 1.2 fork, let them
>> override it. We are unable to delete those illegitimate Hive 1.2 fork.
>> > Those artifacts will be orphans.
>> >
>>
>


Re: The Myth: the forked Hive 1.2.1 is stabler than XXX

2019-11-20 Thread Cheng Lian
Hey Dongjoon and Felix,

I totally agree that Hive 2.3 is more stable than Hive 1.2. Otherwise, we
wouldn't even consider integrating with Hive 2.3 in Spark 3.0.

However, *"Hive" and "Hive integration in Spark" are two quite different
things*, and I don't think anybody has ever mentioned "the forked Hive
1.2.1 is stable" in any recent Hadoop/Hive version discussions (at least I
double-checked all my replies).

What I really care about is the stability and quality of "Hive integration
in Spark", which has gone through some major updates due to the recent
Hive 2.3 upgrade in Spark 3.0. We have already found bugs in this area, and
empirically, for a significant upgrade like this one, it is not surprising
that other bugs/regressions can be found in the near future. On the other
hand, the Hive 1.2 integration code path in Spark has been battle-tested
for years. Yes, there are issues, but people have learned how to get along
with these issues. And please don't forget that Spark 3.0 end-users
who really don't want to interact with this Hive 1.2 fork can always
use Hive 2.3 at their own risk.

True, "stable" is a rather vague criterion and hard to prove. But that
is exactly why we may want to be conservative and wait for some
time and see whether there are further signals suggesting that the Hive 2.3
integration in Spark 3.0 is *unstable*. After one or two Spark 3.x minor
releases, if we've fixed all the outstanding issues and no more significant
ones are showing up, we can declare that the Hive 2.3 integration in Spark
3.x is stable, and then we can consider removing reference to the Hive 1.2
fork. Does that make sense?

Cheng

On Wed, Nov 20, 2019 at 11:49 AM Felix Cheung 
wrote:

> Just to add - the hive 1.2 fork is definitely not more stable. We know of a
> few critical bug fixes that we cherry-picked into a fork of that fork to
> maintain ourselves.
>
>
> --
> *From:* Dongjoon Hyun 
> *Sent:* Wednesday, November 20, 2019 11:07:47 AM
> *To:* Sean Owen 
> *Cc:* dev 
> *Subject:* Re: The Myth: the forked Hive 1.2.1 is stabler than XXX
>
> Thanks. That will be a giant step forward, Sean!
>
> > I'd prefer making it the default in the POM for 3.0.
>
> Bests,
> Dongjoon.
>
> On Wed, Nov 20, 2019 at 11:02 AM Sean Owen  wrote:
>
> Yeah 'stable' is ambiguous. It's old and buggy, but at least it's the
> same old and buggy that's been there a while ("stable" in that sense).
> I'm sure there is a lot more delta between Hive 1 and 2 in terms of
> bug fixes that are important; the question isn't just 1.x releases.
>
> What I don't know is how much of that affects Spark, as it's mostly a
> Hive client. Clearly some of it does.
>
> I'd prefer making it the default in the POM for 3.0. Mostly on the
> grounds that its effects are on deployed clusters, not apps. And
> deployers can still choose a binary distro with 1.x or make the choice
> they want. Those that don't care should probably be nudged to 2.x.
> Spark 3.x is already full of behavior changes and 'unstable', so I
> think this is minor relative to the overall risk question.
>
> On Wed, Nov 20, 2019 at 12:53 PM Dongjoon Hyun 
> wrote:
> >
> > Hi, All.
> >
> > I'm sending this email because it's important to discuss this topic
> narrowly
> > and make a clear conclusion.
> >
> > `The forked Hive 1.2.1 is stable`? It sounds like a myth we created
> > by ignoring the existing bugs. If you want to say the forked Hive 1.2.1
> is
> > stabler than XXX, please give us the evidence. Then, we can fix it.
> > Otherwise, let's stop making `The forked Hive 1.2.1` invincible.
> >
> > Historically, the following forked Hive 1.2.1 has never been stable.
> > It's just frozen. Since the forked Hive is out of our control, we
> ignored bugs.
> > That's all. The reality is far from stable.
> >
> > https://mvnrepository.com/artifact/org.spark-project.hive/
> >
> https://mvnrepository.com/artifact/org.spark-project.hive/hive-exec/1.2.1.spark
> (2015 August)
> >
> https://mvnrepository.com/artifact/org.spark-project.hive/hive-exec/1.2.1.spark2
> (2016 April)
> >
> > First, let's begin with Hive itself by comparing it with Apache Hive 1.2.2 and
> 1.2.3,
> >
> > Apache Hive 1.2.2 has 50 bug fixes.
> > Apache Hive 1.2.3 has 9 bug fixes.
> >
> > I will not cover all of them, but Apache Hive community also backports
> > important patches like Apache Spark community.
> >
> > Second, let's move to SPARK issues because we aren't exposed to all Hive
> issues.
> >
> > SPARK-19109 ORC metadata section can sometimes exceed protobuf
> message size limit
> > SPARK-22267 Spark SQL incorrectly reads ORC file when column order
> is different
> >
> > These were reported since Apache Spark 1.6.x because the forked Hive
> doesn't have
> > a proper upstream patch like HIVE-11592 (fixed at Apache Hive 1.3.0).
> >
> > Since we couldn't update the frozen forked Hive, we added Apache ORC
> dependency
> > at SPARK-20682 (2.3.0), added a switching 

Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

2019-11-19 Thread Cheng Lian
Hmm, what exactly did you mean by "remove the usage of forked `hive` in
Apache Spark 3.0 completely officially"? I thought you wanted to remove the
forked Hive 1.2 dependencies completely, no? As long as we still keep the
Hive 1.2 in Spark 3.0, I'm fine with that. I personally don't have a
particular preference between using Hive 1.2 or 2.3 as the default Hive
version. After all, for end-users and providers who need a particular
version combination, they can always build Spark with proper profiles
themselves.

And thanks for clarifying the Hive 2.3.5 issue. I didn't notice that it's
due to the folder name.
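For reference, a quick way to inspect the property and the version-specific
source trees it selects (a sketch; the property and directory names are the
ones mentioned in this thread):

$ grep -rn 'hive.version.short' pom.xml sql/
$ ls -d sql/core/v* sql/hive-thriftserver/v*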

On Tue, Nov 19, 2019 at 11:15 AM Dongjoon Hyun 
wrote:

> BTW, `hive.version.short` is a directory name. We are using 2.3.6 only.
>
> For directory names, we use '1.2.1' and '2.3.5' because we just delayed
> renaming the directories until the 3.0.0 deadline to minimize the diff.
>
> We can replace it immediately if we want right now.
>
>
>
> On Tue, Nov 19, 2019 at 11:11 AM Dongjoon Hyun 
> wrote:
>
>> Hi, Cheng.
>>
>> This is irrelevant to JDK11 and Hadoop 3. I'm talking about JDK8 world.
>> If we consider them, it could be the followings.
>>
>> +------------+-----------------+--------------------+
>> |            | Hive 1.2.1 fork |  Apache Hive 2.3.6 |
>> +------------+-----------------+--------------------+
>> | Legitimate |        X        |          O         |
>> | JDK11      |        X        |          O         |
>> | Hadoop3    |        X        |          O         |
>> | Hadoop2    |        O        |          O         |
>> | Functions  |     Baseline    |        More        |
>> | Bug fixes  |     Baseline    |        More        |
>> +------------+-----------------+--------------------+
>>
>> To stabilize Spark's Hive 2.3 usage, we should use it ourselves
>> (including Jenkins/GitHubAction/AppVeyor).
>>
>> For me, AS-IS 3.0 is not enough for that. Following your advice,
>> to give more visibility to the whole community,
>>
>> 1. We need to give additional `hadoop-2.7 with Hive 2.3` pre-built
>> distribution
>> 2. We need to switch our default Hive usage to 2.3 in `master` for 3.1
>> after `branch-3.0` branch cut.
>>
>> I know that we have been reluctant to do (1) and (2) due to the burden.
>> But it's time to prepare. Without them, we are going to fall short
>> again and again.
>>
>> Bests,
>> Dongjoon.
>>
>>
>>
>>
>> On Tue, Nov 19, 2019 at 9:26 AM Cheng Lian  wrote:
>>
>>> Dongjoon, I'm with Hyukjin. There should be at least one Spark 3.x minor
>>> release to stabilize Hive 2.3 code paths before retiring the Hive 1.2
>>> fork. Even today, the Hive 2.3.6 version bundled in Spark 3.0 is still
>>> buggy in terms of JDK 11 support. (BTW, I just found that our root POM is
>>> referring both Hive 2.3.6 and 2.3.5 at the moment, see here
>>> <https://github.com/apache/spark/blob/6fb8b8606544f26dc2d9719a2d009eb5aea65ba2/pom.xml#L135>
>>> and here
>>> <https://github.com/apache/spark/blob/6fb8b8606544f26dc2d9719a2d009eb5aea65ba2/pom.xml#L2927>
>>> .)
>>>
>>> Again, I'm happy to get rid of ancient legacy dependencies like Hadoop
>>> 2.7 and the Hive 1.2 fork, but I do believe that we need a safety net for
>>> Spark 3.0. For preview releases, I'm afraid that their visibility is not
>>> good enough for covering such major upgrades.
>>>
>>> On Tue, Nov 19, 2019 at 8:39 AM Dongjoon Hyun 
>>> wrote:
>>>
>>>> Thank you for the feedback, Hyukjin and Sean.
>>>>
>>>> I proposed `preview-2` for that purpose, but I'm also +1 for doing that at
>>>> 3.1
>>>> if we can make a decision to eliminate the illegitimate Hive fork
>>>> reference
>>>> immediately after `branch-3.0` cut.
>>>>
>>>> Sean, I'm referencing Cheng Lian's email for the status of `hadoop-2.7`.
>>>>
>>>> -
>>>> https://lists.apache.org/thread.html/623dd9a6d4e951daeec985feffede12c7b419e03c2965018de7a72f1@%3Cdev.spark.apache.org%3E
>>>>
>>>> The way I see it, this is not a user problem. The Apache Spark
>>>> community hasn't tried to drop the illegitimate Hive fork yet.
>>>> We need to drop it ourselves because we created it and it's our fault.
>>>>
>>>> Bests,
>>>> Dongjoon.
>>>>
>>>>
>>>>
>>>> On Tue, Nov 19, 2019 at 5:06 AM Sean Owen  wrote:
>>>>
>>>>> Just to clarify, as even I have lost the details over time: ha

Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

2019-11-19 Thread Cheng Lian
It's kind of like a Scala version upgrade. Historically, we only remove
support for an older Scala version after the newer version has proven to be
stable for one or more Spark minor versions.

On Tue, Nov 19, 2019 at 2:07 PM Cheng Lian  wrote:

> Hmm, what exactly did you mean by "remove the usage of forked `hive` in
> Apache Spark 3.0 completely officially"? I thought you wanted to remove the
> forked Hive 1.2 dependencies completely, no? As long as we still keep the
> Hive 1.2 in Spark 3.0, I'm fine with that. I personally don't have a
> particular preference between using Hive 1.2 or 2.3 as the default Hive
> version. After all, for end-users and providers who need a particular
> version combination, they can always build Spark with proper profiles
> themselves.
>
> And thanks for clarifying the Hive 2.3.5 issue. I didn't notice that it's
> due to the folder name.
>
> On Tue, Nov 19, 2019 at 11:15 AM Dongjoon Hyun 
> wrote:
>
>> BTW, `hive.version.short` is a directory name. We are using 2.3.6 only.
>>
>> For directory names, we use '1.2.1' and '2.3.5' because we just delayed
>> renaming the directories until the 3.0.0 deadline to minimize the diff.
>>
>> We can replace it immediately if we want right now.
>>
>>
>>
>> On Tue, Nov 19, 2019 at 11:11 AM Dongjoon Hyun 
>> wrote:
>>
>>> Hi, Cheng.
>>>
>>> This is irrelevant to JDK11 and Hadoop 3. I'm talking about JDK8 world.
>>> If we consider them, it could be the followings.
>>>
>>> +------------+-----------------+--------------------+
>>> |            | Hive 1.2.1 fork |  Apache Hive 2.3.6 |
>>> +------------+-----------------+--------------------+
>>> | Legitimate |        X        |          O         |
>>> | JDK11      |        X        |          O         |
>>> | Hadoop3    |        X        |          O         |
>>> | Hadoop2    |        O        |          O         |
>>> | Functions  |     Baseline    |        More        |
>>> | Bug fixes  |     Baseline    |        More        |
>>> +------------+-----------------+--------------------+
>>>
>>> To stabilize Spark's Hive 2.3 usage, we should use it ourselves
>>> (including Jenkins/GitHubAction/AppVeyor).
>>>
>>> For me, AS-IS 3.0 is not enough for that. Following your advice,
>>> to give more visibility to the whole community,
>>>
>>> 1. We need to give additional `hadoop-2.7 with Hive 2.3` pre-built
>>> distribution
>>> 2. We need to switch our default Hive usage to 2.3 in `master` for 3.1
>>> after `branch-3.0` branch cut.
>>>
>>> I know that we have been reluctant to do (1) and (2) due to the burden.
>>> But it's time to prepare. Without them, we are going to fall short
>>> again and again.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>>
>>>
>>> On Tue, Nov 19, 2019 at 9:26 AM Cheng Lian 
>>> wrote:
>>>
>>>> Dongjoon, I'm with Hyukjin. There should be at least one Spark 3.x
>>>> minor release to stabilize Hive 2.3 code paths before retiring the Hive 1.2
>>>> fork. Even today, the Hive 2.3.6 version bundled in Spark 3.0 is still
>>>> buggy in terms of JDK 11 support. (BTW, I just found that our root POM is
>>>> referring both Hive 2.3.6 and 2.3.5 at the moment, see here
>>>> <https://github.com/apache/spark/blob/6fb8b8606544f26dc2d9719a2d009eb5aea65ba2/pom.xml#L135>
>>>> and here
>>>> <https://github.com/apache/spark/blob/6fb8b8606544f26dc2d9719a2d009eb5aea65ba2/pom.xml#L2927>
>>>> .)
>>>>
>>>> Again, I'm happy to get rid of ancient legacy dependencies like Hadoop
>>>> 2.7 and the Hive 1.2 fork, but I do believe that we need a safety net for
>>>> Spark 3.0. For preview releases, I'm afraid that their visibility is not
>>>> good enough for covering such major upgrades.
>>>>
>>>> On Tue, Nov 19, 2019 at 8:39 AM Dongjoon Hyun 
>>>> wrote:
>>>>
>>>>> Thank you for the feedback, Hyukjin and Sean.
>>>>>
>>>>> I proposed `preview-2` for that purpose, but I'm also +1 for doing that at
>>>>> 3.1
>>>>> if we can make a decision to eliminate the illegitimate Hive fork
>>>>> reference
>>>>> immediately after `branch-3.0` cut.
>>>>>
>>>>> Sean, I'm referencing Cheng Lian's email for the status of
>>>>> `hadoop-2.7`.
>>>>>
>>>>> -
>>>>&

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-19 Thread Cheng Lian
Hey Steve,

In terms of Maven artifact, I don't think the default Hadoop version
matters except for the spark-hadoop-cloud module, which is only meaningful
under the hadoop-3.2 profile. All  the other spark-* artifacts published to
Maven central are Hadoop-version-neutral.

Another issue about switching the default Hadoop version to 3.2 is PySpark
distribution. Right now, we only publish PySpark artifacts prebuilt with
Hadoop 2.x to PyPI. I'm not sure whether bumping the Hadoop dependency to
3.2 is feasible for PySpark users. Or maybe we should publish PySpark
prebuilt with both Hadoop 2.x and 3.x. I'm open to suggestions on this one.
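If we do go down the dual-publication route, here is one rough sketch of
producing a Hadoop 3.x flavored PySpark package with the existing tooling
(flags are illustrative; this is not an official release procedure):

$ ./dev/make-distribution.sh --name hadoop3.2 --pip --tgz \
    -Phadoop-3.2 -Phive -Phive-thriftserver -Pyarn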

Again, as long as Hive 2.3 and Hadoop 3.2 upgrade can be decoupled via the
proposed hive-2.3 profile, I personally don't have a preference over having
Hadoop 2.7 or 3.2 as the default Hadoop version. But just for minimizing
the release management work, in case we decided to publish other spark-*
Maven artifacts from a Hadoop 2.7 build, we can still special case
spark-hadoop-cloud and publish it using a hadoop-3.2 build.
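A rough sketch of what this decoupling could look like on the command line
(the `hive-2.3` profile is still a proposal at this point, so the profile
names below are assumptions):

$ # Regular spark-* artifacts from a Hadoop 2.7 build that still uses Hive 2.3:
$ ./build/mvn -Phadoop-2.7 -Phive-2.3 -Phive -Phive-thriftserver -DskipTests clean install
$ # spark-hadoop-cloud special-cased and built from a hadoop-3.2 build:
$ ./build/mvn -Phadoop-3.2 -Phadoop-cloud -pl hadoop-cloud -am -DskipTests clean install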

On Mon, Nov 18, 2019 at 8:39 PM Dongjoon Hyun 
wrote:

> I also agree with Steve and Felix.
>
> Let's have another thread to discuss Hive issue
>
> because this thread was originally for `hadoop` version.
>
> And, now we can have `hive-2.3` profile for both `hadoop-2.7` and
> `hadoop-3.0` versions.
>
> We don't need to mix both.
>
> Bests,
> Dongjoon.
>
>
> On Mon, Nov 18, 2019 at 8:19 PM Felix Cheung 
> wrote:
>
>> 1000% with Steve, the org.spark-project hive 1.2 will need a solution. It
>> is old and rather buggy; and it’s been *years*
>>
>> I think we should decouple hive change from everything else if people are
>> concerned?
>>
>> ------
>> *From:* Steve Loughran 
>> *Sent:* Sunday, November 17, 2019 9:22:09 AM
>> *To:* Cheng Lian 
>> *Cc:* Sean Owen ; Wenchen Fan ;
>> Dongjoon Hyun ; dev ;
>> Yuming Wang 
>> *Subject:* Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?
>>
>> Can I take this moment to remind everyone that the version of hive which
>> spark has historically bundled (the org.spark-project one) is an orphan
>> project put together to deal with Hive's shading issues and a source of
>> unhappiness in the Hive project. Whatever gets shipped should do its best
>> to avoid including that file.
>>
>> Postponing a switch to hadoop 3.x after spark 3.0 is probably the safest
>> move from a risk minimisation perspective. If something has broken then
>> you can start with the assumption that it is in the o.a.s packages
>> without having to debug o.a.hadoop and o.a.hive first. There is a cost: if
>> there are problems with the hadoop / hive dependencies those teams will
>> inevitably ignore filed bug reports, for the same reason the Spark team will
>> probably close 1.6-related JIRAs as WONTFIX. WONTFIX responses for the
>> Hadoop 2.x line include any compatibility issues with Java 9+. Do bear that
>> in mind. It's not been tested, it has dependencies on artifacts we know are
>> incompatible, and as far as the Hadoop project is concerned: people should
>> move to branch 3 if they want to run on a modern version of Java
>>
>> It would be really really good if the published spark maven artefacts (a)
>> included the spark-hadoop-cloud JAR and (b) were dependent upon hadoop 3.x.
>> That way people doing things with their own projects will get up-to-date
>> dependencies and don't get WONTFIX responses themselves.
>>
>> -Steve
>>
>> PS: Discussion on hadoop-dev @ making Hadoop 2.10 the official "last
>> ever" branch-2 release and then declaring its predecessors EOL; 2.10 will be
>> the transition release.
>>
>> On Sun, Nov 17, 2019 at 1:50 AM Cheng Lian  wrote:
>>
>> Dongjoon, I didn't follow the original Hive 2.3 discussion closely. I
>> thought the original proposal was to replace Hive 1.2 with Hive 2.3, which
>> seemed risky, and therefore we only introduced Hive 2.3 under the
>> hadoop-3.2 profile without removing Hive 1.2. But maybe I'm totally wrong
>> here...
>>
>> Sean, Yuming's PR https://github.com/apache/spark/pull/26533 showed that
>> Hadoop 2 + Hive 2 + JDK 11 looks promising. My major motivation is not
about demand, but risk control: coupling the Hive 2.3, Hadoop 3.2, and JDK 11
upgrades together looks too risky.
>>
>> On Sat, Nov 16, 2019 at 4:03 AM Sean Owen  wrote:
>>
>> I'd prefer simply not making Hadoop 3 the default until 3.1+, rather
>> than introducing yet another build combination. Does Hadoop 2 + Hive 2
>> work and is there demand for it?
>>
>> On Sat, Nov 16,