Re: Question on Spark code

2017-07-23 Thread tao zhan
Got it, thank you! On Mon, Jul 24, 2017 at 11:50 AM, Reynold Xin wrote: > This is a standard practice used for chaining, to support > > a.setStepSize(...) > .setRegParam(...) > > > On Sun, Jul 23, 2017 at 8:47 PM, tao zhan wrote: > >> Thank you for re

Re: Question on Spark code

2017-07-23 Thread tao zhan
l that > function, it will return the same type with the level that you called it. > > On Sun, Jul 23, 2017 at 8:20 PM Reynold Xin wrote: > >> It means the same object ("this") is returned. >> >> On Sun, Jul 23, 2017 at 8:16 PM, tao zhan wrote: >> >

Question on Spark code

2017-07-23 Thread tao zhan
Hello, I am new to Scala and Spark. What is the "this.type" in the set function for? https://github.com/apache/spark/blob/481f0792944d9a77f0fe8b5e2596da1d600b9d0a/mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala#L48 Thanks! Zhan
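The replies above explain that a setter typed as this.type returns the receiver itself, which is what makes chaining work across subclasses. A minimal self-contained sketch (Optimizer and MomentumOptimizer are invented names standing in for GradientDescent-style classes, not Spark's own):

```scala
// Sketch of the this.type pattern discussed in this thread.
class Optimizer {
  protected var stepSize: Double = 1.0
  protected var regParam: Double = 0.0

  // Returning this.type (rather than Optimizer) means that when a subclass
  // calls the setter, the result is statically typed as the subclass.
  def setStepSize(s: Double): this.type = { stepSize = s; this }
  def setRegParam(r: Double): this.type = { regParam = r; this }
}

class MomentumOptimizer extends Optimizer {
  protected var momentum: Double = 0.9
  def setMomentum(m: Double): this.type = { momentum = m; this }
}

object ThisTypeDemo extends App {
  // Had setStepSize returned Optimizer, the .setMomentum call below would
  // not compile: chaining would "lose" the subclass type.
  val opt = new MomentumOptimizer()
    .setStepSize(0.1)
    .setRegParam(0.01)
    .setMomentum(0.95)
  assert(opt eq opt.setStepSize(0.2)) // the same object is handed back
}
```

The `eq` check mirrors the point made above: the same object ("this") is returned, so each setter mutates and hands back the receiver.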

SparkPlan/Shuffle stage reuse with Dataset/DataFrame

2016-10-18 Thread Zhan Zhang
Hi Folks, We have some Dataset/DataFrame use cases that would benefit from reusing the SparkPlan and shuffle stage, for example the following. Because the query optimization and SparkPlan are generated by Catalyst at execution time, the underlying RDD lineage is regenerated for

Anyone knows the hive repo for spark-2.0?

2016-07-07 Thread Zhan Zhang
I saw that the pom file has the hive version as 1.2.1.spark2, but I cannot find the branch in https://github.com/pwendell/ Does anyone know where the repo is? Thanks. Zhan Zhang -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Anyone-knows-the-hive

Re: right outer joins on Datasets

2016-05-24 Thread Zhan
The first item as a whole should be null; please refer to the JIRA. Sent from my iPhone > On May 24, 2016, at 7:31 AM, Koert Kuipers wrote: > > got it, but i assume that's an internal implementation detail, and it should > show null not -1? > >> On Tue, May 24, 2016 a

Re: right outer joins on Datasets

2016-05-24 Thread Zhan Zhang
The reason for "-1" is that the default value for an Integer is -1 when the value is null: def defaultValue(jt: String): String = jt match { ... case JAVA_INT => "-1" ... }
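A self-contained sketch of why an unmatched right-outer-join row can surface as -1: when a nullable value lands in a Java primitive slot, the generated code fills the slot with a type-specific sentinel. Only the JAVA_INT case is quoted from Spark above; the other cases and constant names below are an abbreviated approximation, not Spark's actual code.

```scala
// Abbreviated approximation of the codegen defaultValue helper; the string
// constants stand in for the Java type names used during code generation.
object DefaultValues {
  val JAVA_INT = "int"
  val JAVA_LONG = "long"
  val JAVA_BOOLEAN = "boolean"

  // A Java primitive cannot hold null, so a sentinel literal is emitted
  // instead; a reader treating the field as non-nullable then sees -1.
  def defaultValue(jt: String): String = jt match {
    case JAVA_INT     => "-1"
    case JAVA_LONG    => "-1L"
    case JAVA_BOOLEAN => "false"
    case _            => "null"
  }
}
```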

Re: [GRAPHX] Graph Algorithms and Spark

2016-04-21 Thread Zhan Zhang
You can take a look at this blog from Databricks about GraphFrames: https://databricks.com/blog/2016/03/03/introducing-graphframes.html Thanks. Zhan Zhang On Apr 21, 2016, at 12:53 PM, Robin East <robin.e...@xense.co.uk> wrote: Hi Aside from LDA, which is implemented in

Re: RFC: Remove "HBaseTest" from examples?

2016-04-21 Thread Zhan Zhang
FYI: There are several pending patches for DataFrame support on top of HBase. Thanks. Zhan Zhang On Apr 20, 2016, at 2:43 AM, Saisai Shao <sai.sai.s...@gmail.com> wrote: +1, HBaseTest in Spark Example is quite old and obsolete, the HBase connector in HBase repo has evolved a l

Re: SparkSQL - Limit pushdown on BroadcastHashJoin

2016-04-18 Thread Zhan Zhang
Thanks Reynold. Not sure why doExecute is not invoked, since CollectLimit does not support wholeStage: case class CollectLimit(limit: Int, child: SparkPlan) extends UnaryNode { I will dig further into this. Zhan Zhang On Apr 18, 2016, at 10:36 PM, Reynold Xin mailto:r...@databricks.com

Re: SparkSQL - Limit pushdown on BroadcastHashJoin

2016-04-18 Thread Zhan Zhang
, SinglePartition, serializer)) shuffled.mapPartitionsInternal(_.take(limit)) } Thus, there is no way to avoid processing all data before the shuffle. I think that is the reason. Do I understand correctly? Thanks. Zhan Zhang On Apr 18, 2016, at 10:08 PM, Reynold Xin mailto:r...@databricks.com
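The shape of the quoted code can be illustrated with plain Scala collections: every row is moved into a single "partition" first, and only then is take(limit) applied, so the upstream stage is not short-circuited. collectLimit and the Seq-of-Seq partitions below are illustrative stand-ins, not Spark APIs.

```scala
// Plain-collections stand-in for CollectLimit: shuffle everything to one
// partition, then take(limit) there.
object LimitSketch {
  def collectLimit(partitions: Seq[Seq[Int]], limit: Int): Seq[Int] = {
    // Stand-in for the shuffle to SinglePartition: all upstream rows are
    // produced and moved before the limit is applied.
    val singlePartition = partitions.flatten
    // Analogue of shuffled.mapPartitionsInternal(_.take(limit)) on the
    // one remaining partition.
    singlePartition.take(limit)
  }
}
```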

Re: SparkSQL - Limit pushdown on BroadcastHashJoin

2016-04-18 Thread Zhan Zhang
From the physical plan, the limit is one level above the WholeStageCodegen; thus, I don’t think shouldStop would work here. To make it work, the limit has to be part of the WholeStageCodegen. Correct me if I am wrong. Thanks. Zhan Zhang On Apr 18, 2016, at 11:09 AM, Re

Re: more uniform exception handling?

2016-04-18 Thread Zhan Zhang
+1 Both of these would be very helpful in debugging. Thanks. Zhan Zhang On Apr 18, 2016, at 1:18 PM, Evan Chan wrote: > +1000. > > Especially if the UI can help correlate exceptions, and we can reduce > some exceptions. > > There are some exceptions which are in practice ve

Re: ORC file writing hangs in pyspark

2016-02-23 Thread Zhan Zhang
Hi James, You can try writing in another format, e.g., Parquet, to see whether it is an ORC-specific issue or a more generic one. Thanks. Zhan Zhang On Feb 23, 2016, at 6:05 AM, James Barney <jamesbarne...@gmail.com> wrote: I'm trying to write an ORC file after running t

Dr.appointment this afternoon and WFH tomorrow for another Dr. appointment (EOM)

2016-01-07 Thread Zhan Zhang
- To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org

Re: [Spark SQL] SQLContext getOrCreate incorrect behaviour

2015-12-21 Thread Zhan Zhang
. Thanks. Zhan Zhang Note that when sc is stopped, all resources are released (for example in yarn On Dec 20, 2015, at 2:59 PM, Jerry Lam wrote: > Hi Spark developers, > > I found that SQLContext.getOrCreate(sc: SparkContext) does not behave > correctly when a different spark context

Re: Multi-core support per task in Spark

2015-12-11 Thread Zhan Zhang
I noticed that it is configurable at the job level via spark.task.cpus. Is there any way to support it at the task level? Thanks. Zhan Zhang On Dec 11, 2015, at 10:46 AM, Zhan Zhang wrote: > Hi Folks, > > Is it possible to assign multiple cores per task, and how? Suppose we have some > scenario, i

Multi-core support per task in Spark

2015-12-11 Thread Zhan Zhang
make sense to add this feature. It may seem to make users worry about more configuration, but by default we can still do 1 core per task, and only advanced users need to be aware of this feature. Thanks. Zhan Zhang

Re: Proposal for SQL join optimization

2015-11-12 Thread Zhan Zhang
, and we can move the discussion there. Thanks. Zhan Zhang On Nov 11, 2015, at 6:16 PM, Xiao Li <gatorsm...@gmail.com> wrote: Hi, Zhan, That sounds really interesting! Please at me when you submit the PR. If possible, please also post the performance difference. Thanks, X

Proposal for SQL join optimization

2015-11-11 Thread Zhan Zhang
are eliminated. Without such manual tuning, the query will never finish if a and c are big. But we should not rely on such manual optimization. Please provide your input. If they are both valid, I will open JIRAs for each. Than

Re: Support for views/ virtual tables in SparkSQL

2015-11-09 Thread Zhan Zhang
I think you can rewrite those TPC-H queries without using views, for example with registerTempTable. Thanks. Zhan Zhang On Nov 9, 2015, at 9:34 PM, Sudhir Menon wrote: > Team: > > Do we plan to add support for views/ virtual tables in SparkSQL anytime soon? > Trying to run the TPC-H
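A sketch of the suggested rewrite, assuming a Spark 1.x SQLContext; the parquet path, table name, and column selection below are invented for illustration and are not actual TPC-H DDL. It needs a running Spark context, so it is an outline rather than a standalone program.

```scala
import org.apache.spark.sql.SQLContext

object ViewWorkaround {
  // Instead of CREATE VIEW, register a DataFrame as a temp table and point
  // the query at it. sqlContext is assumed to be provided by the caller.
  def replaceViewWithTempTable(sqlContext: SQLContext): Unit = {
    val lineitem = sqlContext.read.parquet("/path/to/lineitem")
    // Roughly what the view definition would have selected:
    lineitem.select("l_orderkey", "l_extendedprice", "l_discount")
      .registerTempTable("revenue")
    // Downstream TPC-H-style queries reference the temp table by name.
    sqlContext.sql(
      "SELECT l_orderkey, SUM(l_extendedprice * (1 - l_discount)) AS rev " +
      "FROM revenue GROUP BY l_orderkey").show()
  }
}
```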

Re: spark-shell 1.5 doesn't seem to work in local mode

2015-09-19 Thread Zhan Zhang
It does not matter whether you start Spark in local or another mode. If you have an hdfs-site.xml somewhere and your Spark configuration points to that config, you will read/write to HDFS. Thanks. Zhan Zhang From: Madhu Sent: Saturday, September 19

Re: Make off-heap store pluggable

2015-07-21 Thread Zhan Zhang
Hi Alexey, SPARK-6479 (https://issues.apache.org/jira/browse/SPARK-6479) is for the plugin API, and SPARK-6112 (https://issues.apache.org/jira/browse/SPARK-6112) is for the HDFS plugin. Thanks. Zhan Zhang On Jul 21, 2015, at 10:56 AM, Alexey Goncharuk mailto:alexey.goncha...@gmail

Re: Support for Hive 0.14 in secure mode on hadoop 2.6.0

2015-03-27 Thread Zhan Zhang
er in Spark SQL in 1.4" and "allows Spark SQL to connect to arbitrary Hive version" Thanks. Zhan Zhang On Mar 27, 2015, at 12:57 PM, Doug Balog wrote: > Is there a JIRA for this adaptation layer? It sounds like a better long term > solution. > > If anybody knows what is require t

Re: RDD.map does not allowed to preservesPartitioning?

2015-03-26 Thread Zhan Zhang
Thanks all for the quick response. Thanks. Zhan Zhang On Mar 26, 2015, at 3:14 PM, Patrick Wendell wrote: > I think we have a version of mapPartitions that allows you to tell > Spark the partitioning is preserved: > > https://github.com/apache/spark/blob/master/core/src/main/scal

Re: RDD.map does not allowed to preservesPartitioning?

2015-03-26 Thread Zhan Zhang
with keeping key part untouched. Then mapValues may not be able to do this. Changing the code to allow this is trivial, but I don’t know whether there is some special reason behind this. Thanks. Zhan Zhang On Mar 26, 2015, at 2:49 PM, Jonathan Coveney <jcove...@gmail.com> wro

RDD.map does not allowed to preservesPartitioning?

2015-03-26 Thread Zhan Zhang
at :27 [] | ShuffledRDD[2] at reduceByKey at :25 [] +-(8) MapPartitionsRDD[1] at map at :23 [] | ParallelCollectionRDD[0] at parallelize at :21 [] Thanks. Zhan Zhang

Re: Spark-thriftserver Issue

2015-03-24 Thread Zhan Zhang
You can try to set it in spark-env.sh. # - SPARK_LOG_DIR Where log files are stored. (Default: ${SPARK_HOME}/logs) # - SPARK_PID_DIR Where the pid file is stored. (Default: /tmp) Thanks. Zhan Zhang On Mar 24, 2015, at 12:10 PM, Anubhav Agarwal <anubha...@gmail.com>
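For example, a conf/spark-env.sh fragment along these lines (the directories chosen here are illustrative, not defaults):

```shell
# Illustrative spark-env.sh settings; pick directories that exist and are
# writable by the user running the Spark daemons.
export SPARK_LOG_DIR=/var/log/spark   # daemon log files (default: ${SPARK_HOME}/logs)
export SPARK_PID_DIR=/var/run/spark   # pid files (default: /tmp)
```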

Re: Review request for SPARK-6112:Provide OffHeap support through HDFS RAM_DISK

2015-03-23 Thread Zhan Zhang
Thanks Reynold. Agree with you to open another JIRA to unify the block storage API. I have uploaded the design doc to SPARK-6479 as well. Thanks. Zhan Zhang On Mar 23, 2015, at 4:03 PM, Reynold Xin <r...@databricks.com> wrote: I created a ticket to separate the API refactorin

Re: Spark-thriftserver Issue

2015-03-23 Thread Zhan Zhang
Probably the port is already used by others, e.g., hive. You can change the port similar to below ./sbin/start-thriftserver.sh --master yarn --executor-memory 512m --hiveconf hive.server2.thrift.port=10001 Thanks. Zhan Zhang On Mar 23, 2015, at 12:01 PM, Neil Dev mailto:neilk...@gmail.com

Review request for SPARK-6112:Provide OffHeap support through HDFS RAM_DISK

2015-03-23 Thread Zhan Zhang
Thanks. Zhan Zhang

Re: Welcoming three new committers

2015-02-03 Thread Zhan Zhang
Congratulations! On Feb 3, 2015, at 2:34 PM, Matei Zaharia wrote: > Hi all, > > The PMC recently voted to add three new committers: Cheng Lian, Joseph > Bradley and Sean Owen. All three have been major contributors to Spark in the > past year: Cheng on Spark SQL, Joseph on MLlib, and Sean on

Re: Setting JVM options to Spark executors in Standalone mode

2015-01-16 Thread Zhan Zhang
You can try to add it in conf/spark-defaults.conf # spark.executor.extraJavaOptions -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three" Thanks. Zhan Zhang On Jan 16, 2015, at 9:56 AM, Michel Dufresne wrote: > Hi All, > > I'm trying to set some JVM opti

Re: How spark and hive integrate in long term?

2014-11-22 Thread Zhan Zhang
some basic functions using hive-0.13 connecting to the hive-0.14 metastore, and it looks like they are compatible. Thanks. Zhan Zhang On Nov 22, 2014, at 7:14 AM, Cheng Lian wrote: > Should emphasize that this is still a quick and rough conclusion, will > investigate this in more detail

Re: How spark and hive integrate in long term?

2014-11-21 Thread Zhan Zhang
more features added, it would be great if users can take advantage of both. Currently, Spark SQL gives us such benefits partially, but I am wondering how to keep such integration in the long term. Thanks. Zhan Zhang On Nov 21, 2014, at 3:12 PM, Dean Wampler wrote: > I can't comment on plans f

How spark and hive integrate in long term?

2014-11-21 Thread Zhan Zhang
on hive, e.g., metastore, thriftserver, hcatalog may not be able to help much. Does anyone have any insight or ideas in mind? Thanks. Zhan Zhang

Re: HiveShim not found when building in Intellij

2014-10-28 Thread Zhan Zhang
-Phive enables hive-0.13.1, and "-Phive -Phive-0.12.0" enables hive-0.12.0. Note that the thrift-server is not supported yet with hive-0.13, but is expected to go upstream soon (SPARK-3720). Thanks. Zhan Zhang On Oct 28, 2014, at 9:09 PM, Stephen Boesch wrote: > Thanks Pat

RE: Working Formula for Hive 0.13?

2014-08-28 Thread Zhan Zhang
issue to SPARK-2706 soon. Thanks. Zhan Zhang

Re: spark.akka.frameSize stalls job in 1.1.0

2014-08-18 Thread Zhan Zhang
Not sure exactly how you use it. My understanding is that in Spark it is better to keep the driver's overhead as low as possible. Is it possible to broadcast the trie to executors, do the computation there, and then aggregate the counters (??) in the reduce phase? Thanks. Zhan Zhang On Aug 18

Re: spark.akka.frameSize stalls job in 1.1.0

2014-08-17 Thread Zhan Zhang
ne.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b) counts.saveAsTextFile("file") // anyway, you don't want to collect results to the master; instead, put them in a file. Thanks. Zhan Zhang On Aug 16, 2014, at 9:18 AM, Jerry Ye wrote: > The job ended up running
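For reference, a completed version of the truncated word-count snippet above, written as a sketch assuming a Spark 1.x standalone application; the HDFS paths are placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCountToFile {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("wordcount"))
    val counts = sc.textFile("hdfs:///input/text")      // placeholder path
      .flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey((a, b) => a + b)
    // Writing distributed output avoids pulling the whole result through
    // the driver, which is what runs into spark.akka.frameSize limits.
    counts.saveAsTextFile("hdfs:///output/counts")      // placeholder path
    sc.stop()
  }
}
```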

Re: Spark testsuite error for hive 0.13.

2014-08-12 Thread Zhan Zhang
Problem solved by a workaround with create database and use database.

Re: Spark testsuite error for hive 0.13.

2014-08-11 Thread Zhan Zhang
Thanks Sean, I changed both the API and the version because there are some incompatibilities with hive-0.13, and I can actually do some basic operations against a real Hive environment. But the test suite always complains with a "no default database" message. No clue yet.

Spark testsuite error for hive 0.13.

2014-08-11 Thread Zhan Zhang
I am trying to change Spark to support hive-0.13, but I always hit the following problem when running the tests. My feeling is that the test setup may need to change, but I don't know exactly how. Has anyone had a similar issue, or can anyone shed light on it? 13:50:53.331 ERROR org.apache.hadoop.hive.ql.Driver: FAILED:

Re: Working Formula for Hive 0.13?

2014-08-08 Thread Zhan Zhang
Attached the diff to PR SPARK-2706. I am currently working on this problem. If somebody else is also working on this, we can share the load.

Re: Working Formula for Hive 0.13?

2014-08-08 Thread Zhan Zhang
Sorry, forgot to upload the files. I have never posted before :) hive.diff

Re: Working Formula for Hive 0.13?

2014-08-08 Thread Zhan Zhang
Here is the patch. Please ignore the pom.xml related change, which is just for compilation purposes. I need to do further work on this one based on Wandou's previous work.

Re: Working Formula for Hive 0.13?

2014-08-08 Thread Zhan Zhang
I can compile with no error, but my patch also includes other stuff.

Re: Working Formula for Hive 0.13?

2014-08-08 Thread Zhan Zhang
The API change seems minor. I have changed it locally and it compiles, but it is not tested yet. The major problem is still how to solve the hive-exec jar dependency. I am willing to help on this issue. Is it better to stick to the same approach as hive-0.12 until hive-exec is cleaned up enough to switch back?

Re: Spark REPL question

2014-04-17 Thread Zhan Zhang
Clear to me now. Thanks.

Re: Spark REPL question

2014-04-17 Thread Zhan Zhang
Thanks a lot. By "spins up", do you mean using the same directory, specified by the following? /** Local directory to save .class files too */ val outputDir = { val tmp = System.getProperty("java.io.tmpdir") val rootDir = new SparkConf().get("spark.repl.classdir", tmp)

Spark REPL question

2014-04-17 Thread Zhan Zhang
Please help, I am new to both Spark and Scala. I am trying to figure out how Spark distributes tasks to workers in the REPL. I only found the place where the task is serialized and sent, and where workers deserialize and load the task by class name via ExecutorClassLoader. But I didn't find how the dri