Re: EXT: Multiple cores/executors in Pyspark standalone mode

2017-03-24 Thread Li Jin
/start-master.sh script) and at least one worker node (can be started using the SPARK_HOME/sbin/start-slave.sh script). SparkConf should use the master node address (spark://host:port) to create the context. Thanks! Gangadhar. From: Li Jin <ice.xell...@gmail.com
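A minimal sketch of the setup the reply describes; the hostname and port below are placeholders, not values from the thread:

    # Start the standalone cluster (run on the master and worker machines):
    #   $SPARK_HOME/sbin/start-master.sh
    #   $SPARK_HOME/sbin/start-slave.sh spark://master-host:7077
    from pyspark import SparkConf, SparkContext

    # Point SparkConf at the standalone master URL (spark://host:port).
    conf = (SparkConf()
            .setMaster("spark://master-host:7077")
            .setAppName("standalone-example"))
    sc = SparkContext(conf=conf)
    print(sc.parallelize(range(100)).count())
    sc.stop()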

Multiple cores/executors in Pyspark standalone mode

2017-03-24 Thread Li Jin
Hi, I am wondering whether PySpark standalone (local) mode supports multiple cores/executors? Thanks, Li
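For reference, a sketch of one way to request multiple cores in local mode by using local[N] or local[*] in the master URL; the names here are illustrative, not from the thread:

    from pyspark import SparkConf, SparkContext

    # local[*] runs Spark in a single JVM using all available cores;
    # local[4] would cap it at four cores instead.
    conf = SparkConf().setMaster("local[*]").setAppName("local-multicore")
    sc = SparkContext(conf=conf)
    print(sc.defaultParallelism)  # reflects how many cores local mode will use
    sc.stop()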

Re: Spark join over sorted columns of dataset.

2017-03-12 Thread Li Jin
I am not an expert on this, but here is what I think: Catalyst maintains information on whether a plan node is ordered. If your dataframe is the result of an order by, Catalyst will skip the sorting when it does a sort merge join. If your dataframe is created from storage, for instance ParquetRelation,
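A rough way to check this behaviour, assuming Spark 2.x and placeholder dataframes, is to inspect the physical plan and see whether Sort nodes still appear under the SortMergeJoin:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sorted-join-check").getOrCreate()

    left = spark.range(0, 1000).withColumnRenamed("id", "key").orderBy("key")
    right = spark.range(0, 1000).withColumnRenamed("id", "key").orderBy("key")

    # If Catalyst knows the inputs are already ordered on the join key,
    # the extra Sort steps before the SortMergeJoin can be elided.
    left.join(right, "key").explain()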

Re: PySpark Serialization/Deserialization (Pickling) Overhead

2017-03-12 Thread Li Jin
Yeoul, I think one way you could microbenchmark pyspark serialization/deserialization would be to run a withColumn + a Python UDF that returns a constant and compare that with similar code in Scala. I am not sure if there is a way to measure just the serialization code, because the pyspark API only
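A hedged sketch of that microbenchmark: a Python UDF returning a constant, compared against a JVM-only expression as a rough stand-in for the Scala comparison mentioned in the message (sizes and names are illustrative):

    import time
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import IntegerType

    spark = SparkSession.builder.appName("pickling-overhead").getOrCreate()
    df = spark.range(0, 10000000)

    # Python UDF returning a constant: every row makes a pickled round trip
    # to a Python worker, so wall time is dominated by ser/deser overhead.
    const_udf = F.udf(lambda x: 1, IntegerType())
    start = time.time()
    df.withColumn("c", const_udf(df["id"])).agg(F.sum("c")).collect()
    print("python udf:", time.time() - start)

    # Built-in expression doing the same work: stays in the JVM,
    # no pickling round trip.
    start = time.time()
    df.withColumn("c", F.lit(1)).agg(F.sum("c")).collect()
    print("builtin expression:", time.time() - start)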

Time Series Functionality with Spark

2018-03-12 Thread Li Jin
Hi All, This is Li Jin. We (my fellow colleagues at Two Sigma and I) have been using Spark for time series analysis for the past two years, and it has been successful in scaling up our time series analysis. Recently, we started a conversation with Reynold about potential opportunities to collaborate