Fwd: LinearRegressionWithSGD accuracy

2015-01-16 Thread Robin East
Sent from my iPhone Begin forwarded message: From: Robin East robin.e...@xense.co.uk Date: 16 January 2015 11:35:23 GMT To: Joseph Bradley jos...@databricks.com Cc: Yana Kadiyska yana.kadiy...@gmail.com, Devl Devel devl.developm...@gmail.com Subject: Re: LinearRegressionWithSGD accuracy

Optimize encoding/decoding strings when using Parquet

2015-01-16 Thread Mick Davies
Hi, It seems that a reasonably large proportion of query time using Spark SQL is spent decoding Parquet Binary objects to produce Java Strings. Has anyone considered trying to optimize these conversions, since many are duplicated? Details are outlined in the conversation in the user

RDD order guarantees

2015-01-16 Thread Ewan Higgs
Hi all, Quick one: when reading files, is the order of partitions guaranteed to be preserved? I am finding some weird behaviour where I run sortByKey() on an RDD (which has 16 byte keys) and write it to disk. If I open a python shell and run the following: for part in range(29): print

Setting JVM options to Spark executors in Standalone mode

2015-01-16 Thread Michel Dufresne
Hi All, I'm trying to set some JVM options to the executor processes in a standalone cluster. Here's what I have in *spark-env.sh*: jmx_opt=-Dcom.sun.management.jmxremote jmx_opt=${jmx_opt} -Djava.net.preferIPv4Stack=true jmx_opt=${jmx_opt} -Dcom.sun.management.jmxremote.port=

Re: Setting JVM options to Spark executors in Standalone mode

2015-01-16 Thread Zhan Zhang
You can try to add it in conf/spark-defaults.conf # spark.executor.extraJavaOptions -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three" Thanks. Zhan Zhang On Jan 16, 2015, at 9:56 AM, Michel Dufresne sparkhealthanalyt...@gmail.com wrote: Hi All, I'm trying to set some JVM
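A minimal sketch of how the suggested spark-defaults.conf entry could be assembled from a list of JVM flags. The helper name spark_defaults_line is hypothetical and not part of Spark; only the property key spark.executor.extraJavaOptions comes from the thread. The sketch assumes each flag contains no whitespace (a value such as -Dnumbers="one two three" would need quoting that is not handled here).

```python
# Property key taken from the reply above; everything else is illustrative.
EXECUTOR_OPTS_KEY = "spark.executor.extraJavaOptions"


def spark_defaults_line(flags):
    """Build one spark-defaults.conf line from a list of JVM flags.

    Hypothetical helper: assumes no flag contains whitespace.
    """
    for flag in flags:
        if any(ch.isspace() for ch in flag):
            raise ValueError("flag contains whitespace: %r" % flag)
    # spark-defaults.conf separates key and value with whitespace.
    return EXECUTOR_OPTS_KEY + " " + " ".join(flags)


if __name__ == "__main__":
    print(spark_defaults_line([
        "-Dcom.sun.management.jmxremote",
        "-Djava.net.preferIPv4Stack=true",
    ]))
```

The same key can also be set programmatically on a SparkConf, which may suit an application that builds its own SparkContext.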

Re: Setting JVM options to Spark executors in Standalone mode

2015-01-16 Thread Marcelo Vanzin
On Fri, Jan 16, 2015 at 10:07 AM, Michel Dufresne sparkhealthanalyt...@gmail.com wrote: Thanks for your reply. I should have mentioned that spark-env.sh is the only option I found because: - I'm creating the SparkConf/SparkContext from a Play Application (therefore I'm not using

Re: RDD order guarantees

2015-01-16 Thread Reynold Xin
You are running on a local file system, right? HDFS orders the files based on names, but local file systems often don't. I think that explains the difference. We might be able to do a sort and order the partitions when we create an RDD to make this universal, though. On Fri, Jan 16, 2015 at 8:26 AM,
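A rough sketch of the "sort by name" idea on a local file system, outside Spark. It assumes the output directory holds zero-padded part files (part-00000, part-00001, ...) so that lexicographic order matches partition order; the function names are hypothetical and this is not how Spark itself lists files.

```python
import os


def ordered_part_files(directory):
    """Return paths of part files sorted by name.

    os.listdir() makes no ordering guarantee on a local file system,
    so the names are sorted explicitly.
    """
    names = [n for n in os.listdir(directory) if n.startswith("part-")]
    return [os.path.join(directory, n) for n in sorted(names)]


def read_in_partition_order(directory):
    """Concatenate the lines of all part files, in name order."""
    lines = []
    for path in ordered_part_files(directory):
        with open(path) as f:
            lines.extend(line.rstrip("\n") for line in f)
    return lines
```

Reading the parts this way keeps the output of a sorted RDD in sorted order regardless of how the file system enumerates the directory.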

Re: Optimize encoding/decoding strings when using Parquet

2015-01-16 Thread Michael Armbrust
+1 to adding such an optimization to parquet. The bytes are tagged specially as UTF8 in the parquet schema so it seems like it would be possible to add this. On Fri, Jan 16, 2015 at 8:17 AM, Mick Davies michael.belldav...@gmail.com wrote: Hi, It seems that a reasonably large proportion of
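To illustrate the kind of saving being discussed, here is a small sketch of caching decoded strings so that repeated byte values are decoded only once. It is an assumption-laden illustration in Python, not Parquet's or Spark SQL's actual code: the class name CachingUtf8Decoder is hypothetical, and the cache is unbounded, which a real implementation would likely need to limit.

```python
class CachingUtf8Decoder:
    """Decode UTF-8 byte strings, reusing results for repeated values."""

    def __init__(self):
        self._cache = {}        # raw bytes -> decoded str
        self.decode_calls = 0   # number of real decodes performed

    def decode(self, raw):
        cached = self._cache.get(raw)
        if cached is None:
            # First time this byte sequence is seen: decode and remember it.
            cached = raw.decode("utf-8")
            self._cache[raw] = cached
            self.decode_calls += 1
        return cached


if __name__ == "__main__":
    decoder = CachingUtf8Decoder()
    column = [b"spark", b"parquet", b"spark", b"spark"]
    print([decoder.decode(v) for v in column], decoder.decode_calls)
```

The benefit depends on how often values repeat; for a column of mostly distinct strings the lookup adds overhead without saving any decodes.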

Spectral clustering

2015-01-16 Thread Andrew Musselman
Hi, thinking of picking up this Jira ticket: https://issues.apache.org/jira/browse/SPARK-4259 Anyone done any work on this to date? Any thoughts on it before we go too far in? Thanks! Best Andrew

Re: Implementing TinkerPop on top of GraphX

2015-01-16 Thread Kushal Datta
Hi David, Yes, we are still headed in that direction. Please take a look at the repo I sent earlier. I think that's a good starting point. Thanks, -Kushal. On Thu, Jan 15, 2015 at 8:31 AM, David Robinson drobin1...@gmail.com wrote: I am new to Spark and GraphX, however, I use Tinkerpop

Re: Join implementation in SparkSQL

2015-01-16 Thread Alessandro Baretta
Reynold, The source file you are directing me to is a little too terse for me to understand what exactly is going on. Let me tell you what I'm trying to do and what problems I'm encountering, so that you might be able to better direct my investigation of the SparkSQL codebase. I am computing the

Re: Implementing TinkerPop on top of GraphX

2015-01-16 Thread Kyle Ellrott
Looking at https://github.com/kdatta/tinkerpop3/compare/graphx-gremlin I only see a maven build file. Do you have some source code some place else? I've worked on a Spark-based implementation ( https://github.com/kellrott/spark-gremlin ), but it's not done and I've been tied up on other projects.

Re: Spark SQL API changes and stabilization

2015-01-16 Thread Alessandro Baretta
Reynold, Your clarification is much appreciated. One issue though, that I would strongly encourage you to work on, is to make sure that the Scaladoc CAN be generated manually if needed (a "Use at your own risk" clause would be perfectly legitimate here). The reason I say this is that currently even

Re: Implementing TinkerPop on top of GraphX

2015-01-16 Thread Kushal Datta
The source code is under a new module named 'graphx'. Let me double check. On Fri, Jan 16, 2015 at 2:11 PM, Kyle Ellrott kellr...@soe.ucsc.edu wrote: Looking at https://github.com/kdatta/tinkerpop3/compare/graphx-gremlin I only see a maven build file. Do you have some source code some place

Re: Join implementation in SparkSQL

2015-01-16 Thread Yin Huai
Hi Alex, Can you attach the output of sql("explain extended your query").collect.foreach(println)? Thanks, Yin On Fri, Jan 16, 2015 at 1:54 PM, Alessandro Baretta alexbare...@gmail.com wrote: Reynold, The source file you are directing me to is a little too terse for me to understand what

Re: Implementing TinkerPop on top of GraphX

2015-01-16 Thread Kushal Datta
Code updated. Sorry, wrong branch uploaded before. On Fri, Jan 16, 2015 at 2:13 PM, Kushal Datta kushal.da...@gmail.com wrote: The source code is under a new module named 'graphx'. let me double check. On Fri, Jan 16, 2015 at 2:11 PM, Kyle Ellrott kellr...@soe.ucsc.edu wrote: Looking at

Re: RDD order guarantees

2015-01-16 Thread Ewan Higgs
Yes, I am running on a local file system. Is there a bug open for this? Mingyu Kim reported the problem last April: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-reads-partitions-in-a-wrong-order-td4818.html -Ewan On 01/16/2015 07:41 PM, Reynold Xin wrote: You are running on a

Re: DBSCAN for MLlib

2015-01-16 Thread Muhammad Ali A'råby
Please find my answers on the JIRA page. Muhammad-Ali On Thursday, January 15, 2015 3:25 AM, Xiangrui Meng men...@gmail.com wrote: Please find my comments on the JIRA page. -Xiangrui On Tue, Jan 13, 2015 at 1:49 PM, Muhammad Ali A'råby angelland...@yahoo.com.invalid wrote: I have to