Re: spark 1.6 foreachPartition only appears to be running on one executor

2016-03-11 Thread Jacek Laskowski
Hi, Could you share the code with foreachPartition? Jacek 11.03.2016 7:33 PM "Darin McBeath" napisał(a): > > > I can verify this by looking at the log file for the workers. > > Since I output logging statements in the object called by the > foreachPartition, I can see the

Re: udf StructField to JSON String

2016-03-11 Thread Jacek Laskowski
Hi Tristan, Mind sharing the relevant code? I'd like to learn the way you use Transformer to do so. Thanks! Jacek 11.03.2016 7:07 PM "Tristan Nixon" napisał(a): > I have a similar situation in an app of mine. I implemented a custom ML > Transformer that wraps the Jackson

Re: adding rows to a DataFrame

2016-03-11 Thread Jacek Laskowski
Just a guess...flatMap? Jacek 11.03.2016 7:46 PM "Stefan Panayotov" napisał(a): > Hi, > > I have a problem that requires me to go through the rows in a DataFrame > (or possibly through rows in a JSON file) and conditionally add rows > depending on a value in one of the

Re: Newbie question - Help with runtime error on augmentString

2016-03-11 Thread Vasu Parameswaran
Thanks Ted. I haven't explicitly specified Scala (I tried different versions in pom.xml as well). For what it is worth, this is what I get when I do a maven dependency tree. I wonder if the 2.11.2 coming from scala-reflect matters: [INFO] | | \- org.scala-lang:scalap:jar:2.11.0:compile

Re: Newbie question - Help with runtime error on augmentString

2016-03-11 Thread Vasu Parameswaran
Thanks Jacek. Pom is below (Currenlty set to 1.6.1 spark but I started out with 1.6.0 with the same problem). http://maven.apache.org/POM/4.0.0; xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance; xsi:schemaLocation="http://maven.apache.org/POM/4.0.0

Re: spark 1.6 foreachPartition only appears to be running on one executor

2016-03-11 Thread Darin McBeath
ok, some more information (and presumably a workaround). when I initial read in my file, I use the following code. JavaRDD keyFileRDD = sc.textFile(keyFile) Looking at the UI, this file has 2 partitions (both on the same executor). I then subsequently repartition this RDD (to 16)

Re: Spark ML - Scaling logistic regression for many features

2016-03-11 Thread Daniel Siegmann
Thanks for the pointer to those indexers, those are some good examples. A good way to go for the trainer and any scoring done in Spark. I will definitely have to deal with scoring in non-Spark systems though. I think I will need to scale up beyond what single-node liblinear can practically

hive.metastore.metadb.dir not working programmatically

2016-03-11 Thread harirajaram
Experts need your help, I'm using spark 1.4.1 and when set this hive.metastore.metadb.dir programmatically for a hivecontext i.e for local metastore i.e the default metastore_db for derby, the metastore_db is still getting creating in the same path as user.dir. Can you guys provide some insights

Re: Newbie question - Help with runtime error on augmentString

2016-03-11 Thread Ted Yu
Looks like Scala version mismatch. Are you using 2.11 everywhere ? On Fri, Mar 11, 2016 at 10:33 AM, vasu20 wrote: > Hi > > Any help appreciated on this. I am trying to write a Spark program using > IntelliJ. I get a run time error as soon as new SparkConf() is called from

adding rows to a DataFrame

2016-03-11 Thread Stefan Panayotov
Hi, I have a problem that requires me to go through the rows in a DataFrame (or possibly through rows in a JSON file) and conditionally add rows depending on a value in one of the columns in each existing row. So, for example if I have: +---+---+---+ | _1| _2| _3| +---+---+---+

Re: Newbie question - Help with runtime error on augmentString

2016-03-11 Thread Jacek Laskowski
Hi, Why do you use maven not sbt for Scala? Can you show the entire pom.xml and the command to execute the app? Jacek 11.03.2016 7:33 PM "vasu20" napisał(a): > Hi > > Any help appreciated on this. I am trying to write a Spark program using > IntelliJ. I get a run time

Re: Graphx

2016-03-11 Thread Khaled Ammar
This is an interesting discussion, I have had some success running GraphX on large graphs with more than a Billion edges using clusters of different size up to 64 machines. However, the performance goes down when I double the cluster size to reach 128 machines of r3.xlarge. Does any one have

Newbie question - Help with runtime error on augmentString

2016-03-11 Thread vasu20
Hi Any help appreciated on this. I am trying to write a Spark program using IntelliJ. I get a run time error as soon as new SparkConf() is called from main. Top few lines of the exception are pasted below. These are the following versions: Spark jar: spark-assembly-1.6.0-hadoop2.6.0.jar

Re: Python unit tests - Unable to ru it with Python 2.6 or 2.7

2016-03-11 Thread Josh Rosen
AFAIK we haven't actually broken 2.6 compatibility yet for PySpark itself, since Jenkins is still testing that configuration. I think the problem that you're seeing is that dev/run-tests / dev/run-tests-jenkins only work against Python 2.7+ right now. However, ./python/run-tests should be able to

Re: spark 1.6 foreachPartition only appears to be running on one executor

2016-03-11 Thread Darin McBeath
My driver code has the following: // Init S3 (workers) so we can read the assets partKeyFileRDD.foreachPartition(new SimpleStorageServiceInit(arg1, arg2, arg3)); // Get the assets. Create a key pair where the key is asset id and the value is the rec. JavaPairRDD seqFileRDD =

Spark property parameters priority

2016-03-11 Thread Cesar Flores
Right now I know of three different things to pass property parameters to the Spark Context. They are: - A) Inside a SparkConf object just before creating the Spark Context - B) During job submission (i.e. --conf spark.driver.memory.memory = 2g) - C) By using a specific property file

How to distribute dependent files (.so , jar ) across spark worker nodes

2016-03-11 Thread prateek arora
Hi I have multiple node cluster and my spark jobs depend on a native library (.so files) and some jar files. Can some one please explain what are the best ways to distribute dependent files across nodes? right now i copied dependent files in all nodes using chef tool . Regards Prateek --

Re: Python unit tests - Unable to ru it with Python 2.6 or 2.7

2016-03-11 Thread Gayathri Murali
Thanks Josh. I am able to run with Python 2.7 explicitly by specifying --python-executables=python2.7. By default it checks only for Python2.6. Thanks Gayathri On Fri, Mar 11, 2016 at 10:35 AM, Josh Rosen wrote: > AFAIK we haven't actually broken 2.6 compatibility

<    1   2