Re: Application Detail UI change

2015-12-21 Thread MegaLearn
How do you start the Spark daemon, directly? https://issues.apache.org/jira/browse/SPARK-11570 If that's the case, the solution is to start it via the script, but I didn't read the whole thing. In my little world (currently a 2-machine cluster, soon to move to 300) I have the same issue with 1.4.1, and I thought it

Re: TaskCompletionListener and Exceptions

2015-12-21 Thread Neelesh
I also created a JIRA for task failures https://issues.apache.org/jira/browse/SPARK-12452 On Mon, Dec 21, 2015 at 9:54 AM, Neelesh wrote: > I am leaning towards something like that. Things get interesting when > multiple different transformations and regrouping happen. At

Re: Writing output fails when spark.unsafe.offHeap is enabled

2015-12-21 Thread Mayuresh Kunjir
Not quite sure if the error is resolved. Upon further probing, the setting spark.memory.offHeap.enabled is not getting applied in this build. When I print its value from core/src/main/scala/org/apache/spark/memory/MemoryManager.scala, it returns false even though the webUI is indicating that it's been

Re: rdd only with one partition

2015-12-21 Thread Zhiliang Zhu
You may refer to my other letter with the title: [Beg for help] spark job with very low efficiency On Tuesday, December 22, 2015 1:49 AM, Ted Yu wrote: I am not familiar with your use case, is it possible to perform the randomized combination operation based

Re: Writing output fails when spark.unsafe.offHeap is enabled

2015-12-21 Thread Mayuresh Kunjir
Thanks Ted. That stack trace is from 1.5.1 build. I tried on the latest code as you suggested. Memory management seems to have changed quite a bit and this error has been fixed as well. :) Thanks for the help! Regards, ~Mayuresh On Mon, Dec 21, 2015 at 10:10 AM, Ted Yu

Re: spark-submit for dependent jars

2015-12-21 Thread Shixiong Zhu
Looks like you need to add a "driver" option to your code, such as sqlContext.read.format("jdbc").options( Map("url" -> "jdbc:oracle:thin:@:1521:xxx", "driver" -> "oracle.jdbc.driver.OracleDriver", "dbtable" -> "your_table_name")).load() Best Regards, Shixiong Zhu 2015-12-21
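A cleaned-up, hedged version of the snippet above for readability; the host, SID and table name are placeholders, not values from the thread:

    val df = sqlContext.read.format("jdbc").options(Map(
      "url"     -> "jdbc:oracle:thin:@<host>:1521:<sid>",   // placeholder host and SID
      "driver"  -> "oracle.jdbc.driver.OracleDriver",       // driver class must be on the classpath
      "dbtable" -> "your_table_name"                        // placeholder table name
    )).load()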

error writing to stdout

2015-12-21 Thread carlilek
My users use Spark 1.5.1 in standalone mode on an HPC cluster, with a smattering still using 1.4.0. I have been getting reports of errors like this: 15/12/21 15:40:33 ERROR FileAppender: Error writing stream to file /scratch/spark/work/app-20151221150645-/3/stdout java.io.IOException: Stream

Re: Application Detail UI change

2015-12-21 Thread Josh Rosen
In the script / environment which launches your Spark driver, try setting the SPARK_PUBLIC_DNS environment variable to point to a publicly-accessible hostname. See https://spark.apache.org/docs/latest/configuration.html#environment-variables for more details. This environment variable also

Re: Application Detail UI change

2015-12-21 Thread Carlile, Ken
I start the spark master with $SPARK_HOME/sbin/start-master.sh, but I use the following to start the workers: $SPARK_HOME/bin/spark-class org.apache.spark.deploy.worker.Worker spark://$MASTER:7077 see my blog for more details, although I need to update the posts based on what I’ve changed

Re: fishing for help!

2015-12-21 Thread Igor Berman
look for differences: package versions, cpu/network/memory differences, etc. On 21 December 2015 at 14:53, Eran Witkon wrote: > Hi, > I know it is a wide question but can you think of reasons why a pyspark > job which runs on from server 1 using user 1 will run faster then

Re: Spark with log4j

2015-12-21 Thread Igor Berman
I think the log4j.properties under the conf dir is the one relevant for the workers' JVMs, not the one that you pack within your jar. On 21 December 2015 at 14:07, Kalpesh Jadhav wrote: > Hi Ted, > > > > Thanks for your response, But it doesn’t solve my

Application Detail UI change

2015-12-21 Thread carlilek
I administer an HPC cluster that runs Spark clusters as jobs. We run Spark over the backend network (typically used for MPI), which is not accessible outside the cluster. Until we upgraded to 1.5.1 (from 1.3.1), this did not present a problem. Now the Application Detail UI link is returning the IP

Re: number limit of map for spark

2015-12-21 Thread Zhan Zhang
In what situation do you have such cases? If there is no shuffle, you can collapse all these functions into one, right? In the meantime, it is not recommended to collect all data to the driver. Thanks. Zhan Zhang On Dec 21, 2015, at 3:44 AM, Zhiliang Zhu
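A minimal sketch of what "collapse all these functions into one" can look like when there is no shuffle between the steps; the functions and the RDD are hypothetical stand-ins for the iteration steps in the original question:

    val steps: Seq[Int => Int] = Seq(_ + 1, _ * 2, _ - 3)    // hypothetical per-step functions
    val combined: Int => Int   = steps.reduce(_ andThen _)   // compose them into a single function
    val result = rdd.map(combined)                            // one map instead of hundreds, assuming rdd: RDD[Int]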

Re: number limit of map for spark

2015-12-21 Thread Zhiliang Zhu
What is the difference between repartition / collect and collapse ... Is collapse as costly as collect or repartition? Thanks in advance ~ On Tuesday, December 22, 2015 2:24 AM, Zhan Zhang wrote: In what situation, you have such cases? If there is no

Re: number limit of map for spark

2015-12-21 Thread Zhan Zhang
What I mean is to combine multiple map functions into one. I don't know exactly how your algorithm works. Does one iteration's result depend on the last iteration's? If so, how do they depend on each other? I think either you can optimize your implementation, or Spark is not the right one for your specific

Re: number limit of map for spark

2015-12-21 Thread Zhiliang Zhu
Dear Zhan, Thanks very much for your kind reply! You may refer to my other letter with the title: [Beg for help] spark job with very low efficiency I just need to apply Spark to mathematical optimization with a genetic algorithm, and theoretically the algorithm would iterate for lots of

Re: fishing for help!

2015-12-21 Thread Michal Klos
If you are running on Amazon, then it's always a crapshoot as well. M > On Dec 21, 2015, at 4:41 PM, Josh Rosen wrote: > > @Eran, are Server 1 and Server 2 both part of the same cluster / do they have > similar positions in the network topology w.r.t the Spark

Kafka Latency

2015-12-21 Thread Bryan
Hello. I am using spark 1.5.2 and the Kafka direct stream creation to load data from Kafka. We're processing around 200K messages/second in a cluster with the Kafka and Spark nodes collocated (same switch) without issue. However, when the Kafka broker is further away (even a couple of router

spark-submit is ignoring "--executor-cores"

2015-12-21 Thread Siva
Hi Everyone, Observing a strange problem while submitting a spark streaming job in yarn-cluster mode through spark-submit. All the executors are using only 1 Vcore irrespective of the value of the parameter --executor-cores. Are there any config parameters that override the --executor-cores value? Thanks,

Re: Spark with log4j

2015-12-21 Thread Siva
Hi Kalpesh, Just to add, you could use "yarn logs -applicationId " to see the aggregated logs once the application is finished. Thanks, Sivakumar Bhavanari. On Mon, Dec 21, 2015 at 3:56 PM, Zhan Zhang wrote: > Hi Kalpesh, > > If you are using spark on yarn, it may not work.

Re: Spark with log4j

2015-12-21 Thread Zhan Zhang
Hi Kalpesh, If you are using Spark on YARN, it may not work, because you write logs to files other than stdout/stderr, which YARN log aggregation may not pick up. As I understand, YARN only aggregates logs written to stdout/stderr, and the local cache will be deleted (within a configured timeframe). To check it, at

Re: fishing for help!

2015-12-21 Thread Josh Rosen
@Eran, are Server 1 and Server 2 both part of the same cluster / do they have similar positions in the network topology w.r.t the Spark executors? If Server 1 had fast network access to the executors but Server 2 was across a WAN then I'd expect the job to run slower from Server 2 due to the

Deployment and performance related queries for Spark and Cassandra

2015-12-21 Thread Ashish Gadkari
Hi, We have configured a total of *11 nodes*. Each node contains 8 cores and 32 GB RAM. *Technologies and their versions:* Apache Spark 1.5.2 and YARN: 6 nodes, DSE 4.7 [Cassandra 2.1.8 and Solr]: 5 nodes, HDFS (Hadoop version 2.7.1): 3 nodes. *Stack:* 3 separate nodes

Re: hive on spark

2015-12-21 Thread Akhil Das
Looks like a version mismatch; you need to investigate more and make sure the versions match. Thanks Best Regards On Sat, Dec 19, 2015 at 2:15 AM, Ophir Etzion wrote: > During spark-submit when running hive on spark I get: > > Exception in thread "main"

Re: TaskCompletionListener and Exceptions

2015-12-21 Thread Neelesh
I am leaning towards something like that. Things get interesting when multiple different transformations and regrouping happen. At the end of it all, when the "task" is done, we no longer are sure which kafka partition they came from, even when all transforms/ grouping happen local to the original

Re: Python 3.x support

2015-12-21 Thread MegaLearn
Just interpreting, if you follow your link through to https://github.com/apache/spark/pull/5173 they only say they are testing with 3.4 so I'd say it's a safe bet that only 3.4 is supported, from 1.5 forward. I hope they retire this saying at year's end, I will use it for the official last time

Re: Spark Streaming - Number of RDDs in Dstream

2015-12-21 Thread Saisai Shao
Yes, basically from the current implementation it should be. On Mon, Dec 21, 2015 at 6:39 PM, Arun Patel wrote: > So, Does that mean only one RDD is created by all receivers? > > > > On Sun, Dec 20, 2015 at 10:23 PM, Saisai Shao > wrote: > >>

Re: Spark Streaming - Number of RDDs in Dstream

2015-12-21 Thread Arun Patel
So, Does that mean only one RDD is created by all receivers? On Sun, Dec 20, 2015 at 10:23 PM, Saisai Shao wrote: > Normally there will be one RDD in each batch. > > You could refer to the implementation of DStream#getOrCompute. > > > On Mon, Dec 21, 2015 at 11:04 AM,

Re: spark-submit for dependent jars

2015-12-21 Thread Madabhattula Rajesh Kumar
Hi Jeff and Satish, I have modified the script and executed it. Please find the command below ./spark-submit --master local --class test.Main --jars /home/user/download/jar/ojdbc7.jar /home//test/target/spark16-0.0.1-SNAPSHOT.jar I'm still getting the same exception. Exception in thread "main"

configure spark for hive context

2015-12-21 Thread Divya Gehlot
Hi, I am trying to configure Spark for a hive context (please don't confuse this with Hive on Spark). I placed hive-site.xml in spark/CONF_DIR. Now when I run spark-shell I am getting the below error. Versions which I am using: *Hadoop 2.6.2 Spark 1.5.2 Hive 1.2.1 * Welcome to >

Re: pyspark streaming crashes

2015-12-21 Thread Antony Mayi
I noticed it might be related to longer GC pauses (1-2 sec); the crash usually occurs after such a pause. Could that be causing the python-java gateway to time out? On Sunday, 20 December 2015, 23:05, Antony Mayi wrote: Hi, can anyone please help me

Re: spark-submit for dependent jars

2015-12-21 Thread Jeff Zhang
Put /test/target/spark16-0.0.1-SNAPSHOT.jar as the last argument ./spark-submit --master local --class test.Main --jars /home/user/download/jar/ojdbc7.jar /test/target/spark16-0.0.1-SNAPSHOT.jar On Mon, Dec 21, 2015 at 9:15 PM, Madabhattula Rajesh Kumar < mrajaf...@gmail.com> wrote: > Hi, > >

Re: spark-submit for dependent jars

2015-12-21 Thread satish chandra j
Hi Rajesh, Could you please try giving your cmd as mentioned below: ./spark-submit --master local --class --jars Regards, Satish Chandra On Mon, Dec 21, 2015 at 6:45 PM, Madabhattula Rajesh Kumar < mrajaf...@gmail.com> wrote: > Hi, > > How to add dependent jars in spark-submit command. For

Re: spark-submit for dependent jars

2015-12-21 Thread Jeff Zhang
Please make sure this is the correct jdbc url: jdbc:oracle:thin:@:1521:xxx On Mon, Dec 21, 2015 at 9:54 PM, Madabhattula Rajesh Kumar < mrajaf...@gmail.com> wrote: > Hi Jeff and Satish, > > I have modified script and executed. Please find below command > > ./spark-submit --master local

Re: Writing output fails when spark.unsafe.offHeap is enabled

2015-12-21 Thread Mayuresh Kunjir
Any intuition on this? ~Mayuresh On Thu, Dec 17, 2015 at 8:04 PM, Mayuresh Kunjir wrote: > I am testing a simple Sort program written using Dataframe APIs. When I > enable spark.unsafe.offHeap, the output stage fails with a NPE. The > exception when run on spark-1.5.1

Re: trouble implementing Transformer and calling DataFrame.withColumn()

2015-12-21 Thread Jeff Zhang
In your case, I would suggest you extend UnaryTransformer, which is much easier. Yeah, I have to admit that there's no document about how to write a custom Transformer; I think we need to add that, since writing a custom Transformer is a very typical task in machine learning. On Tue, Dec 22,
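For anyone following along, a minimal sketch of what extending UnaryTransformer can look like; the class name and the upper-casing logic are illustrative, not from the thread:

    import org.apache.spark.ml.UnaryTransformer
    import org.apache.spark.ml.util.Identifiable
    import org.apache.spark.sql.types.{DataType, StringType}

    // Hypothetical example: a Transformer that upper-cases a String column.
    class UpperCaser(override val uid: String)
      extends UnaryTransformer[String, String, UpperCaser] {

      def this() = this(Identifiable.randomUID("upperCaser"))

      // Element-wise function applied to every value of the input column.
      override protected def createTransformFunc: String => String = _.toUpperCase

      // Data type of the generated output column.
      override protected def outputDataType: DataType = StringType
    }

It can then be dropped into a Pipeline like any built-in stage, e.g. new UpperCaser().setInputCol("text").setOutputCol("textUpper").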

Extract SSerr SStot from Linear Regression using ml package

2015-12-21 Thread Arunkumar Pillai
Hi I'm using Linear Regression from the ml package. I'm able to see SSerr, SStot and SSreg from val model = lr.fit(dat1) model.summary.metric But these metrics are not accessible. It would be good if we could get those values. Any suggestion -- Thanks and Regards Arun
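If the summary fields remain out of reach, one hedged workaround is to recompute the sums from the model's predictions; this assumes model = lr.fit(dat1) as above and that dat1 has a "label" column:

    import org.apache.spark.sql.functions._

    val predictions = model.transform(dat1)                        // adds a "prediction" column
    val meanLabel   = dat1.agg(avg("label")).first().getDouble(0)
    val ssErr = predictions
      .agg(sum(pow(col("label") - col("prediction"), 2.0))).first().getDouble(0)
    val ssTot = dat1
      .agg(sum(pow(col("label") - lit(meanLabel), 2.0))).first().getDouble(0)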

Re: error writing to stdout

2015-12-21 Thread Noorul Islam K M
carlilek writes: > My users use Spark 1.5.1 in standalone mode on an HPC cluster, with a > smattering still using 1.4.0 > > I have been getting reports of errors like this: > > 15/12/21 15:40:33 ERROR FileAppender: Error writing stream to file >

trouble implementing Transformer and calling DataFrame.withColumn()

2015-12-21 Thread Andy Davidson
I am trying to port the following python function to Java 8. I would like my java implementation to implement Transformer so I can use it in a pipeline. I am having a heck of a time trying to figure out how to create a Column variable I can pass to DataFrame.withColumn(). As far as I know

RE: Spark with log4j

2015-12-21 Thread Kalpesh Jadhav
Hi Siva, With this command it doesn't print the log.info messages that I have written in the application. Thanks, Kalpesh Jadhav From: Siva [mailto:sbhavan...@gmail.com] Sent: Tuesday, December 22, 2015 6:27 AM To: Zhan Zhang Cc: Kalpesh Jadhav; user@spark.apache.org Subject: Re: Spark

Re: spark-submit is ignoring "--executor-cores"

2015-12-21 Thread Saisai Shao
Hi Siva, How did you know that --executor-cores is ignored and where did you see that only 1 Vcore is allocated? Thanks Saisai On Tue, Dec 22, 2015 at 9:08 AM, Siva wrote: > Hi Everyone, > > Observing a strange problem while submitting spark streaming job in >

spark streaming updateStateByKey: does the state only support ClassTag types, and not other types such as List?

2015-12-21 Thread our...@cnsuning.com
Spark Streaming's updateStateByKey does not support an Array state type without a ClassTag? How can I solve this problem? def updateStateByKey[S: ClassTag]( updateFunc: (Seq[V], Option[S]) => Option[S] ): DStream[(K, S)] = ssc.withScope { updateStateByKey(updateFunc, defaultPartitioner()) } ClassTag not
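For what it's worth, a minimal sketch of a collection-typed state that compiles against the signature quoted above; the key/value types and the name pairDStream are hypothetical:

    // The state type Seq[Int] gets its ClassTag from the usual implicits.
    val updateFunc = (values: Seq[Int], state: Option[Seq[Int]]) =>
      Some(state.getOrElse(Seq.empty[Int]) ++ values)

    // pairDStream is assumed to be a DStream[(String, Int)]
    val stateStream = pairDStream.updateStateByKey[Seq[Int]](updateFunc)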

RE: Spark with log4j

2015-12-21 Thread Kalpesh Jadhav
Hi Zhan, Yes, I am running Spark on YARN. Is there any alternative you came across to get those logs into files? Thanks, Kalpesh Jadhav From: Zhan Zhang [mailto:zzh...@hortonworks.com] Sent: Tuesday, December 22, 2015 5:27 AM To: Kalpesh Jadhav Cc: user@spark.apache.org Subject: Re:

Re: spark-submit is ignoring "--executor-cores"

2015-12-21 Thread Saisai Shao
I guess you're using the DefaultResourceCalculator for the capacity scheduler; can you please check your capacity scheduler configuration? By default, this resource calculator only honors memory, so vcores will always show 1 no matter what value you set (but Spark internally

Re: How to keep long running spark-shell but avoid hitting Java Out of Memory Exception: PermGen Space

2015-12-21 Thread Jung
I was faced with the same problem too. As a result, this is not a Spark problem, but a Scala one. And I think it may not be a case of memory leak. Spark-shell basically implements the Scala REPL, which was originally designed as a short-lived application, not a 24/7 application. The Scala shell uses so many objects

Re: spark-submit is ignoring "--executor-cores"

2015-12-21 Thread Siva
Hi Saisai, The total Vcores available in the YARN applications web UI (running on 8088) before and after only vary with the number of executors + driver core. If I give 10 executors, I see only 11 vcores being used in the YARN application web UI. Thanks, Sivakumar Bhavanari. On Mon, Dec 21, 2015 at 5:21 PM,

Re: spark-submit is ignoring "--executor-cores"

2015-12-21 Thread Zhan Zhang
BTW: It is not only a YARN web UI issue. In the capacity scheduler, vcore is ignored. If you want YARN to honor vcore requests, you have to use the DominantResourceCalculator as Saisai suggested. Thanks. Zhan Zhang On Dec 21, 2015, at 5:30 PM, Saisai Shao

Re: ​Spark 1.6 - YARN Cluster Mode

2015-12-21 Thread Akhil Das
Try adding these properties: spark.driver.extraJavaOptions -Dhdp.version=2.3.2.0-2950 spark.yarn.am.extraJavaOptions -Dhdp.version=2.3.2.0-2950 ​There was a similar discussion with Spark 1.3.0 over here http://stackoverflow.com/questions/29470542/spark-1-3-0-running-pi-example-on-yarn-fails ​

How to implement statemachine functionality in apache-spark by python

2015-12-21 Thread Esa Heikkinen
Hi, I am a newbie with Apache Spark and I would like to know about, or find, good example Python code showing how to implement (finite) state machine functionality in Spark. I am trying to read many different log files to find certain events in a specific order. Is this possible or even impossible? Or is that only

Re: Error on using updateStateByKey

2015-12-21 Thread Akhil Das
You can do it like this: private static Function2<List<Long>, Optional<Long>, Optional<Long>> UPDATEFUNCTION = new Function2<List<Long>, Optional<Long>, Optional<Long>>() { @Override public Optional<Long> call(List<Long> nums, Optional<Long> current) throws Exception { long sum = current.or(0L);

Re: custom schema in spark throwing error

2015-12-21 Thread VISHNU SUBRAMANIAN
Try this val customSchema = StructType(Array( StructField("year", IntegerType, true), StructField("make", StringType, true), StructField("model", StringType, true) )) On Mon, Dec 21, 2015 at 8:26 AM, Divya Gehlot wrote: > >1. scala> import
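A short follow-up on how such a schema is typically applied when reading; this assumes the customSchema value defined above, an existing sqlContext, the spark-csv package (com.databricks:spark-csv) on the classpath, and a placeholder file path:

    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .schema(customSchema)          // the StructType defined above
      .load("/path/to/cars.csv")     // placeholder path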

Re: One task hangs and never finishes

2015-12-21 Thread Akhil Das
Pasting the relevant code might help to understand better what exactly you are doing. Thanks Best Regards On Thu, Dec 17, 2015 at 9:25 PM, Daniel Haviv < daniel.ha...@veracity-group.com> wrote: > Hi, > > I have an application running a set of transformations and finishes with > saveAsTextFile.

rdd only with one partition

2015-12-21 Thread Zhiliang Zhu
Dear All, For some RDD, when there is just one partition, the operations would only run serially and the RDD loses all the parallelism benefit of the Spark system ... Is it exactly like that? Thanks very much in advance! Zhiliang

Re: [Beg for help] spark job with very low efficiency

2015-12-21 Thread Zhiliang Zhu
Dear Sab, I appreciate your kind reply very much; it is very helpful. On Monday, December 21, 2015 8:49 PM, Sabarish Sasidharan wrote: collect() will bring everything to driver and is costly. Instead of using collect() + parallelize, you

Re: fishing for help!

2015-12-21 Thread Eran Witkon
I'll check it out. On Tue, 22 Dec 2015 at 00:30 Michal Klos wrote: > If you are running on Amazon, then it's always a crapshoot as well. > > M > > On Dec 21, 2015, at 4:41 PM, Josh Rosen wrote: > > @Eran, are Server 1 and Server 2 both part of

Fat jar can't find jdbc

2015-12-21 Thread David Yerrington
Hi Everyone, I'm building a prototype that fundamentally grabs data from a MySQL instance, crunches some numbers, and then moves it on down the pipeline. I've been using SBT with assembly tool to build a single jar for deployment. I've gone through the paces of stomping out many dependency

Re: number limit of map for spark

2015-12-21 Thread Zhiliang Zhu
Thanks a lot for Zhan's comment; it really offered much help. On Tuesday, December 22, 2015 5:11 AM, Zhan Zhang wrote: What I mean is to combine multiple map functions into one. Don’t know how exactly your algorithms works. Did your one iteration result

RE: Spark with log4j

2015-12-21 Thread Kalpesh Jadhav
Hi Ted, Thanks for your response, but it doesn't solve my issue. It still prints logs to the console only. Thanks, Kalpesh Jadhav. From: Ted Yu [mailto:yuzhih...@gmail.com] Sent: Friday, December 18, 2015 9:15 PM To: Kalpesh Jadhav Cc: user Subject: Re: Spark with log4j See this thread:

Re: [Beg for help] spark job with very low efficiency

2015-12-21 Thread Sabarish Sasidharan
collect() will bring everything to driver and is costly. Instead of using collect() + parallelize, you could use rdd1.checkpoint() along with a more efficient action like rdd1.count(). This you can do within the for loop. Hopefully you are using the Kryo serializer already. Regards Sab On Mon,
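A minimal sketch of the suggested pattern, with hypothetical names for the checkpoint directory, starting RDD and per-iteration step:

    sc.setCheckpointDir("hdfs:///tmp/checkpoints")    // hypothetical directory
    var current = initialRdd                           // hypothetical starting RDD
    for (i <- 1 to numIterations) {
      val next = current.map(step)                     // hypothetical iteration step
      next.checkpoint()                                // truncates the lineage
      next.count()                                     // cheap action that materializes the checkpoint
      current = next
    }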

fishing for help!

2015-12-21 Thread Eran Witkon
Hi, I know it is a wide question, but can you think of reasons why a pyspark job which runs from server 1 using user 1 will run faster than the same job when running on server 2 with user 1? Eran

RE: fishing for help!

2015-12-21 Thread David Newberger
Hi Eran, Based on the limited information the first things that come to my mind are Processor, RAM, and Disk speed. David Newberger QA Analyst WAND - The Future of Restaurant Technology (W) www.wandcorp.com (E)

number limit of map for spark

2015-12-21 Thread Zhiliang Zhu
Dear All, I need to iterate some job / RDD quite a lot of times, but I am stuck on the problem that Spark only accepts around 350 calls to map before it needs one action function; besides, dozens of actions will obviously increase the run time. Is there any proper way ... As tested, there

get parameters of spark-submit

2015-12-21 Thread Bonsen
1. I write my Scala class and package it (not hard-coding the hdfs files' paths, just using the paths from spark-submit's parameters). 2. Then, if I invoke it like this: ${SPARK_HOME}/bin/spark-submit \ --master \ \ hdfs:// \ hdfs:// \ what should I do to get the two hdfs files' paths in my Scala class's code (before

spark-submit for dependent jars

2015-12-21 Thread Madabhattula Rajesh Kumar
Hi, How do I add dependent jars in the spark-submit command, for example Oracle? Could you please help me resolve this issue? I have a standalone cluster, one master and one slave. I have used the below command; it is not working ./spark-submit --master local --class test.Main

Re: get parameters of spark-submit

2015-12-21 Thread Jeff Zhang
I don't understand your question. These parameters are passed to your program as args of the main function. On Mon, Dec 21, 2015 at 9:09 PM, Bonsen wrote: > 1.I code my scala class and pack.(not input the hdfs files' paths,just use > the paths from "spark-submit"'s parameters)
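In other words, a minimal sketch; the object name and variable names are illustrative:

    object Main {
      def main(args: Array[String]): Unit = {
        val inputPath  = args(0)   // first hdfs:// path passed on the spark-submit line
        val outputPath = args(1)   // second hdfs:// path passed on the spark-submit line
        // ... create the SparkContext and use the two paths ...
      }
    }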

Re: TaskCompletionListener and Exceptions

2015-12-21 Thread Cody Koeninger
Honestly it's a lot easier to deal with this using transactions. Someone else would have to speak to the possibility of getting task failures added to listener callbacks. On Sat, Dec 19, 2015 at 5:44 PM, Neelesh wrote: > Hi, > I'm trying to build automatic Kafka watermark
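For context, the usual direct-stream pattern for recovering the per-partition Kafka offsets that this thread discusses looks roughly like the sketch below (names are assumed; error handling omitted). Note it only identifies the originating topicpartition before any shuffle or regrouping, which is exactly the limitation Neelesh describes:

    import org.apache.spark.TaskContext
    import org.apache.spark.streaming.kafka.HasOffsetRanges

    directStream.foreachRDD { rdd =>
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd.foreachPartition { iter =>
        val range = offsetRanges(TaskContext.get.partitionId)
        // store the results together with range.topic / range.fromOffset / range.untilOffset
        // in one transaction, so a failed task can be replayed without double counting
      }
    }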

Re: Problem with Spark Standalone

2015-12-21 Thread MegaLearn
Perhaps put the master IP address in this line and try again? setMaster("spark://:7077"). Replace with the hostname, but the way our host files are set up I have to put the IP address there. -- View this message in context:

Re: Problem with Spark Standalone

2015-12-21 Thread luca_guerra
Hi MegaLearn! Thanks for the reply! It's a placeholder; in my real application I use the right master hostname. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Problem-with-Spark-Standalone-tp25750p25752.html Sent from the Apache Spark User List

Re: rdd only with one partition

2015-12-21 Thread Ted Yu
Have you tried the following method ? * Note: With shuffle = true, you can actually coalesce to a larger number * of partitions. This is useful if you have a small number of partitions, * say 100, potentially with a few partitions being abnormally large. Calling * coalesce(1000,
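A minimal sketch of the two options the quoted doc comment refers to, assuming rdd0 is the single-partition RDD from the original question; the partition counts are arbitrary:

    val widened  = rdd0.coalesce(100, shuffle = true)   // shuffle = true allows growing the partition count
    val widened2 = rdd0.repartition(100)                // convenience wrapper around coalesce(100, shuffle = true)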

argparse with pyspark

2015-12-21 Thread Roberto Pagliari
Is argparse compatible with pyspark? If so, how do I provide parameters from command line? It does not seem to work the usual way. Thank you,

Re: Writing output fails when spark.unsafe.offHeap is enabled

2015-12-21 Thread Ted Yu
w.r.t. at org.apache.spark.sql.execution.UnsafeExternalRowSorter$RowComparator.compare(UnsafeExternalRowSorter.java:202) I looked at UnsafeExternalRowSorter.java in 1.6.0 which only has 192 lines of code. Can you run with latest RC of 1.6.0 and paste the stack trace ? Thanks On Thu, Dec 17,

Problem with Spark Standalone

2015-12-21 Thread luca_guerra
Hi, I’m trying to submit a streaming application on my standalone Spark cluster, this is my code: import akka.actor.{Props, ActorSystem, Actor} import akka.http.scaladsl.Http import akka.http.scaladsl.model.HttpRequest import akka.http.scaladsl.model.Uri import akka.stream.ActorMaterializer

Difference in AUCs b/w Spark's GBT and sklearn's

2015-12-21 Thread Yahoo_SK
I tried GBDTs both with Python's sklearn and with Spark's local stand-alone MLlib implementation, with default settings, for a binary classification problem. I kept the numIterations and loss function the same in both cases. The features are all real valued and continuous. However, the AUC in

GMM with diagonal covariance matrix

2015-12-21 Thread Jaonary Rabarisoa
Hi all, Is it possible to learn a Gaussian mixture model with a diagonal covariance matrix using the GMM algorithm implemented in MLlib? It seems to be possible but I can't figure out how to do that. Cheers, Jao

Re: Kafka - streaming from multiple topics

2015-12-21 Thread Cody Koeninger
Spark streaming by default won't start the next batch until the current batch is completely done, even if only a few cores are still working. This is generally a good thing, otherwise you'd have weird ordering issues. Each topicpartition is separate. Unbalanced partitions can happen either

is Kafka Hard to configure? Does it have a high cost of ownership?

2015-12-21 Thread Andy Davidson
Hi I realize this is a little off topic. My project needs to install something like Kafka. The engineer working on that part of the system has been having a lot of trouble configuring a single node implementation. He has lost a lot of time and wants to switch to something else. Our team does not

Re: Kafka - streaming from multiple topics

2015-12-21 Thread Neelesh
Thanks Cody. My case is #2. Just wanted to confirm when you say different spark jobs, do you mean one spark-submit per topic, or just use different threads in the driver to submit the job? Thanks! On Mon, Dec 21, 2015 at 8:05 AM, Cody Koeninger wrote: > Spark streaming by

Re: rdd only with one partition

2015-12-21 Thread Zhiliang Zhu
Hi Ted, Thanks a lot for your kind reply. I need to convert this rdd0 into another rdd1; the rows of rdd1 are generated by a random combination operation over rdd0's rows. From that perspective, rdd0 would need one partition in order to operate randomly on all of its rows; however, it would also lose

Re: rdd only with one partition

2015-12-21 Thread Ted Yu
I am not familiar with your use case; is it possible to perform the randomized combination operation based on a subset of the rows in rdd0? That way you can increase the parallelism. Cheers On Mon, Dec 21, 2015 at 9:40 AM, Zhiliang Zhu wrote: > Hi Ted, > > Thanks a lot for
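One hedged way to read this suggestion (all names hypothetical): spread the rows over several partitions and do the random combination inside each partition, so the work runs in parallel; combinations(2) here is just a stand-in for whatever random pairing the original poster needs:

    import scala.util.Random

    val rdd1 = rdd0
      .repartition(numPartitions)                  // spread rows across the cluster
      .mapPartitions { rows =>
        val subset = Random.shuffle(rows.toVector)
        subset.combinations(2)                     // hypothetical pairwise combination of the local subset
      }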

Using IntelliJ for spark development

2015-12-21 Thread Eran Witkon
Any pointers on how to use IntelliJ for Spark development? Any way to use a Scala worksheet to run like spark-shell?

Re: Kafka - streaming from multiple topics

2015-12-21 Thread Cody Koeninger
Different spark-submit per topic. On Mon, Dec 21, 2015 at 11:36 AM, Neelesh wrote: > Thanks Cody. My case is #2. Just wanted to confirm when you say different > spark jobs, do you mean one spark-submit per topic, or just use different > threads in the driver to submit the

Re: is Kafka Hard to configure? Does it have a high cost of ownership?

2015-12-21 Thread Cody Koeninger
Compared to what alternatives? Honestly, if someone actually read the kafka docs, yet is still having trouble getting a single test node up and running, the problem is probably them. Kafka's docs are pretty good. On Mon, Dec 21, 2015 at 11:31 AM, Andy Davidson < a...@santacruzintegration.com>

Re: Problem with Spark Standalone

2015-12-21 Thread MegaLearn
Gotcha, then you are also replacing the cluster IP. Missed that. I would ask you to post the actual logfiles, not sure I'll be able to help but hopefully it gives more info that someone can work with :) -- View this message in context:

Re: Problem with Spark Standalone

2015-12-21 Thread MegaLearn
I was going off this, not sure if it gives you a clue: http://doc.akka.io/api/akka/2.4.0/index.html#akka.remote.transport.Transport$$InvalidAssociationException "Indicates that the association setup request is invalid, and it is impossible to recover (malformed IP address, hostname, etc.)." I