Accessing log for lost executors

2016-12-01 Thread Nisrina Luthfiyati
Hi all, I'm trying to troubleshoot an ExecutorLostFailure issue. In the Spark UI I noticed that the Executors tab only lists active executors; is there any way I can see the logs for dead executors so that I can find out why they were lost? I'm using Spark 1.5.2 on YARN 2.7.1. Thanks! Nisrina

Usage of -javaagent with spark.executor.extrajavaoptions configuration

2016-12-01 Thread Kanchan W
Hello, I am an Apache Spark newbie and have a question regarding the spark.executor.extraJavaOptions configuration property in Spark 2.0.2. I need to start a javaagent on the Spark executors in standalone mode, both from the interactive shell and via spark-submit. In order to do the same, I
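
(Sketch, not from the thread: a minimal PySpark example of one way to pass a -javaagent flag to executors through spark.executor.extraJavaOptions; the agent jar path and app name are placeholders.)

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("javaagent-example")                      # placeholder app name
             .config("spark.executor.extraJavaOptions",
                     "-javaagent:/path/to/agent.jar")           # placeholder agent path
             .getOrCreate())

The same setting can also be supplied on the command line, e.g. spark-submit --conf "spark.executor.extraJavaOptions=-javaagent:/path/to/agent.jar".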

Fwd: [Spark Dataset]: How to conduct co-partition join in the new Dataset API in Spark 2.0

2016-12-01 Thread w.zhaokang
Hi all, In the old Spark RDD API, key-value PairRDDs can be co-partitioned to avoid shuffles, giving high join performance. In the new Dataset API in Spark 2.0, is the high-performance, shuffle-free join via the co-partition mechanism still feasible? I have looked through the API doc but

[Spark Dataset]: How to conduct co-partition join in the new Dataset API in Spark 2.0

2016-12-01 Thread Dale Wang
Hi all, In the old Spark RDD API, key-value PairRDDs can be co-partitioned to avoid shuffles, giving high join performance. In the new Dataset API in Spark 2.0, is the high-performance, shuffle-free join via the co-partition mechanism still feasible? I have looked through the API doc but

Re: [GraphFrame, Pyspark] Weighted Edge in PageRank

2016-12-01 Thread Weiwei Zhang
Thanks Felix. Does anyone know when this feature will be rolled out in GraphFrame? Best Regards, Weiwei On Thu, Dec 1, 2016 at 5:22 PM, Felix Cheung wrote: > That's correct - currently GraphFrame does not compute PageRank with weighted edges.

Re: [GraphFrame, Pyspark] Weighted Edge in PageRank

2016-12-01 Thread Felix Cheung
That's correct - currently GraphFrame does not compute PageRank with weighted edges. From: Weiwei Zhang Sent: Thursday, December 1, 2016 2:41 PM Subject: [GraphFrame, Pyspark] Weighted Edge in PageRank To:

Re: Spark 2.x Pyspark Spark SQL createDataframe Error

2016-12-01 Thread Michal Šenkýř
Hello Vinayak, As I understand it, Spark creates a Derby metastore database in the current location, in the metastore_db subdirectory, whenever you first use an SQL context. This database cannot be shared by multiple instances. This should be controlled by the javax.jdo.option.ConnectionURL
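
(Sketch, not from the thread: a rough PySpark example of one way to point the embedded Derby metastore at an explicit per-application location, assuming the property can be passed through the spark.hadoop.* prefix; the path is a placeholder, and hive-site.xml is the more conventional place to set this.)

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("metastore-location-example")
             .config("spark.hadoop.javax.jdo.option.ConnectionURL",
                     "jdbc:derby:;databaseName=/tmp/app1_metastore_db;create=true")
             .enableHiveSupport()
             .getOrCreate())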

RE: How to Check Dstream is empty or not?

2016-12-01 Thread bryan.jeffrey
The stream is just a wrapper over batch operations. You can check whether each batch is empty with something like: val isEmpty = stream.transform(rdd => rdd.sparkContext.parallelize(Seq(rdd.isEmpty()))) This will give you a stream of Booleans indicating whether the given batches are empty. Bryan Jeffrey From: rockinf...@gmail.com Sent: Thursday,

unsubscribe

2016-12-01 Thread Patnaik, Vandana

Re: [structured streaming] How to remove outdated data when use Window Operations

2016-12-01 Thread Michael Armbrust
Yes ! On Thu, Dec 1, 2016 at 12:57 PM, ayan guha wrote: > Thanks TD. Will it be available in pyspark too? > On 1 Dec 2016 19:55, "Tathagata Das" wrote: > >> In

[GraphFrame, Pyspark] Weighted Edge in PageRank

2016-12-01 Thread Weiwei Zhang
Hi guys, I am trying to compute the PageRank for the locations in the following dummy dataframe:

    src  des  shared_gas_stations
    A    B    2
    A    C    10
    C    E    3
    D    E    12
    E    G    5
    ...

I have tried the
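
(Sketch, not from the thread: per the replies earlier in this digest, GraphFrames' PageRank ignores edge weights; below is a minimal sketch of the standard unweighted call, assuming the edge DataFrame is renamed to the src/dst columns GraphFrames expects.)

    from graphframes import GraphFrame

    edges = df.withColumnRenamed("des", "dst")                      # GraphFrames expects src/dst
    vertices = (edges.select("src").union(edges.select("dst"))
                .distinct().withColumnRenamed("src", "id"))
    g = GraphFrame(vertices, edges)
    ranks = g.pageRank(resetProbability=0.15, maxIter=10)           # shared_gas_stations is ignored
    ranks.vertices.show()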

Re: [structured streaming] How to remove outdated data when use Window Operations

2016-12-01 Thread ayan guha
Thanks TD. Will it be available in pyspark too? On 1 Dec 2016 19:55, "Tathagata Das" wrote: > In the meantime, if you are interested, you can read the design doc in the > corresponding JIRA - https://issues.apache.org/jira/browse/SPARK-18124 > > On Thu, Dec 1, 2016

quick question

2016-12-01 Thread kant kodali
Assume I am running a Spark client program in client mode and a Spark cluster in standalone mode. I want some clarification on the following: 1. Build a DAG 2. DAG Scheduler 3. Task Scheduler. I want to know which of the above parts are done by the SPARK CLIENT and which of the above parts are done by

unsubscribe

2016-12-01 Thread Vishal Soni

support vector regression in spark

2016-12-01 Thread roni
Hi All, I want to know how I can do support vector regression in Spark. Thanks, R

Re: Spark-shell doesn't see changes coming from Kafka topic

2016-12-01 Thread Tathagata Das
Can you confirm the following? 1. Are you sending new data to the Kafka topic AFTER starting the streaming query? Since you have specified `startingOffsets` as `latest`, data needs to be sent to the topic after the query starts for the query to receive it. 2. Are you able to read Kafka data using Kafka's
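
(Sketch, not from the thread: a minimal PySpark example of the setup being described, reading a Kafka 0.10 topic with startingOffsets = latest; the broker address and topic name are placeholders, and the spark-sql-kafka-0-10 package must be on the classpath.)

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("kafka-latest-example").getOrCreate()

    df = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker-host:9092")   # placeholder broker
          .option("subscribe", "events")                           # placeholder topic
          .option("startingOffsets", "latest")                     # only data published after the query starts
          .load())

    query = (df.selectExpr("CAST(value AS STRING)")
             .writeStream
             .format("console")
             .start())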

Re: Spark 2.0.2 , using DStreams in Spark Streaming . How do I create SQLContext? Please help

2016-12-01 Thread shyla deshpande
Used SparkSession; works now. Thanks. On Wed, Nov 30, 2016 at 11:02 PM, Deepak Sharma wrote: > In Spark 2.0, SparkSession was introduced, which you can use to query Hive as well. Just make sure you create the SparkSession with the enableHiveSupport() option. Thanks
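
(Sketch, not from the thread: a minimal PySpark example of creating a SparkSession with Hive support and querying Hive, as suggested above; the table name is a placeholder.)

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hive-query-example")
             .enableHiveSupport()          # needed so spark.sql() can see Hive tables
             .getOrCreate())

    spark.sql("SELECT * FROM some_hive_table LIMIT 10").show()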

Re: Spark 2.x Pyspark Spark SQL createDataframe Error

2016-12-01 Thread Vinayak Joshi5
This is the error received: 16/12/01 22:35:36 ERROR Schema: Failed initialising database. Unable to open a test connection to the given database. JDBC url = jdbc:derby:;databaseName=metastore_db;create=true, username = APP. Terminating connection pool (set lazyInit to true if you expect to

Spark 2.x Pyspark Spark SQL createDataframe Error

2016-12-01 Thread Vinayak Joshi5
With a local Spark instance built with Hive support (-Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver), the following script/sequence works in PySpark without any error against 1.6.x, but fails with 2.x: people = sc.parallelize(["Michael,30", "Andy,12", "Justin,19"])
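
(The script above is truncated; the following is a hedged sketch of the classic name/age example it appears to follow, with the continuation, column names, and the sqlContext variable as assumptions. Per the thread subject, the reported failure surfaces at the createDataFrame step, when the Derby metastore is first initialised.)

    from pyspark.sql import Row

    people = sc.parallelize(["Michael,30", "Andy,12", "Justin,19"])
    rows = people.map(lambda l: l.split(",")).map(lambda p: Row(name=p[0], age=int(p[1])))
    df = sqlContext.createDataFrame(rows)   # where the metastore error is reported on 2.x
    df.show()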

Unsubscribe

2016-12-01 Thread hardik nagda

RE: build models in parallel

2016-12-01 Thread Masood Krohy
You can use your groupId as a grid parameter and filter your dataset by this id in a pipeline stage before feeding it to the model. The following may help: http://spark.apache.org/docs/latest/ml-tuning.html
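
(Sketch, not the grid-parameter approach described above, but a simpler hedged illustration of the underlying idea: fit one model per group by filtering the DataFrame; the column names and the choice of LinearRegression are assumptions.)

    from pyspark.ml.regression import LinearRegression

    group_ids = [row.groupId for row in df.select("groupId").distinct().collect()]
    models = {}
    for gid in group_ids:
        subset = df.filter(df.groupId == gid)            # keep only this group's rows
        lr = LinearRegression(featuresCol="features", labelCol="label")
        models[gid] = lr.fit(subset)                     # one fitted model per groupId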

Spark-shell doesn't see changes coming from Kafka topic

2016-12-01 Thread Otávio Carvalho
Hello hivemind, I am trying to connect my Spark 2.0.2 cluster to an Apache Kafka 0.10 cluster via spark-shell. The connection works fine, but it is not able to receive the messages published to the topic. It doesn't throw any error, but it is not able to retrieve any message (I am sure that

newly added Executors couldn't fetch jar files from Master

2016-12-01 Thread Evgenii Morozov
Hi, I've had a working cluster with 20 workers for more than a couple of weeks. Everything was perfect. Today I added 4 more workers and none of them could fetch jar files from the master. The following means to me that the master is available to the worker, the worker is registered there, and it started

How to Check Dstream is empty or not?

2016-12-01 Thread rockinf...@gmail.com
I have integrated Flume with Spark using the Flume-style push-based approach. I need to check whether a DStream is empty. Please suggest how I can do that. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-Check-Dstream-is-empty-or-not-tp28151.html Sent
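
(Sketch, not from the thread: a minimal PySpark example of one common way to check emptiness per batch with foreachRDD; the processing function is a placeholder.)

    def handle_batch(rdd):
        if rdd.isEmpty():
            print("empty batch, nothing to do")
        else:
            process(rdd)        # placeholder for your per-batch processing

    dstream.foreachRDD(handle_batch)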

Re: java.lang.Exception: Could not compute split, block input-0-1480539568000 not found

2016-12-01 Thread Marco Mistroni
Kant, we need to narrow it down to reproducible code. You are using streaming; what is the content of your streamed data? If you provide that, I can run a streaming program that reads from a local dir and narrow down the problem. I have seen a similar error when doing something completely different.

Re: java.lang.Exception: Could not compute split, block input-0-1480539568000 not found

2016-12-01 Thread kant kodali
Sorry for the multiple emails; I just think more info is needed each time to address this problem. My Spark client program runs in client mode on a node that has 2 vCPUs and 8GB RAM (m4.large). I have 2 Spark worker nodes, each with 4 vCPUs and 16GB RAM (m3.xlarge for each Spark

Re: Spark Job not exited and shows running

2016-12-01 Thread Selvam Raman
Hi, I have run the job in cluster mode as well. The job is not ending; after some time the container just does nothing but still shows as running. In my code, every record is inserted into both Solr and Cassandra. When I ran it only for Solr, the job completed successfully. Still I did not test

Re: java.lang.Exception: Could not compute split, block input-0-1480539568000 not found

2016-12-01 Thread kant kodali
My batch interval is 1s, slide interval is 1s, and window interval is 1 minute. I am using a standalone cluster. I don't have any storage layer like HDFS, so I don't know what the connection between RDDs and blocks is (I know that for every batch one RDD is produced). What is a block in this context?
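
(Sketch, not from the thread: a minimal PySpark example mapping the intervals described above onto the streaming API, i.e. a 1 second batch interval with a 60 second window sliding every 1 second; the socket source is a placeholder.)

    from pyspark.streaming import StreamingContext

    ssc = StreamingContext(sc, batchDuration=1)                       # 1 s batches
    lines = ssc.socketTextStream("localhost", 9999)                   # placeholder source
    windowed = lines.window(windowDuration=60, slideDuration=1)       # 1 min window, 1 s slide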

Re: [structured streaming] How to remove outdated data when use Window Operations

2016-12-01 Thread Tathagata Das
In the meantime, if you are interested, you can read the design doc in the corresponding JIRA - https://issues.apache.org/jira/browse/SPARK-18124 On Thu, Dec 1, 2016 at 12:53 AM, Tathagata Das wrote: > That feature is coming in 2.1.0. We have added watermarking,

Re: [structured streaming] How to remove outdated data when use Window Operations

2016-12-01 Thread Tathagata Das
That feature is coming in 2.1.0. We have added watermarking, that will track the event time of the data and accordingly close old windows, output its corresponding aggregate and then drop its corresponding state. But in that case, you will have to use append mode, and aggregated data of a
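
(Sketch, not from the thread: a hedged PySpark illustration of the watermarking described above, using the Spark 2.1 withWatermark API; the column names, thresholds, and console sink are illustrative.)

    from pyspark.sql.functions import window

    windowed_counts = (events                                 # a streaming DataFrame with an event-time column
        .withWatermark("eventTime", "10 minutes")             # state for windows older than this is dropped
        .groupBy(window("eventTime", "5 minutes"))
        .count())

    query = (windowed_counts.writeStream
             .outputMode("append")                            # append mode, as noted above
             .format("console")
             .start())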

Re: java.lang.Exception: Could not compute split, block input-0-1480539568000 not found

2016-12-01 Thread kant kodali
I also use this super(StorageLevel.MEMORY_AND_DISK_2()); inside my receiver On Wed, Nov 30, 2016 at 10:44 PM, kant kodali wrote: > Here is another transformation that might cause the error but it has to be one of these two since I only have two transformations

RE: PySpark to remote cluster

2016-12-01 Thread Schaefers, Klaus
Hi, I moved my PySpark to 2.0.1 and now I can connect. However, I cannot execute any job. I always get a "16/12/01 09:37:07 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources" error. I