Accessing log for lost executors

2016-12-01 Thread Nisrina Luthfiyati
Hi all,

I'm trying to troubleshoot an ExecutorLostFailure issue.
In the Spark UI I noticed that the Executors tab only lists active executors. Is
there any way that I can see the logs for dead executors so that I can find
out why they were lost?
I'm using Spark 1.5.2 on YARN 2.7.1.

Thanks!
Nisrina
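
For reference: with YARN log aggregation enabled, the stdout/stderr of dead
executors can usually still be retrieved once the application has finished, e.g.

yarn logs -applicationId <application id>

(the application id is shown in the YARN ResourceManager UI). While the
application is still running, the container logs of a lost executor normally
remain available through the NodeManager web UI or under the NodeManagers'
local log directories on the machines that ran those executors.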


Usage of -javaagent with spark.executor.extrajavaoptions configuration

2016-12-01 Thread Kanchan W
Hello,

I am an Apache Spark newbie and have a question regarding the
spark.executor.extrajavaoptions configuration property present in Spark
2.0.2.

I have a requirement to start a javaagent on the Spark executors in standalone
mode, from both the interactive shell and spark-submit. To do so, I
thought of using the "spark.executor.extrajavaoptions" configuration property
and passing it through the "properties-file" option.

Interestingly, when I pass the javaagent in "spark.driver.extrajavaoptions", it
works just fine, but I need to do it at the executor level, and somehow
spark.executor.extrajavaoptions=javaagent: does not work.

I am not sure why this is happening. Is it part of spark design itself? Can
you give pointers on why javaagent does not work with executor but works
just fine with driver? Any idea what I may be doing wrong here?

I really appreciate your help and support in this regard.

Thanks and Regards
Kanchan
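
For reference, two things commonly trip this up: the property name is
case-sensitive (spark.executor.extraJavaOptions), and the agent jar path must be
valid on every worker node, not only on the machine running the driver, which is
why the driver-side setting can appear to work while the executor-side one does
nothing. A hedged sketch of how it is usually passed, with /path/to/agent.jar
standing in for the real agent location:

spark-submit --conf "spark.executor.extraJavaOptions=-javaagent:/path/to/agent.jar" ...

or, equivalently, as a line in the file given to --properties-file:

spark.executor.extraJavaOptions  -javaagent:/path/to/agent.jar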



[Spark Dataset]: How to conduct co-partition join in the new Dataset API in Spark 2.0

2016-12-01 Thread Dale Wang
Hi all,

In the old Spark RDD API, key-value PairRDDs can be co-partitioned to avoid
shuffle thus bringing us high join performance.

In the new Dataset API in Spark 2.0, is the high-performance shuffle-free
join via the co-partition mechanism still feasible? I have looked through the
API doc but failed to find it. Will the Catalyst Optimizer handle the
co-partitioning in its query plan optimization process?

Thanks a lot if anyone can provide any clue on the problem :-)

Zhaokang(Dale) Wang
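
A hedged Scala sketch of the closest Dataset-side equivalent, bucketing: write
both sides bucketed by the join key and let Catalyst plan the join without an
Exchange (shuffle) on either side. Here dfA and dfB are hypothetical DataFrames,
and bucketBy currently requires saveAsTable rather than a path-based save:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

// Write both sides bucketed (and sorted) by the join key into the catalog.
dfA.write.bucketBy(16, "key").sortBy("key").saveAsTable("a_bucketed")
dfB.write.bucketBy(16, "key").sortBy("key").saveAsTable("b_bucketed")

// With matching bucketing on the join key, the join should not need a shuffle;
// check the physical plan for the absence of Exchange operators.
val joined = spark.table("a_bucketed").join(spark.table("b_bucketed"), "key")
joined.explain()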


Re: [GraphFrame, Pyspark] Weighted Edge in PageRank

2016-12-01 Thread Weiwei Zhang
Thanks Felix. Anyone know when this feature will be rolled out in
GraphFrame?

Best Regards,
Weiwei

On Thu, Dec 1, 2016 at 5:22 PM, Felix Cheung 
wrote:

> That's correct - currently GraphFrame does not compute PageRank with
> weighted edges.
>
>
> _
> From: Weiwei Zhang 
> Sent: Thursday, December 1, 2016 2:41 PM
> Subject: [GraphFrame, Pyspark] Weighted Edge in PageRank
> To: user 
>
>
>
> Hi guys,
>
> I am trying to compute the pagerank for the locations in the following
> dummy dataframe,
>
> src  des  shared_gas_stations
>  A   B   2
>  A   C  10
>  C   E   3
>  D   E  12
>  E   G   5
> ...
>
> I have tried the function *graphframe.pageRank(resetProbability=0.01,
> maxIter=20)* in GraphFrame but it seems like this function doesn't take
> weighted edges. Maybe I am not using it correctly. How can I pass the
> weighted edges to this function? Also I am not sure if this function works
> for the undirected graph.
>
>
> Thanks a lot!
>
> - Weiwei
>
>
>


Re: [GraphFrame, Pyspark] Weighted Edge in PageRank

2016-12-01 Thread Felix Cheung
That's correct - currently GraphFrame does not compute PageRank with weighted 
edges.


_
From: Weiwei Zhang >
Sent: Thursday, December 1, 2016 2:41 PM
Subject: [GraphFrame, Pyspark] Weighted Edge in PageRank
To: user >


Hi guys,

I am trying to compute the pagerank for the locations in the following dummy 
dataframe,

src  des  shared_gas_stations
 A   B   2
 A   C  10
 C   E   3
 D   E  12
 E   G   5
...

I have tried the function graphframe.pageRank(resetProbability=0.01, 
maxIter=20) in GraphFrame but it seems like this function doesn't take weighted 
edges. Maybe I am not using it correctly. How can I pass the weighted edges to 
this function? Also I am not sure if this function works for the undirected 
graph.


Thanks a lot!

- Weiwei




Re: Spark 2.x Pyspark Spark SQL createDataframe Error

2016-12-01 Thread Michal Šenkýř
Hello Vinayak,

As I understand it, Spark creates a Derby metastore database in the current
location, in the metastore_db subdirectory, whenever you first use an SQL
context. This database cannot be shared by multiple instances.
This should be controlled by the  javax.jdo.option.ConnectionURL property.
I can imagine that using another kind of metastore database, like an
in-memory or server-client db, would solve this specific problem. However,
I do not think it is advisable.
Is there a specific reason why you are creating a second SQL context? I
think it is meant to be created only once per application and passed around.
I also have no idea why the behavior changed between Spark 1.6 and Spark
2.0.

Michal Šenkýř
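
A hedged sketch of the reuse approach (Scala shown; the PySpark builder behaves
the same way): SparkSession.builder.getOrCreate() hands back the session that is
already running, together with its metastore connection, instead of booting a
second embedded Derby instance.

import org.apache.spark.sql.SparkSession

// Returns the session that already exists in this application, if any.
val spark = SparkSession.builder().getOrCreate()

val df = spark.createDataFrame(Seq(("Abcd", 40), ("Efgh", 14))).toDF("fname", "age")
df.first()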

On Thu, Dec 1, 2016, 18:33 Vinayak Joshi5  wrote:

> This is the error received:
>
>
> 16/12/01 22:35:36 ERROR Schema: Failed initialising database.
> Unable to open a test connection to the given database. JDBC url =
> jdbc:derby:;databaseName=metastore_db;create=true, username = APP.
> Terminating connection pool (set lazyInit to true if you expect to start
> your database after your app). Original Exception: --
> java.sql.SQLException: Failed to start database 'metastore_db' with class
> loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@4494053,
> see the next exception for details.
> at
> org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown
> Source)
> at
> org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown
> Source)
> at org.apache.derby.impl.jdbc.Util.seeNextException(Unknown Source)
> .
> .
> --
>
> org.datanucleus.exceptions.NucleusDataStoreException: Unable to open a
> test connection to the given database. JDBC url =
> jdbc:derby:;databaseName=metastore_db;create=true, username = APP.
> Terminating connection pool (set lazyInit to true if you expect to start
> your database after your app). Original Exception: --
> java.sql.SQLException: Failed to start database 'metastore_db' with class
> loader
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@519dabfd,
> see the next exception for details.
> at org.apache.derby.impl.jdb
> .
> .
> .
> NestedThrowables:
> java.sql.SQLException: Unable to open a test connection to the given
> database. JDBC url = jdbc:derby:;databaseName=metastore_db;create=true,
> username = APP. Terminating connection pool (set lazyInit to true if you
> expect to start your database after your app). Original Exception: --
> java.sql.SQLException: Failed to start database 'metastore_db' with class
> loader
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@519dabfd,
> see the next exception for details.
> at
> org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown
> Source)
> .
> .
> .
> Caused by: java.sql.SQLException: Unable to open a test connection to the
> given database. JDBC url =
> jdbc:derby:;databaseName=metastore_db;create=true, username = APP.
> Terminating connection pool (set lazyInit to true if you expect to start
> your database after your app). Original Exception: --
> java.sql.SQLException: Failed to start database 'metastore_db' with class
> loader
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@519dabfd,
> see the next exception for details.
> at
> org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown
> Source)
> at
> org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown
> Source)
> at org.apache.derby.impl.jdbc.Util.seeNextException(Unknown Source)
> at org.apache.derby.impl.jdbc.EmbedConnection.bootDatabase(Unknown
> Source)
> at org.apache.derby.impl.jdbc.EmbedConnection.(Unknown
> Source)
> .
> .
> .
> 16/12/01 22:48:09 ERROR Schema: Failed initialising database.
> Unable to open a test connection to the given database. JDBC url =
> jdbc:derby:;databaseName=metastore_db;create=true, username = APP.
> Terminating connection pool (set lazyInit to true if you expect to start
> your database after your app). Original Exception: --
> java.sql.SQLException: Failed to start database 'metastore_db' with class
> loader
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@519dabfd,
> see the next exception for details.
> at
> org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown
> Source)
> at
> org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown
> Source)
> at org.apache.derby.impl.jdbc.Util.seeNextException(Unknown Source)
> at org.apache.derby.impl.jdbc.EmbedConnection.bootDatabase(Unknown
> Source)
> at org.apache.derby.impl.jdbc.EmbedConnection.(Unknown
> Source)
> .
> .
> .
> Caused by: java.sql.SQLException: Failed to start database 'metastore_db'
> with class loader
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@519dabfd,
> see the next exception for details.
> at
> 

RE: How to Check Dstream is empty or not?

2016-12-01 Thread bryan.jeffrey
The stream is just a wrapper over batch operations. You can check if a batch is 
empty by simply doing:

val isEmpty = stream.transform(rdd => rdd.sparkContext.parallelize(Seq(rdd.isEmpty())))

This will give you a stream of Boolean values indicating whether the given batches are empty.
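
A related pattern, if the goal is simply to skip work on empty batches (a sketch;
`stream` is whatever DStream you are consuming):

stream.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    // process only non-empty batches here
  }
}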

Bryan Jeffrey


From: rockinf...@gmail.com
Sent: Thursday, December 1, 2016 7:28 AM
To: user@spark.apache.org
Subject: How to Check Dstream is empty or not?

I have integrated Flume with Spark using the Flume-style push-based approach.
I need to check whether the DStream is empty. Please suggest how I can do that.







unsubscribe

2016-12-01 Thread Patnaik, Vandana



Re: [structured streaming] How to remove outdated data when use Window Operations

2016-12-01 Thread Michael Armbrust
Yes!

On Thu, Dec 1, 2016 at 12:57 PM, ayan guha  wrote:

> Thanks TD. Will it be available in pyspark too?
> On 1 Dec 2016 19:55, "Tathagata Das"  wrote:
>
>> In the meantime, if you are interested, you can read the design doc in
>> the corresponding JIRA - https://issues.apache.org/ji
>> ra/browse/SPARK-18124
>>
>> On Thu, Dec 1, 2016 at 12:53 AM, Tathagata Das <
>> tathagata.das1...@gmail.com> wrote:
>>
>>> That feature is coming in 2.1.0. We have added watermarking, that will
>>> track the event time of the data and accordingly close old windows, output
>>> its corresponding aggregate and then drop its corresponding state. But in
>>> that case, you will have to use append mode, and aggregated data of a
>>> particular window will be evicted only when the window is closed. You will
>>> be able to control the threshold on how long to wait for late, out-of-order
>>> data before closing a window.
>>>
>>> We will be updating the docs soon to explain this.
>>>
>>> On Tue, Nov 29, 2016 at 8:30 PM, Xinyu Zhang  wrote:
>>>
 Hi

 I want to use window operations. However, if i don't remove any data,
 the "complete" table will become larger and larger as time goes on. So I
 want to remove some outdated data in the complete table that I would never
 use.
 Is there any method to meet my requirement?

 Thanks!





>>>
>>>
>>


[GraphFrame, Pyspark] Weighted Edge in PageRank

2016-12-01 Thread Weiwei Zhang
Hi guys,

I am trying to compute the pagerank for the locations in the following
dummy dataframe,

src  des  shared_gas_stations
 A   B   2
 A   C  10
 C   E   3
 D   E  12
 E   G   5
...

I have tried the function graphframe.pageRank(resetProbability=0.01,
maxIter=20) in GraphFrame but it seems like this function doesn't take
weighted edges. Maybe I am not using it correctly. How can I pass the
weighted edges to this function? Also I am not sure if this function works
for the undirected graph.


Thanks a lot!

- Weiwei
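
One workaround that is sometimes suggested (shown as a Scala sketch; the PySpark
code is analogous): replicate each edge `weight` times before building the graph,
so that plain PageRank effectively distributes rank in proportion to the integer
weights. This only makes sense for small integer weights, and GraphFrames still
treats the edges as directed; `vertices` and `edges` below are assumed to be the
DataFrames behind the dummy table above, with a SparkSession named `spark`.

import org.graphframes.GraphFrame
import spark.implicits._

// Expand (src, des, shared_gas_stations) into `shared_gas_stations` copies of (src, dst).
val expandedEdges = edges
  .as[(String, String, Int)]
  .flatMap { case (src, dst, w) => Seq.fill(w)((src, dst)) }
  .toDF("src", "dst")

val g = GraphFrame(vertices, expandedEdges)
val ranks = g.pageRank.resetProbability(0.15).maxIter(20).run()
ranks.vertices.select("id", "pagerank").show()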


Re: [structured streaming] How to remove outdated data when use Window Operations

2016-12-01 Thread ayan guha
Thanks TD. Will it be available in pyspark too?
On 1 Dec 2016 19:55, "Tathagata Das"  wrote:

> In the meantime, if you are interested, you can read the design doc in the
> corresponding JIRA - https://issues.apache.org/jira/browse/SPARK-18124
>
> On Thu, Dec 1, 2016 at 12:53 AM, Tathagata Das <
> tathagata.das1...@gmail.com> wrote:
>
>> That feature is coming in 2.1.0. We have added watermarking, that will
>> track the event time of the data and accordingly close old windows, output
>> its corresponding aggregate and then drop its corresponding state. But in
>> that case, you will have to use append mode, and aggregated data of a
>> particular window will be evicted only when the window is closed. You will
>> be able to control the threshold on how long to wait for late, out-of-order
>> data before closing a window.
>>
>> We will be updating the docs soon to explain this.
>>
>> On Tue, Nov 29, 2016 at 8:30 PM, Xinyu Zhang  wrote:
>>
>>> Hi
>>>
>>> I want to use window operations. However, if i don't remove any data,
>>> the "complete" table will become larger and larger as time goes on. So I
>>> want to remove some outdated data in the complete table that I would never
>>> use.
>>> Is there any method to meet my requirement?
>>>
>>> Thanks!
>>>
>>>
>>>
>>>
>>>
>>
>>
>


quick question

2016-12-01 Thread kant kodali
Assume I am running a Spark Client Program in client mode and a Spark Cluster
in standalone mode.

I want some clarification of the following things

1. Build a DAG
2. DAG Scheduler
3. TASK Scheduler

I want to know which of the above parts are done by the SPARK CLIENT and which
are done by the SPARK MASTER in the standalone case?

Building the DAG clearly happens in the Spark Client Program.
The DAG Scheduler is also in the Spark Client Program.
The Task Scheduler is done by the SPARK MASTER.

Is this correct? Also, does the Spark Client ever instruct Spark Workers
directly on what transformations to run, or is the communication
unidirectional in the sense that Spark Workers communicate to the Spark Client
only when returning the results?

thanks!


unsubscribe

2016-12-01 Thread Vishal Soni



support vector regression in spark

2016-12-01 Thread roni
Hi All,
 I want to know how I can do support vector regression in Spark.
 Thanks
R


Re: Spark-shell doesn't see changes coming from Kafka topic

2016-12-01 Thread Tathagata Das
Can you confirm the following?
1. Are you sending new data to the Kafka topic AFTER starting the streaming
query? Since you have specified `startingOffsets` as `latest`, data needs to
arrive in the topic after the query starts for the query to receive it.
2. Are you able to read Kafka data using Kafka's console consumer, from the
same machine running the query (see the example command below)? That would
clear up any confusion regarding connectivity.

If the above are cleared, I would look at INFO- and DEBUG-level log4j logs
to see what the query is doing: is it stuck at some point, or is it
continuously running but not finding the latest offsets?
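
For point 2, a hedged example of the console-consumer check (the exact script
path depends on the Kafka installation), run from the machine where the query runs:

bin/kafka-console-consumer.sh --bootstrap-server ec2-54-208-12-171.compute-1.amazonaws.com:9092 --topic clickstream

If this also prints nothing while the producer is active, the problem is on the
connectivity/producer side rather than in Spark.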


On Thu, Dec 1, 2016 at 6:31 AM, Otávio Carvalho  wrote:

> Hello hivemind,
>
> I am trying to connect my Spark 2.0.2 cluster to an Apache Kafka 0.10
> cluster via spark-shell.
>
> The connection works fine, but it is not able to receive the messages
> published to the topic.
>
> It doesn't throw any error, but it is not able to retrieve any message (I
> am sure that messages are being published 'cause I am able to read from the
> topic from the same machine)
>
> Here follows the spark-shell code/output:
>
> *val ds1 = spark.readStream*
> *.format("kafka")*
> *.option("subscribe", "clickstream")*
> *.option("kafka.bootstrap.servers",
> "ec2-54-208-12-171.compute-1.amazonaws.com:9092
> ")*
> *.option("startingOffsets", "latest")*
> *.load*
>
> *// Exiting paste mode, now interpreting.*
>
> *ds1: org.apache.spark.sql.DataFrame = [key: binary, value: binary ... 5
> more fields]*
>
> *scala> val counter = ds1.groupBy("value").count*
> *counter: org.apache.spark.sql.DataFrame = [value: binary, count: bigint]*
>
> *scala> import org.apache.spark.sql.streaming.OutputMode.Complete*
> *import org.apache.spark.sql.streaming.OutputMode.Complete*
>
> *val query = counter.writeStream*
> *  .outputMode(Complete)*
> *  .format("console")*
> *  .start*
>
> *// Exiting paste mode, now interpreting.*
>
> *query: org.apache.spark.sql.streaming.StreamingQuery = Streaming Query -
> query-1 [state = ACTIVE]*
>
> *scala> query.status*
> *res0: org.apache.spark.sql.streaming.StreamingQueryStatus =*
> *Status of query 'query-1'*
> *Query id: 1*
> *Status timestamp: 1480602056895*
> *Input rate: 0.0 rows/sec*
> *Processing rate 0.0 rows/sec*
> *Latency: - ms*
> *Trigger details:*
> *isTriggerActive: true*
> *statusMessage: Finding new data from sources*
> *timestamp.triggerStart: 1480602056894*
> *triggerId: -1*
> *Source statuses [1 source]:*
> *Source 1 - KafkaSource[Subscribe[clickstream]]*
> *Available offset: -*
> *Input rate: 0.0 rows/sec*
> *Processing rate: 0.0 rows/sec*
> *Trigger details:*
> *triggerId: -1*
> *Sink status -
> org.apache.spark.sql.execution.streaming.ConsoleSink@54d5b6cb*
> *Committed offsets: [-]*
>
> I am starting the spark-shell as follows:
> /root/spark/bin/spark-shell --packages org.apache.spark:spark-sql-
> kafka-0-10_2.10:2.0.2
>
> Thanks,
> Otávio Carvalho.
>
> --
> Otávio Carvalho
> Consultant Developer
> Email ocarv...@thoughtworks.com
> Telephone +55 53 91565742 <+55+53+91565742>
> [image: ThoughtWorks]
> 
>


Re: Spark 2.0.2 , using DStreams in Spark Streaming . How do I create SQLContext? Please help

2016-12-01 Thread shyla deshpande
Used SparkSession, Works now. Thanks.

On Wed, Nov 30, 2016 at 11:02 PM, Deepak Sharma 
wrote:

> In Spark 2.0+, SparkSession was introduced, which you can use to query
> Hive as well.
> Just make sure you create the SparkSession with the enableHiveSupport() option.
>
> Thanks
> Deepak
>
> On Thu, Dec 1, 2016 at 12:27 PM, shyla deshpande  > wrote:
>
>> I am Spark 2.0.2 , using DStreams because I need Cassandra Sink.
>>
>> How do I create SQLContext? I get the error SQLContext deprecated.
>>
>>
>> *[image: Inline image 1]*
>>
>> *Thanks*
>>
>>
>
>
> --
> Thanks
> Deepak
> www.bigdatabig.com
> www.keosha.net
>
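
For reference, a minimal sketch of that pattern (assuming Spark 2.0.x, a DStream
named `stream` whose elements are a case class, and that Hive support is actually
needed; otherwise enableHiveSupport() can be dropped):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("dstream-with-sql")
  .enableHiveSupport()
  .getOrCreate()

stream.foreachRDD { rdd =>
  import spark.implicits._
  val df = rdd.toDF()   // works because the element type is a case class
  df.createOrReplaceTempView("events")
  spark.sql("SELECT count(*) FROM events").show()
}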


Re: Spark 2.x Pyspark Spark SQL createDataframe Error

2016-12-01 Thread Vinayak Joshi5
This is the error received:


16/12/01 22:35:36 ERROR Schema: Failed initialising database.
Unable to open a test connection to the given database. JDBC url = 
jdbc:derby:;databaseName=metastore_db;create=true, username = APP. 
Terminating connection pool (set lazyInit to true if you expect to start 
your database after your app). Original Exception: --
java.sql.SQLException: Failed to start database 'metastore_db' with class 
loader 
org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@4494053, see 
the next exception for details.
at 
org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown 
Source)
at 
org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown 
Source)
at org.apache.derby.impl.jdbc.Util.seeNextException(Unknown 
Source)
.
.
--

org.datanucleus.exceptions.NucleusDataStoreException: Unable to open a 
test connection to the given database. JDBC url = 
jdbc:derby:;databaseName=metastore_db;create=true, username = APP. 
Terminating connection pool (set lazyInit to true if you expect to start 
your database after your app). Original Exception: --
java.sql.SQLException: Failed to start database 'metastore_db' with class 
loader 
org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@519dabfd, 
see the next exception for details.
at org.apache.derby.impl.jdb
.
.
.
NestedThrowables:
java.sql.SQLException: Unable to open a test connection to the given 
database. JDBC url = jdbc:derby:;databaseName=metastore_db;create=true, 
username = APP. Terminating connection pool (set lazyInit to true if you 
expect to start your database after your app). Original Exception: --
java.sql.SQLException: Failed to start database 'metastore_db' with class 
loader 
org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@519dabfd, 
see the next exception for details.
at 
org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown 
Source)
.
.
.
Caused by: java.sql.SQLException: Unable to open a test connection to the 
given database. JDBC url = 
jdbc:derby:;databaseName=metastore_db;create=true, username = APP. 
Terminating connection pool (set lazyInit to true if you expect to start 
your database after your app). Original Exception: --
java.sql.SQLException: Failed to start database 'metastore_db' with class 
loader 
org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@519dabfd, 
see the next exception for details.
at 
org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown 
Source)
at 
org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown 
Source)
at org.apache.derby.impl.jdbc.Util.seeNextException(Unknown 
Source)
at org.apache.derby.impl.jdbc.EmbedConnection.bootDatabase(Unknown 
Source)
at org.apache.derby.impl.jdbc.EmbedConnection.(Unknown 
Source)
.
.
.
16/12/01 22:48:09 ERROR Schema: Failed initialising database.
Unable to open a test connection to the given database. JDBC url = 
jdbc:derby:;databaseName=metastore_db;create=true, username = APP. 
Terminating connection pool (set lazyInit to true if you expect to start 
your database after your app). Original Exception: --
java.sql.SQLException: Failed to start database 'metastore_db' with class 
loader 
org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@519dabfd, 
see the next exception for details.
at 
org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown 
Source)
at 
org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown 
Source)
at org.apache.derby.impl.jdbc.Util.seeNextException(Unknown 
Source)
at org.apache.derby.impl.jdbc.EmbedConnection.bootDatabase(Unknown 
Source)
at org.apache.derby.impl.jdbc.EmbedConnection.(Unknown 
Source)
.
.
.
Caused by: java.sql.SQLException: Failed to start database 'metastore_db' 
with class loader 
org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@519dabfd, 
see the next exception for details.
at 
org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown 
Source)
at 
org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown 
Source)
at org.apache.derby.impl.jdbc.Util.seeNextException(Unknown 
Source)
at org.apache.derby.impl.jdbc.EmbedConnection.bootDatabase(Unknown 
Source)
at org.apache.derby.impl.jdbc.EmbedConnection.(Unknown 
Source)
.
.
.

Caused by: ERROR XJ040: Failed to start database 'metastore_db' with class 
loader 
org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@519dabfd, 
see the next exception for details.
at 
org.apache.derby.iapi.error.StandardException.newException(Unknown Source)
at 
org.apache.derby.impl.jdbc.SQLExceptionFactory.wrapArgsForTransportAcrossDRDA(Unknown
 
Source)
... 111 more
Caused by: ERROR XSDB6: Another instance of Derby may have already booted 
the database 

Spark 2.x Pyspark Spark SQL createDataframe Error

2016-12-01 Thread Vinayak Joshi5
With a local spark instance built with hive support, (-Pyarn -Phadoop-2.6 
-Dhadoop.version=2.6.0 -Phive -Phive-thriftserver)

The following script/sequence works in Pyspark without any error against 
1.6.x, but fails with 2.x. 

import pyspark
from pyspark.sql import SQLContext

people = sc.parallelize(["Michael,30", "Andy,12", "Justin,19"])
peoplePartsRDD = people.map(lambda p: p.split(","))
peopleRDD = peoplePartsRDD.map(lambda p: pyspark.sql.Row(name=p[0], age=int(p[1])))
peopleDF = sqlContext.createDataFrame(peopleRDD)
peopleDF.first()

sqlContext2 = SQLContext(sc)
people2 = sc.parallelize(["Abcd,40", "Efgh,14", "Ijkl,16"])
peoplePartsRDD2 = people2.map(lambda l: l.split(","))
peopleRDD2 = peoplePartsRDD2.map(lambda p: pyspark.sql.Row(fname=p[0], age=int(p[1])))
peopleDF2 = sqlContext2.createDataFrame(peopleRDD2)  # <-- error here


The error goes away if sqlContext2 is replaced with sqlContext in the 
error line. Is this a regression, or has something changed that makes this 
the expected behavior in Spark 2.x ?

Regards,
Vinayak





Unsubscribe

2016-12-01 Thread hardik nagda



RE: build models in parallel

2016-12-01 Thread Masood Krohy
You can use your groupId as a grid parameter, filter your dataset using 
this id in a pipeline stage, before feeding it to the model.
The following may help:
http://spark.apache.org/docs/latest/ml-tuning.html
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.tuning.ParamGridBuilder

The above should work, but I haven't tried it myself. What I have tried is
the following Embarrassingly Parallel architecture (as TensorFlow was a 
requirement in the use case):

See a PySpark/TensorFlow example here:
https://databricks.com/blog/2016/01/25/deep-learning-with-apache-spark-and-tensorflow.html

A relevant excerpt from the notebook mentioned above:
http://go.databricks.com/hubfs/notebooks/TensorFlow/Test_distributed_processing_of_images_using_TensorFlow.html
num_nodes = 4
n = max(2, int(len(all_experiments) // num_nodes))
grouped_experiments = [all_experiments[i:i+n] for i in range(0, len(all_experiments), n)]
all_exps_rdd = sc.parallelize(grouped_experiments, numSlices=len(grouped_experiments))
results = all_exps_rdd.flatMap(lambda z: [run(*y) for y in z]).collect()

Again, like above, you use your groupId as a parameter in the grid search; 
it works if your full dataset fits in the memory of a single machine. You 
can broadcast the dataset in a compressed format and do the preprocessing 
and feature engineering after you've done the filtering on groupId to 
maximize the size of the dataset that can use this modeling approach.
Masood


--
Masood Krohy, Ph.D. 
Data Scientist, Intact Lab-R 
Intact Financial Corporation 
http://ca.linkedin.com/in/masoodkh 



From: Xiaomeng Wan 
To: User 
Date: 2016-11-29 11:54
Subject: build models in parallel



I want to divide big data into groups (e.g. group by some id) and build one
model for each group. I am wondering whether I can parallelize the model
building process by implementing a UDAF (e.g. running linear regression in
its evaluate method). Is it good practice? Does anybody have experience? Thanks!

Regards,
Shawn



Spark-shell doesn't see changes coming from Kafka topic

2016-12-01 Thread Otávio Carvalho
Hello hivemind,

I am trying to connect my Spark 2.0.2 cluster to an Apache Kafka 0.10
cluster via spark-shell.

The connection works fine, but it is not able to receive the messages
published to the topic.

It doesn't throw any error, but it is not able to retrieve any message (I
am sure that messages are being published 'cause I am able to read from the
topic from the same machine)

Here follows the spark-shell code/output:

val ds1 = spark.readStream
  .format("kafka")
  .option("subscribe", "clickstream")
  .option("kafka.bootstrap.servers", "ec2-54-208-12-171.compute-1.amazonaws.com:9092")
  .option("startingOffsets", "latest")
  .load

// Exiting paste mode, now interpreting.

ds1: org.apache.spark.sql.DataFrame = [key: binary, value: binary ... 5 more fields]

scala> val counter = ds1.groupBy("value").count
counter: org.apache.spark.sql.DataFrame = [value: binary, count: bigint]

scala> import org.apache.spark.sql.streaming.OutputMode.Complete
import org.apache.spark.sql.streaming.OutputMode.Complete

val query = counter.writeStream
  .outputMode(Complete)
  .format("console")
  .start

// Exiting paste mode, now interpreting.

query: org.apache.spark.sql.streaming.StreamingQuery = Streaming Query - query-1 [state = ACTIVE]

scala> query.status
res0: org.apache.spark.sql.streaming.StreamingQueryStatus =
Status of query 'query-1'
Query id: 1
Status timestamp: 1480602056895
Input rate: 0.0 rows/sec
Processing rate 0.0 rows/sec
Latency: - ms
Trigger details:
  isTriggerActive: true
  statusMessage: Finding new data from sources
  timestamp.triggerStart: 1480602056894
  triggerId: -1
Source statuses [1 source]:
Source 1 - KafkaSource[Subscribe[clickstream]]
  Available offset: -
  Input rate: 0.0 rows/sec
  Processing rate: 0.0 rows/sec
  Trigger details:
    triggerId: -1
Sink status - org.apache.spark.sql.execution.streaming.ConsoleSink@54d5b6cb
Committed offsets: [-]

I am starting the spark-shell as follows:
/root/spark/bin/spark-shell --packages
org.apache.spark:spark-sql-kafka-0-10_2.10:2.0.2

Thanks,
Otávio Carvalho.

-- 
Otávio Carvalho
Consultant Developer
Email ocarv...@thoughtworks.com
Telephone +55 53 91565742 <+55+53+91565742>
[image: ThoughtWorks]



newly added Executors couldn't fetch jar files from Master

2016-12-01 Thread Evgenii Morozov
Hi

I've had a working cluster with 20 workers for more than a couple of weeks.
Everything was perfect.
Today I added 4 more workers and none of them could fetch jar files from the
master.

The following tells me that the master is available to the worker, the worker is
registered there and it started everything.
But when an executor got assigned a task, it couldn't download the jar file from
the master.

16/12/01 06:25:36 INFO CoarseGrainedExecutorBackend: Connecting to driver: 
spark://coarsegrainedschedu...@xx.xx.xx.xxx:47783 

16/12/01 06:25:36 INFO WorkerWatcher: Connecting to worker 
spark://wor...@xx.xx.xx.xx:46799 
16/12/01 06:25:36 INFO CoarseGrainedExecutorBackend: Successfully registered 
with driver
16/12/01 06:25:36 INFO Executor: Starting executor ID 304 on host 
machine059.company.com 
16/12/01 06:25:36 INFO Utils: Successfully started service 
'org.apache.spark.network.netty.NettyBlockTransferService' on port 48
844.
16/12/01 06:25:36 INFO NettyBlockTransferService: Server created on 48844
16/12/01 06:25:36 INFO BlockManagerMaster: Trying to register BlockManager
16/12/01 06:25:36 INFO BlockManagerMaster: Registered BlockManager
16/12/01 06:25:44 INFO CoarseGrainedExecutorBackend: Got assigned task 9239511
16/12/01 06:25:44 INFO Executor: Running task 46.0 in stage 159585.0 (TID 
9239511)
16/12/01 06:25:44 INFO Executor: Fetching 
http://xx.xx.xx.xxx:56027/jars/sparkws-core-1.0.0-20161118.151127-286.jar 
 
with timesta
mp 1479534760567
16/12/01 06:25:44 ERROR Executor: Exception in task 46.0 in stage 159585.0 (TID 
9239511)
java.io.FileNotFoundException: 
http://xx.xx.xx.xxx:56027/jars/sparkws-core-1.0.0-20161118.151127-286.jar 


When I issue the same download manually I'm getting a simple file-not-found.
$ curl "http://xx.xx.xx.xxx:56027/jars/sparkws-spark-1.0.0-20161118.151127-286.jar"

Error 404 Not Found
HTTP ERROR: 404
Problem accessing /jars/sparkws-spark-1.0.0-20161118.151127-286.jar. Reason: Not Found
Powered by Jetty://

Spark Environment tells jars are there
spark.jars  
/opt/sparkws/lib/sparkws-spark-1.0.0-20161118.151127-286.jar,/opt/sparkws/lib/sparkws-core-1.0.0-20161118.151127-286.jar

And they are indeed there
[user@machine004]$ ll /opt/sparkws/lib/ | grep sparkws
  156455 Nov 18 15:12 sparkws-core-1.0.0-20161118.151127-286.jar
18944178 Nov 18 15:14 sparkws-spark-1.0.0-20161118.151127-286.jar

Anyone might know what might be an issue here?



How to Check Dstream is empty or not?

2016-12-01 Thread rockinf...@gmail.com
I have integerated flume with spark using Flume-style Push-based Approach. 
I need to check whether Dstream is empty. Please suggest how can i do that?






Re: java.lang.Exception: Could not compute split, block input-0-1480539568000 not found

2016-12-01 Thread Marco Mistroni
Kant,
We need to narrow it down to reproducible code. You are using streaming.
What is the content of your streamed data? If you provide that, I can run a
streaming program that reads from a local dir and narrow down the
problem.
I have seen a similar error when doing something completely different. As you
say, there might be a problem with your transformation coming from the
structure of the data. Send me a sample of the data you are streaming and I
will write a small test case.
Kr

On 1 Dec 2016 9:44 am, "kant kodali"  wrote:

> sorry for multiple emails. I just think more info is needed every time to
> address this problem
>
> My Spark Client program runs in a client mode and it runs on a node that
> has 2 vCPU's and 8GB RAM (m4.large)
> I have 2 Spark worker nodes and each have 4 vCPU's and 16GB RAM
>  (m3.xlarge for each spark worker instance)
>
>
>
> On Thu, Dec 1, 2016 at 12:55 AM, kant kodali  wrote:
>
>> My batch interval is 1s
>> slide interval is 1s
>> window interval is 1 minute
>>
>> I am using a standalone alone cluster. I don't have any storage layer
>> like HDFS.  so I dont know what is a connection between RDD and blocks (I
>> know that for every batch one RDD is produced)? what is a block in this
>> context? is it a disk block ? if so, what is it default size? and Finally,
>> why does the following error happens so often?
>>
>> java.lang.Exception: Could not compute split, block input-0-1480539568000
>> not found
>>
>>
>>
>> On Thu, Dec 1, 2016 at 12:42 AM, kant kodali  wrote:
>>
>>> I also use this super(StorageLevel.MEMORY_AND_DISK_2());
>>>
>>> inside my receiver
>>>
>>> On Wed, Nov 30, 2016 at 10:44 PM, kant kodali 
>>> wrote:
>>>
 Here is another transformation that might cause the error but it has to
 be one of these two since I only have two transformations

 jsonMessagesDStream
 .window(new Duration(6), new Duration(1000))
 .mapToPair(new PairFunction() {
 @Override
 public Tuple2 call(String s) throws Exception {
 //System.out.println(s + " *");
 JsonParser parser = new JsonParser();
 JsonObject jsonObj = parser.parse(s).getAsJsonObject();

 if (jsonObj != null && jsonObj.has("var1")) {
 JsonObject jsonObject = 
 jsonObj.get("var1").getAsJsonObject();
 if (jsonObject != null && jsonObject.has("var2") && 
 jsonObject.get("var2").getAsBoolean() && jsonObject.has("var3") ) {
 long num = jsonObject.get("var3").getAsLong();

 return new Tuple2("var3", num);
 }
 }

 return new Tuple2("var3", 0L);
 }
 }).reduceByKey(new Function2() {
 @Override
 public Long call(Long v1, Long v2) throws Exception {
 return v1+v2;
  }
 }).foreachRDD(new VoidFunction>() {
 @Override
 public void call(JavaPairRDD 
 stringIntegerJavaPairRDD) throws Exception {
 Map map = new HashMap<>();
 Gson gson = new Gson();
 stringIntegerJavaPairRDD
 .collect()
 .forEach((Tuple2 KV) -> {
 String status = KV._1();
 Long count = KV._2();
 map.put(status, count);
 }
 );
 NSQReceiver.send(producer, "dashboard", 
 gson.toJson(map).getBytes());
 }
 });


 On Wed, Nov 30, 2016 at 10:40 PM, kant kodali 
 wrote:

> Hi Marco,
>
>
> Here is what my code looks like
>
> Config config = new Config("hello");
> SparkConf sparkConf = config.buildSparkConfig();
> sparkConf.setJars(JavaSparkContext.jarOfClass(Driver.class));
> JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, new 
> Duration(config.getSparkStremingBatchInterval()));
> ssc.sparkContext().setLogLevel("ERROR");
>
>
> NSQReceiver sparkStreamingReceiver = new NSQReceiver(config, 
> "input_test");
> JavaReceiverInputDStream jsonMessagesDStream = 
> ssc.receiverStream(sparkStreamingReceiver);
>
>
> NSQProducer producer = new NSQProducer()
> .addAddress(config.getServerConfig().getProperty("NSQD_IP"), 
> Integer.parseInt(config.getServerConfig().getProperty("NSQD_PORT")))
> .start();
>
> 

Re: java.lang.Exception: Could not compute split, block input-0-1480539568000 not found

2016-12-01 Thread kant kodali
sorry for multiple emails. I just think more info is needed every time to
address this problem

My Spark Client program runs in a client mode and it runs on a node that
has 2 vCPU's and 8GB RAM (m4.large)
I have 2 Spark worker nodes and each have 4 vCPU's and 16GB RAM  (m3.xlarge
for each spark worker instance)



On Thu, Dec 1, 2016 at 12:55 AM, kant kodali  wrote:

> My batch interval is 1s
> slide interval is 1s
> window interval is 1 minute
>
> I am using a standalone alone cluster. I don't have any storage layer like
> HDFS.  so I dont know what is a connection between RDD and blocks (I know
> that for every batch one RDD is produced)? what is a block in this context?
> is it a disk block ? if so, what is it default size? and Finally, why does
> the following error happens so often?
>
> java.lang.Exception: Could not compute split, block input-0-1480539568000
> not found
>
>
>
> On Thu, Dec 1, 2016 at 12:42 AM, kant kodali  wrote:
>
>> I also use this super(StorageLevel.MEMORY_AND_DISK_2());
>>
>> inside my receiver
>>
>> On Wed, Nov 30, 2016 at 10:44 PM, kant kodali  wrote:
>>
>>> Here is another transformation that might cause the error but it has to
>>> be one of these two since I only have two transformations
>>>
>>> jsonMessagesDStream
>>> .window(new Duration(6), new Duration(1000))
>>> .mapToPair(new PairFunction() {
>>> @Override
>>> public Tuple2 call(String s) throws Exception {
>>> //System.out.println(s + " *");
>>> JsonParser parser = new JsonParser();
>>> JsonObject jsonObj = parser.parse(s).getAsJsonObject();
>>>
>>> if (jsonObj != null && jsonObj.has("var1")) {
>>> JsonObject jsonObject = 
>>> jsonObj.get("var1").getAsJsonObject();
>>> if (jsonObject != null && jsonObject.has("var2") && 
>>> jsonObject.get("var2").getAsBoolean() && jsonObject.has("var3") ) {
>>> long num = jsonObject.get("var3").getAsLong();
>>>
>>> return new Tuple2("var3", num);
>>> }
>>> }
>>>
>>> return new Tuple2("var3", 0L);
>>> }
>>> }).reduceByKey(new Function2() {
>>> @Override
>>> public Long call(Long v1, Long v2) throws Exception {
>>> return v1+v2;
>>>  }
>>> }).foreachRDD(new VoidFunction>() {
>>> @Override
>>> public void call(JavaPairRDD 
>>> stringIntegerJavaPairRDD) throws Exception {
>>> Map map = new HashMap<>();
>>> Gson gson = new Gson();
>>> stringIntegerJavaPairRDD
>>> .collect()
>>> .forEach((Tuple2 KV) -> {
>>> String status = KV._1();
>>> Long count = KV._2();
>>> map.put(status, count);
>>> }
>>> );
>>> NSQReceiver.send(producer, "dashboard", 
>>> gson.toJson(map).getBytes());
>>> }
>>> });
>>>
>>>
>>> On Wed, Nov 30, 2016 at 10:40 PM, kant kodali 
>>> wrote:
>>>
 Hi Marco,


 Here is what my code looks like

 Config config = new Config("hello");
 SparkConf sparkConf = config.buildSparkConfig();
 sparkConf.setJars(JavaSparkContext.jarOfClass(Driver.class));
 JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, new 
 Duration(config.getSparkStremingBatchInterval()));
 ssc.sparkContext().setLogLevel("ERROR");


 NSQReceiver sparkStreamingReceiver = new NSQReceiver(config, "input_test");
 JavaReceiverInputDStream jsonMessagesDStream = 
 ssc.receiverStream(sparkStreamingReceiver);


 NSQProducer producer = new NSQProducer()
 .addAddress(config.getServerConfig().getProperty("NSQD_IP"), 
 Integer.parseInt(config.getServerConfig().getProperty("NSQD_PORT")))
 .start();

 jsonMessagesDStream
 .mapToPair(new PairFunction() {
 @Override
 public Tuple2 call(String s) throws Exception 
 {
 JsonParser parser = new JsonParser();
 JsonObject jsonObj = parser.parse(s).getAsJsonObject();
 if (jsonObj != null && jsonObj.has("var1") ) {
 JsonObject transactionObject = 
 jsonObj.get("var1").getAsJsonObject();
 if(transactionObject != null && 
 transactionObject.has("var2")) {
 String key = 
 

Re: Spark Job not exited and shows running

2016-12-01 Thread Selvam Raman
Hi,

I have run the job in cluster mode as well. The job is not ending. After
some time the container just does nothing, but it still shows as running.

In my code, every record is inserted into both Solr and Cassandra.
When I ran it only for Solr, the job completed successfully. I have not yet
tested the Cassandra part. Will check and update.

Has anyone faced this issue before?

I added sparksession.stop after the foreachPartition ends.


My code overview:

SparkSession
read parquet file(20 partition- roughly 90k records)
foreachpartition
  every record do some compution
  insert into cassandra(  i am using insert command )
 index into solr

stop the sparksession
exit the code.




Thanks,
selvam R

On Thu, Dec 1, 2016 at 7:03 AM, Daniel van der Ende <
daniel.vandere...@gmail.com> wrote:

> Hi,
>
> I've seen this a few times too. Usually it indicates that your driver
> doesn't have enough resources to process the result. Sometimes increasing
> driver memory is enough (yarn memory overhead can also help). Is there any
> specific reason for you to run in client mode and not in cluster mode?
> Having run into this a number of times (and wanting to spare the resources
> of our submitting machines) we have now switched to use yarn cluster mode
> by default. This seems to resolve the problem.
>
> Hope this helps,
>
> Daniel
>
> On 29 Nov 2016 11:20 p.m., "Selvam Raman"  wrote:
>
>> Hi,
>>
>> I have submitted spark job in yarn client mode. The executor and cores
>> were dynamically allocated. In the job i have 20 partitions, so 5 container
>> each with 4 core has been submitted. It almost processed all the records
>> but it never exit the job and in the application master container i am
>> seeing the below error message.
>>
>>  INFO yarn.YarnAllocator: Canceling requests for 0 executor containers
>>  WARN yarn.YarnAllocator: Expected to find pending requests, but found none.
>>
>>
>>
>> ​The same job i ran it for only 1000 records which successfully finished.
>> ​
>>
>> Can anyone help me to sort out this issue.
>>
>> Spark version:2.0( AWS EMR).
>>
>> --
>> Selvam Raman
>> "லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"
>>
>


-- 
Selvam Raman
"லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"


Re: java.lang.Exception: Could not compute split, block input-0-1480539568000 not found

2016-12-01 Thread kant kodali
My batch interval is 1s
slide interval is 1s
window interval is 1 minute

I am using a standalone cluster. I don't have any storage layer like
HDFS, so I don't know what the connection is between RDDs and blocks (I know
that for every batch one RDD is produced). What is a block in this context?
Is it a disk block? If so, what is its default size? And finally, why does
the following error happen so often?

java.lang.Exception: Could not compute split, block input-0-1480539568000
not found



On Thu, Dec 1, 2016 at 12:42 AM, kant kodali  wrote:

> I also use this super(StorageLevel.MEMORY_AND_DISK_2());
>
> inside my receiver
>
> On Wed, Nov 30, 2016 at 10:44 PM, kant kodali  wrote:
>
>> Here is another transformation that might cause the error but it has to
>> be one of these two since I only have two transformations
>>
>> jsonMessagesDStream
>> .window(new Duration(6), new Duration(1000))
>> .mapToPair(new PairFunction() {
>> @Override
>> public Tuple2 call(String s) throws Exception {
>> //System.out.println(s + " *");
>> JsonParser parser = new JsonParser();
>> JsonObject jsonObj = parser.parse(s).getAsJsonObject();
>>
>> if (jsonObj != null && jsonObj.has("var1")) {
>> JsonObject jsonObject = 
>> jsonObj.get("var1").getAsJsonObject();
>> if (jsonObject != null && jsonObject.has("var2") && 
>> jsonObject.get("var2").getAsBoolean() && jsonObject.has("var3") ) {
>> long num = jsonObject.get("var3").getAsLong();
>>
>> return new Tuple2("var3", num);
>> }
>> }
>>
>> return new Tuple2("var3", 0L);
>> }
>> }).reduceByKey(new Function2() {
>> @Override
>> public Long call(Long v1, Long v2) throws Exception {
>> return v1+v2;
>>  }
>> }).foreachRDD(new VoidFunction>() {
>> @Override
>> public void call(JavaPairRDD 
>> stringIntegerJavaPairRDD) throws Exception {
>> Map map = new HashMap<>();
>> Gson gson = new Gson();
>> stringIntegerJavaPairRDD
>> .collect()
>> .forEach((Tuple2 KV) -> {
>> String status = KV._1();
>> Long count = KV._2();
>> map.put(status, count);
>> }
>> );
>> NSQReceiver.send(producer, "dashboard", 
>> gson.toJson(map).getBytes());
>> }
>> });
>>
>>
>> On Wed, Nov 30, 2016 at 10:40 PM, kant kodali  wrote:
>>
>>> Hi Marco,
>>>
>>>
>>> Here is what my code looks like
>>>
>>> Config config = new Config("hello");
>>> SparkConf sparkConf = config.buildSparkConfig();
>>> sparkConf.setJars(JavaSparkContext.jarOfClass(Driver.class));
>>> JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, new 
>>> Duration(config.getSparkStremingBatchInterval()));
>>> ssc.sparkContext().setLogLevel("ERROR");
>>>
>>>
>>> NSQReceiver sparkStreamingReceiver = new NSQReceiver(config, "input_test");
>>> JavaReceiverInputDStream jsonMessagesDStream = 
>>> ssc.receiverStream(sparkStreamingReceiver);
>>>
>>>
>>> NSQProducer producer = new NSQProducer()
>>> .addAddress(config.getServerConfig().getProperty("NSQD_IP"), 
>>> Integer.parseInt(config.getServerConfig().getProperty("NSQD_PORT")))
>>> .start();
>>>
>>> jsonMessagesDStream
>>> .mapToPair(new PairFunction() {
>>> @Override
>>> public Tuple2 call(String s) throws Exception {
>>> JsonParser parser = new JsonParser();
>>> JsonObject jsonObj = parser.parse(s).getAsJsonObject();
>>> if (jsonObj != null && jsonObj.has("var1") ) {
>>> JsonObject transactionObject = 
>>> jsonObj.get("var1").getAsJsonObject();
>>> if(transactionObject != null && 
>>> transactionObject.has("var2")) {
>>> String key = 
>>> transactionObject.get("var2").getAsString();
>>> return new Tuple2<>(key, 1);
>>> }
>>> }
>>> return new Tuple2<>("", 0);
>>> }
>>> }).reduceByKey(new Function2() {
>>> @Override
>>> public Integer call(Integer v1, Integer v2) throws 
>>> Exception {
>>> return v1+v2;
>>> }
>>> }).foreachRDD(new VoidFunction>() {
>>>  

Re: [structured streaming] How to remove outdated data when use Window Operations

2016-12-01 Thread Tathagata Das
In the meantime, if you are interested, you can read the design doc in the
corresponding JIRA - https://issues.apache.org/jira/browse/SPARK-18124

On Thu, Dec 1, 2016 at 12:53 AM, Tathagata Das 
wrote:

> That feature is coming in 2.1.0. We have added watermarking, that will
> track the event time of the data and accordingly close old windows, output
> its corresponding aggregate and then drop its corresponding state. But in
> that case, you will have to use append mode, and aggregated data of a
> particular window will be evicted only when the window is closed. You will
> be able to control the threshold on how long to wait for late, out-of-order
> data before closing a window.
>
> We will be updating the docs soon to explain this.
>
> On Tue, Nov 29, 2016 at 8:30 PM, Xinyu Zhang  wrote:
>
>> Hi
>>
>> I want to use window operations. However, if i don't remove any data, the
>> "complete" table will become larger and larger as time goes on. So I want
>> to remove some outdated data in the complete table that I would never use.
>> Is there any method to meet my requirement?
>>
>> Thanks!
>>
>>
>>
>>
>>
>
>


Re: [structured streaming] How to remove outdated data when use Window Operations

2016-12-01 Thread Tathagata Das
That feature is coming in 2.1.0. We have added watermarking, that will
track the event time of the data and accordingly close old windows, output
its corresponding aggregate and then drop its corresponding state. But in
that case, you will have to use append mode, and aggregated data of a
particular window will be evicted only when the window is closed. You will
be able to control the threshold on how long to wait for late, out-of-order
data before closing a window.

We will be updating the docs soon to explain this.

On Tue, Nov 29, 2016 at 8:30 PM, Xinyu Zhang  wrote:

> Hi
>
> I want to use window operations. However, if i don't remove any data, the
> "complete" table will become larger and larger as time goes on. So I want
> to remove some outdated data in the complete table that I would never use.
> Is there any method to meet my requirement?
>
> Thanks!
>
>
>
>
>
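
For reference, a minimal sketch of what the 2.1 watermarking API described above
looks like (assuming a streaming DataFrame `events` with an event-time column
`eventTime`, a grouping column `key`, and a SparkSession named `spark`):

import org.apache.spark.sql.functions.window
import spark.implicits._

val windowedCounts = events
  .withWatermark("eventTime", "10 minutes")            // how long to wait for late data
  .groupBy(window($"eventTime", "5 minutes"), $"key")  // event-time windows
  .count()

val query = windowedCounts.writeStream
  .outputMode("append")  // a window's aggregate is emitted, and its state dropped, once the watermark passes it
  .format("console")
  .start()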


Re: java.lang.Exception: Could not compute split, block input-0-1480539568000 not found

2016-12-01 Thread kant kodali
I also use this super(StorageLevel.MEMORY_AND_DISK_2());

inside my receiver

On Wed, Nov 30, 2016 at 10:44 PM, kant kodali  wrote:

> Here is another transformation that might cause the error but it has to be
> one of these two since I only have two transformations
>
> jsonMessagesDStream
> .window(new Duration(6), new Duration(1000))
> .mapToPair(new PairFunction() {
> @Override
> public Tuple2 call(String s) throws Exception {
> //System.out.println(s + " *");
> JsonParser parser = new JsonParser();
> JsonObject jsonObj = parser.parse(s).getAsJsonObject();
>
> if (jsonObj != null && jsonObj.has("var1")) {
> JsonObject jsonObject = 
> jsonObj.get("var1").getAsJsonObject();
> if (jsonObject != null && jsonObject.has("var2") && 
> jsonObject.get("var2").getAsBoolean() && jsonObject.has("var3") ) {
> long num = jsonObject.get("var3").getAsLong();
>
> return new Tuple2("var3", num);
> }
> }
>
> return new Tuple2("var3", 0L);
> }
> }).reduceByKey(new Function2() {
> @Override
> public Long call(Long v1, Long v2) throws Exception {
> return v1+v2;
>  }
> }).foreachRDD(new VoidFunction>() {
> @Override
> public void call(JavaPairRDD 
> stringIntegerJavaPairRDD) throws Exception {
> Map map = new HashMap<>();
> Gson gson = new Gson();
> stringIntegerJavaPairRDD
> .collect()
> .forEach((Tuple2 KV) -> {
> String status = KV._1();
> Long count = KV._2();
> map.put(status, count);
> }
> );
> NSQReceiver.send(producer, "dashboard", 
> gson.toJson(map).getBytes());
> }
> });
>
>
> On Wed, Nov 30, 2016 at 10:40 PM, kant kodali  wrote:
>
>> Hi Marco,
>>
>>
>> Here is what my code looks like
>>
>> Config config = new Config("hello");
>> SparkConf sparkConf = config.buildSparkConfig();
>> sparkConf.setJars(JavaSparkContext.jarOfClass(Driver.class));
>> JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, new 
>> Duration(config.getSparkStremingBatchInterval()));
>> ssc.sparkContext().setLogLevel("ERROR");
>>
>>
>> NSQReceiver sparkStreamingReceiver = new NSQReceiver(config, "input_test");
>> JavaReceiverInputDStream jsonMessagesDStream = 
>> ssc.receiverStream(sparkStreamingReceiver);
>>
>>
>> NSQProducer producer = new NSQProducer()
>> .addAddress(config.getServerConfig().getProperty("NSQD_IP"), 
>> Integer.parseInt(config.getServerConfig().getProperty("NSQD_PORT")))
>> .start();
>>
>> jsonMessagesDStream
>> .mapToPair(new PairFunction() {
>> @Override
>> public Tuple2 call(String s) throws Exception {
>> JsonParser parser = new JsonParser();
>> JsonObject jsonObj = parser.parse(s).getAsJsonObject();
>> if (jsonObj != null && jsonObj.has("var1") ) {
>> JsonObject transactionObject = 
>> jsonObj.get("var1").getAsJsonObject();
>> if(transactionObject != null && 
>> transactionObject.has("var2")) {
>> String key = 
>> transactionObject.get("var2").getAsString();
>> return new Tuple2<>(key, 1);
>> }
>> }
>> return new Tuple2<>("", 0);
>> }
>> }).reduceByKey(new Function2() {
>> @Override
>> public Integer call(Integer v1, Integer v2) throws Exception 
>> {
>> return v1+v2;
>> }
>> }).foreachRDD(new VoidFunction>() {
>> @Override
>> public void call(JavaPairRDD 
>> stringIntegerJavaPairRDD) throws Exception {
>> Map map = new HashMap<>();
>> Gson gson = new Gson();
>> stringIntegerJavaPairRDD
>> .collect()
>> .forEach((Tuple2 KV) -> {
>> String status = KV._1();
>> Integer count = KV._2();
>> map.put(status, count);
>> }
>> );
>> 

RE: PySpark to remote cluster

2016-12-01 Thread Schaefers, Klaus
Hi,

I moved my PySpark to 2.0.1 and now I can connect. However, I cannot execute
any job. I always get a "16/12/01 09:37:07 WARN TaskSchedulerImpl: Initial
job has not accepted any resources; check your cluster UI to ensure that
workers are registered and have sufficient resources" error. I added more
resources to the nodes and restricted the default cores and memory so PySpark
does not consume them all, but I still cannot count the size of a 5-element RDD!

Any Idea what might be causing this?

Best,

Klaus

From: Felix Cheung [mailto:felixcheun...@hotmail.com]
Sent: Thursday, December 1, 2016 12:44 AM
To: user@spark.apache.org; Schaefers, Klaus 
Subject: Re: PySpark to remote cluster

Spark 2.0.1 is running with a different py4j library than Spark 1.6.

You will probably run into other problems mixing versions though - is there a 
reason you can't run Spark 1.6 on the client?


_
From: Klaus Schaefers 
>
Sent: Wednesday, November 30, 2016 2:44 AM
Subject: PySpark to remote cluster
To: >


Hi,

I want to connect with a local Jupyter Notebook to a remote Spark cluster.
The Cluster is running Spark 2.0.1 and the Jupyter notebook is based on
Spark 1.6 and running in a docker image (Link). I try to init the
SparkContext like this:

import pyspark
sc = pyspark.SparkContext('spark://:7077')

However, this gives me the following exception:


ERROR:py4j.java_gateway:Error while sending or receiving.
Traceback (most recent call last):
File "/usr/local/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py",
line 746, in send_command
raise Py4JError("Answer from Java side is empty")
py4j.protocol.Py4JError: Answer from Java side is empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py",
line 626, in send_command
response = connection.send_command(command)
File "/usr/local/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py",
line 750, in send_command
raise Py4JNetworkError("Error while sending or receiving", e)
py4j.protocol.Py4JNetworkError: Error while sending or receiving

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py",
line 740, in send_command
answer = smart_decode(self.stream.readline()[:-1])
File "/opt/conda/lib/python3.5/socket.py", line 575, in readinto
return self._sock.recv_into(b)
ConnectionResetError: [Errno 104] Connection reset by peer
ERROR:py4j.java_gateway:An error occurred while trying to connect to the
Java server
Traceback (most recent call last):
File "/usr/local/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py",
line 746, in send_command
raise Py4JError("Answer from Java side is empty")
py4j.protocol.Py4JError: Answer from Java side is empty

...

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/opt/conda/lib/python3.5/site-packages/IPython/utils/PyColorize.py",
line 262, in format2
for atoken in generate_tokens(text.readline):
File "/opt/conda/lib/python3.5/tokenize.py", line 597, in _tokenize
raise TokenError("EOF in multi-line statement", (lnum, 0))
tokenize.TokenError: ('EOF in multi-line statement', (2, 0))


Is this error caused by the different spark versions?

Best,

Klaus







