Re: Re: driver ClassNotFoundException when MySQL JDBC exceptions are thrown on executor

2015-11-19 Thread Zsolt Tóth
Hi,

this is exactly the same as my issue, seems to be a bug in 1.5.x.
(see my thread for details)

2015-11-19 11:20 GMT+01:00 Jeff Zhang :

> Seems your jdbc url is not correct. Should be jdbc:mysql://
> 192.168.41.229:3306
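>
> For example, something like this should work (a minimal sketch; it assumes the
> connector jar is also shipped to the executors via --jars so the driver class can
> be loaded there):
>
>     val jdbcDF = sqlContext.read.format("jdbc").options(Map(
>       "url" -> "jdbc:mysql://192.168.41.229:3306",
>       "driver" -> "com.mysql.jdbc.Driver",
>       "dbtable" -> "sqoop.SQOOP_ROOT")).load()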
>
> On Thu, Nov 19, 2015 at 6:03 PM,  wrote:
>
>> hi guys,
>>
>>I also found that --driver-class-path and spark.driver.extraClassPath
>> are not working when I'm accessing the MySQL driver in my Spark app.
>>
>>attached is my config for my app.
>>
>>
>> here are my command and the logs of the failure I encountered.
>>
>>
>> [root@h170 spark]#  bin/spark-shell --master spark://h170:7077
>> --driver-class-path lib/mysql-connector-java-5.1.32-bin.jar --jars
>> lib/mysql-connector-java-5.1.32-bin.jar
>>
>> log4j:WARN No appenders could be found for logger
>> (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
>>
>> log4j:WARN Please initialize the log4j system properly.
>>
>> log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for
>> more info.
>>
>> Using Spark's repl log4j profile:
>> org/apache/spark/log4j-defaults-repl.properties
>>
>> To adjust logging level use sc.setLogLevel("INFO")
>>
>> Welcome to
>>
>>       ____              __
>>      / __/__  ___ _____/ /__
>>     _\ \/ _ \/ _ `/ __/  '_/
>>    /___/ .__/\_,_/_/ /_/\_\   version 1.5.2
>>       /_/
>>
>>
>> Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java
>> 1.7.0_79)
>>
>> Type in expressions to have them evaluated.
>>
>> Type :help for more information.
>>
>> 15/11/19 17:51:33 WARN MetricsSystem: Using default name DAGScheduler for
>> source because spark.app.id is not set.
>>
>> Spark context available as sc.
>>
>> 15/11/19 17:51:35 WARN Connection: BoneCP specified but not present in
>> CLASSPATH (or one of dependencies)
>>
>> 15/11/19 17:51:35 WARN Connection: BoneCP specified but not present in
>> CLASSPATH (or one of dependencies)
>>
>> 15/11/19 17:51:48 WARN ObjectStore: Version information not found in
>> metastore. hive.metastore.schema.verification is not enabled so recording
>> the schema version 1.2.0
>>
>> 15/11/19 17:51:48 WARN ObjectStore: Failed to get database default,
>> returning NoSuchObjectException
>>
>> 15/11/19 17:51:50 WARN NativeCodeLoader: Unable to load native-hadoop
>> library for your platform... using builtin-java classes where applicable
>>
>> 15/11/19 17:51:50 WARN Connection: BoneCP specified but not present in
>> CLASSPATH (or one of dependencies)
>>
>> 15/11/19 17:51:50 WARN Connection: BoneCP specified but not present in
>> CLASSPATH (or one of dependencies)
>>
>> SQL context available as sqlContext.
>>
>>
>> scala> val jdbcDF = sqlContext.read.format("jdbc").options(Map("url" ->
>> "mysql://192.168.41.229:3306",  "dbtable" -> "sqoop.SQOOP_ROOT")).load()
>>
>> java.sql.SQLException: No suitable driver found for mysql://
>> 192.168.41.229:3306
>>
>> at java.sql.DriverManager.getConnection(DriverManager.java:596)
>>
>> at java.sql.DriverManager.getConnection(DriverManager.java:187)
>>
>> at
>> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anonfun$getConnector$1.apply(JDBCRDD.scala:188)
>>
>> at
>> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anonfun$getConnector$1.apply(JDBCRDD.scala:181)
>>
>> at
>> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:121)
>>
>> at
>> org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:91)
>>
>> at
>> org.apache.spark.sql.execution.datasources.jdbc.DefaultSource.createRelation(DefaultSource.scala:60)
>>
>> at
>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:125)
>>
>> at
>> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:114)
>>
>> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:19)
>>
>> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:24)
>>
>> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:26)
>>
>> at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:28)
>>
>> at $iwC$$iwC$$iwC$$iwC.<init>(<console>:30)
>>
>> at $iwC$$iwC$$iwC.<init>(<console>:32)
>>
>> at $iwC$$iwC.<init>(<console>:34)
>>
>> at $iwC.<init>(<console>:36)
>>
>> at <init>(<console>:38)
>>
>> at .<init>(<console>:42)
>>
>> at .<clinit>(<console>)
>>
>> at .<init>(<console>:7)
>>
>> at .<clinit>(<console>)
>>
>> at $print(<console>)
>>
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>
>> at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>
>> at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>
>> at java.lang.reflect.Method.invoke(Method.java:606)
>>
>> at
>> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
>>
>> at
>> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1340)
>>
>> at
>> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
>>
>> 

[SPARK STREAMING] multiple hosts and multiple ports for Stream job

2015-11-19 Thread diplomatic Guru
Hello team,

I was wondering whether it is a good idea to have multiple hosts and
multiple ports for a Spark job. Let's say there are two hosts and each has 2
ports; is this a good idea? If it is not an issue, what is the best way to do
it? Currently, we pass them as a comma-separated argument.
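
For reference, one common pattern for consuming such an argument is to create one
receiver per host:port pair and union the resulting streams (a minimal sketch,
assuming socket-style receivers, a StreamingContext named ssc, and an argument like
"host1:9999,host1:9998,host2:9999"):

    // parse "host:port,host:port,..." into (host, port) pairs
    val hostPorts = args(0).split(",").map { hp =>
      val Array(host, port) = hp.split(":")
      (host, port.toInt)
    }
    // one receiver (and therefore one core) per host:port pair
    val streams = hostPorts.toSeq.map { case (host, port) => ssc.socketTextStream(host, port) }
    val unified = ssc.union(streams)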


Re: Distinct on key-value pair of JavaRDD

2015-11-19 Thread Ramkumar V
I thought there would be a specific function for this, but I'm using reduceByKey now.
It's working fine. Thanks a lot.
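
(For the archive, a minimal Scala sketch of that approach, assuming an RDD of
(key, value) pairs where keeping any one value per key is acceptable:)

    // keep an arbitrary value per key
    val distinctByKey = pairRdd.reduceByKey((v1, _) => v1)
    // or, if only the keys matter
    val distinctKeys = pairRdd.keys.distinct()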

*Thanks*,



On Tue, Nov 17, 2015 at 6:21 PM, ayan guha  wrote:

> How about using reducebykey?
> On 17 Nov 2015 22:00, "Ramkumar V"  wrote:
>
>> Hi,
>>
>> I have a JavaRDD of key-value pairs. I would like to do distinct only on the key,
>> but the normal distinct applies to both key and value. I want to apply it only
>> on the key. How can I do that?
>>
>> Any help is appreciated.
>>
>> *Thanks*,
>> 
>>
>>


Re: Calculating Timeseries Aggregation

2015-11-19 Thread Sanket Patil
Hey Sandip:

TD has already outlined the right approach, but let me add a couple of
thoughts as I recently worked on a similar project. I had to compute some
real-time metrics on streaming data. Also, these metrics had to be
aggregated for hour/day/week/month. My data pipeline was Kafka --> Spark
Streaming --> Cassandra.

I had a spark streaming job that did the following: (1) receive a window of
raw streaming data and write it to Cassandra, and (2) do only the basic
computations that need to be shown on a real-time dashboard, and store the
results in Cassandra. (I had to use sliding window as my computation
involved joining data that might occur in different time windows.)

I had a separate set of Spark jobs that pulled the raw data from Cassandra,
computed the aggregations and more complex metrics, and wrote it back to
the relevant Cassandra tables. These jobs ran periodically every few
minutes.
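
A minimal sketch of the windowed aggregation step (Scala; the (key, value) shape
and the window/slide durations are illustrative, not the exact job):

    import org.apache.spark.streaming.{Minutes, Seconds}

    // `metrics` is a placeholder DStream[(String, Long)] of (dimensionKey, value) pairs
    val windowed = metrics.reduceByKeyAndWindow(
      (a: Long, b: Long) => a + b,  // sum values within the window
      Minutes(1),                   // window length
      Seconds(10))                  // slide interval (matches a 10s batch)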

Regards,
Sanket

On Thu, Nov 19, 2015 at 8:09 AM, Sandip Mehta 
wrote:

> Thank you TD for your time and help.
>
> SM
>
> On 19-Nov-2015, at 6:58 AM, Tathagata Das  wrote:
>
> There are different ways to do the rollups. Either update rollups from the
> streaming application, or you can generate roll ups in a later process -
> say periodic Spark job every hour. Or you could just generate rollups on
> demand, when it is queried.
> The whole thing depends on your downstream requirements - if you always need
> up-to-date rollups to show up in the dashboard (even day-level stuff),
> then the first approach is better. Otherwise, the second and third approaches
> are more efficient.
>
> TD
>
>
> On Wed, Nov 18, 2015 at 7:15 AM, Sandip Mehta 
> wrote:
>
>> TD thank you for your reply.
>>
>> I agree on data store requirement. I am using HBase as an underlying
>> store.
>>
>> So for every batch interval of say 10 seconds
>>
>> - Calculate the time dimension ( minutes, hours, day, week, month and
>> quarter ) along with other dimensions and metrics
>> - Update relevant base table at each batch interval for relevant metrics
>> for a given set of dimensions.
>>
>> Only caveat I see is I’ll have to update each of the different roll up
>> table for each batch window.
>>
>> Is this a valid approach for calculating time series aggregation?
>>
>> Regards
>> SM
>>
>> For minutes level aggregates I have set up a streaming window say 10
>> seconds and storing minutes level aggregates across multiple dimension in
>> HBase at every window interval.
>>
>> On 18-Nov-2015, at 7:45 AM, Tathagata Das  wrote:
>>
>> For this sort of long term aggregations you should use a dedicated data
>> storage systems. Like a database, or a key-value store. Spark Streaming
>> would just aggregate and push the necessary data to the data store.
>>
>> TD
>>
>> On Sat, Nov 14, 2015 at 9:32 PM, Sandip Mehta > > wrote:
>>
>>> Hi,
>>>
>>> I am working on requirement of calculating real time metrics and
>>> building prototype  on Spark streaming. I need to build aggregate at
>>> Seconds, Minutes, Hours and Day level.
>>>
>>> I am not sure whether I should calculate all these aggregates as
>>> different Windowed function on input DStream or shall I use
>>> updateStateByKey function for the same. If I have to use updateStateByKey
>>> for these time series aggregation, how can I remove keys from the state
>>> after different time lapsed?
>>>
>>> Please suggest.
>>>
>>> Regards
>>> SM
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>>
>
>


-- 
SuperReceptionist is now available on Android mobiles. Track your business
on the go with call analytics, recordings, insights and more: Download the
app here





Re: Calculating Timeseries Aggregation

2015-11-19 Thread Sandip Mehta
Thank you Sanket for the feedback.

Regards
SM
> On 19-Nov-2015, at 1:57 PM, Sanket Patil  wrote:
> 
> Hey Sandip:
> 
> TD has already outlined the right approach, but let me add a couple of 
> thoughts as I recently worked on a similar project. I had to compute some 
> real-time metrics on streaming data. Also, these metrics had to be aggregated 
> for hour/day/week/month. My data pipeline was Kafka --> Spark Streaming --> 
> Cassandra.
> 
> I had a spark streaming job that did the following: (1) receive a window of 
> raw streaming data and write it to Cassandra, and (2) do only the basic 
> computations that need to be shown on a real-time dashboard, and store the 
> results in Cassandra. (I had to use sliding window as my computation involved 
> joining data that might occur in different time windows.)
> 
> I had a separate set of Spark jobs that pulled the raw data from Cassandra, 
> computed the aggregations and more complex metrics, and wrote it back to the 
> relevant Cassandra tables. These jobs ran periodically every few minutes.
> 
> Regards,
> Sanket
> 
> On Thu, Nov 19, 2015 at 8:09 AM, Sandip Mehta  > wrote:
> Thank you TD for your time and help.
> 
> SM
>> On 19-Nov-2015, at 6:58 AM, Tathagata Das > > wrote:
>> 
>> There are different ways to do the rollups. Either update rollups from the 
>> streaming application, or you can generate roll ups in a later process - say 
>> periodic Spark job every hour. Or you could just generate rollups on demand, 
>> when it is queried.
>> The whole thing depends on your downstream requirements - if you always to 
>> have up to date rollups to show up in dashboard (even day-level stuff), then 
>> the first approach is better. Otherwise, second and third approaches are 
>> more efficient.
>> 
>> TD
>> 
>> 
>> On Wed, Nov 18, 2015 at 7:15 AM, Sandip Mehta > > wrote:
>> TD thank you for your reply.
>> 
>> I agree on data store requirement. I am using HBase as an underlying store.
>> 
>> So for every batch interval of say 10 seconds
>> 
>> - Calculate the time dimension ( minutes, hours, day, week, month and 
>> quarter ) along with other dimensions and metrics
>> - Update relevant base table at each batch interval for relevant metrics for 
>> a given set of dimensions.
>> 
>> Only caveat I see is I’ll have to update each of the different roll up table 
>> for each batch window.
>> 
>> Is this a valid approach for calculating time series aggregation?
>> 
>> Regards
>> SM
>> 
>> For minutes level aggregates I have set up a streaming window say 10 seconds 
>> and storing minutes level aggregates across multiple dimension in HBase at 
>> every window interval. 
>> 
>>> On 18-Nov-2015, at 7:45 AM, Tathagata Das >> > wrote:
>>> 
>>> For this sort of long term aggregations you should use a dedicated data 
>>> storage systems. Like a database, or a key-value store. Spark Streaming 
>>> would just aggregate and push the necessary data to the data store. 
>>> 
>>> TD
>>> 
>>> On Sat, Nov 14, 2015 at 9:32 PM, Sandip Mehta >> > wrote:
>>> Hi,
>>> 
>>> I am working on requirement of calculating real time metrics and building 
>>> prototype  on Spark streaming. I need to build aggregate at Seconds, 
>>> Minutes, Hours and Day level.
>>> 
>>> I am not sure whether I should calculate all these aggregates as  different 
>>> Windowed function on input DStream or shall I use updateStateByKey function 
>>> for the same. If I have to use updateStateByKey for these time series 
>>> aggregation, how can I remove keys from the state after different time 
>>> lapsed?
>>> 
>>> Please suggest.
>>> 
>>> Regards
>>> SM
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
>>> 
>>> For additional commands, e-mail: user-h...@spark.apache.org 
>>> 
>>> 
>>> 
>> 
>> 
> 
> 
> 
> 
> -- 
> SuperReceptionist is now available on Android mobiles. Track your business on 
> the go with call analytics, recordings, insights and more: Download the app 
> here 


[no subject]

2015-11-19 Thread aman solanki
Hi All,

I want to know how one can get historical data on the jobs, stages, tasks, etc. of
a running Spark application.

Please share the information regarding the same.

Thanks,
Aman Solanki
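
For reference, the usual way to get this is to enable the event log and run the
history server (a minimal sketch; the log directory is a placeholder). While an
application is still running, the same information is also available from its web
UI on port 4040 and, from Spark 1.4 onwards, via the REST API under /api/v1/applications.

    # spark-defaults.conf
    spark.eventLog.enabled           true
    spark.eventLog.dir               hdfs:///spark-event-logs
    spark.history.fs.logDirectory    hdfs:///spark-event-logs

    # then start the history server (UI on port 18080 by default)
    $SPARK_HOME/sbin/start-history-server.sh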


Reading from RabbitMq via Apache Spark Streaming

2015-11-19 Thread D
I am trying to write a simple "Hello World" kind of application using
Spark Streaming and RabbitMQ, in which Apache Spark Streaming will read
messages from RabbitMQ via the RabbitMqReceiver and print them in the
console. But somehow I am not able to print the string read from RabbitMQ
to the console. The Spark Streaming code is printing the message below:


Value Received BlockRDD[1] at ReceiverInputDStream at RabbitMQInputDStream.scala:33
Value Received BlockRDD[2] at ReceiverInputDStream at RabbitMQInputDStream.scala:33


The message is sent to the rabbitmq via the simple code below:-

package helloWorld;

import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

public class Send {

  private final static String QUEUE_NAME = "hello1";

  public static void main(String[] argv) throws Exception {
    ConnectionFactory factory = new ConnectionFactory();
    factory.setHost("localhost");
    Connection connection = factory.newConnection();
    Channel channel = connection.createChannel();

    channel.queueDeclare(QUEUE_NAME, false, false, false, null);
    String message = "Hello World! is a code. Hi Hello World!";
    channel.basicPublish("", QUEUE_NAME, null, message.getBytes("UTF-8"));
    System.out.println(" [x] Sent '" + message + "'");

    channel.close();
    connection.close();
  }
}


I am trying to read messages via Apache Streaming as shown below:-

package rabbitmq.example;

import java.util.*;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

import com.stratio.receiver.RabbitMQUtils;

public class RabbitMqEx {

    public static void main(String[] args) {
        System.out.println("Creating Spark Configuration");
        SparkConf conf = new SparkConf();
        conf.setAppName("RabbitMq Receiver Example");
        conf.setMaster("local[2]");

        System.out.println("Retreiving Streaming Context from Spark Conf");
        JavaStreamingContext streamCtx = new JavaStreamingContext(conf, Durations.seconds(2));

        Map<String, String> rabbitMqConParams = new HashMap<String, String>();
        rabbitMqConParams.put("host", "localhost");
        rabbitMqConParams.put("queueName", "hello1");
        System.out.println("Trying to connect to RabbitMq");
        JavaReceiverInputDStream<String> receiverStream =
            RabbitMQUtils.createJavaStream(streamCtx, rabbitMqConParams);
        receiverStream.foreachRDD(new Function<JavaRDD<String>, Void>() {
            @Override
            public Void call(JavaRDD<String> arg0) throws Exception {
                System.out.println("Value Received " + arg0.toString());
                return null;
            }
        });
        streamCtx.start();
        streamCtx.awaitTermination();
    }
}


The output console only has message like the following:-

Creating Spark Configuration
Retreiving Streaming Context from Spark Conf
Trying to connect to RabbitMq
Value Received BlockRDD[1] at ReceiverInputDStream at RabbitMQInputDStream.scala:33
Value Received BlockRDD[2] at ReceiverInputDStream at RabbitMQInputDStream.scala:33


In the logs I see the following:-

15/11/18 13:20:45 INFO SparkContext: Running Spark version 1.5.2
15/11/18 13:20:45 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/11/18 13:20:45 WARN Utils: Your hostname, jabong1143 resolves to a loopback address: 127.0.1.1; using 192.168.1.3 instead (on interface wlan0)
15/11/18 13:20:45 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
15/11/18 13:20:45 INFO SecurityManager: Changing view acls to: jabong
15/11/18 13:20:45 INFO SecurityManager: Changing modify acls to: jabong
15/11/18 13:20:45 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(jabong); users with modify permissions: Set(jabong)
15/11/18 13:20:46 INFO Slf4jLogger: Slf4jLogger started
15/11/18 13:20:46 INFO Remoting: Starting remoting
15/11/18 13:20:46 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@192.168.1.3:42978]
15/11/18 13:20:46 INFO Utils: Successfully started service 'sparkDriver' on port 42978.
15/11/18 13:20:46 INFO SparkEnv: Registering MapOutputTracker
15/11/18 13:20:46 INFO SparkEnv: Registering BlockManagerMaster
15/11/18 13:20:46 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-9309b35f-a506-49dc-91ab-5c340cd3bdd1
15/11/18 13:20:46 INFO MemoryStore: MemoryStore started with capacity 947.7 MB
15/11/18 13:20:46 INFO HttpFileServer: HTTP File server directory is /tmp/spark-736f4b9c-764c-4b85-9b37-1cece102c95a/httpd-29196fa0-eb3f-4b7d-97ad-35c5325b09e5
15/11/18 13:20:46 INFO HttpServer: Starting HTTP Server
15/11/18 13:20:46 INFO Utils: Successfully started service 'HTTP file server' on port 37150.
15/11/18 13:20:46 INFO SparkEnv: Registering
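
(Note: the "Value Received BlockRDD[...]" output is expected, because arg0.toString()
prints the RDD object itself rather than the records it holds. A minimal sketch of
printing the received messages instead, written here in Scala against a DStream[String]:)

    receiverStream.foreachRDD { rdd =>
      rdd.take(10).foreach(println)   // materialize a few records on the driver and print them
    }
    // or simply
    receiverStream.print()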

Re: doubts on parquet

2015-11-19 Thread Cheng Lian

/cc Spark user list

I'm confused here, you mentioned that you were writing Parquet files 
using MR jobs. What's the relation between that Parquet writing task and 
this JavaPairRDD one? Is it a separate problem?


Spark supports dynamic partitioning (e.g. df.write.partitionBy("col1",
"col2").format("...").save(path)), and there's a spark-avro data source.
If you are writing Avro records to multiple partitions, these two should help.
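
A minimal sketch of the dynamic-partitioning write (the column names and output
path are placeholders, not from the original job):

    df.write
      .partitionBy("year", "month")          // hypothetical partition columns
      .format("parquet")
      .save("hdfs:///warehouse/events")      // hypothetical output path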


Cheng

On 11/19/15 4:30 PM, Shushant Arora wrote:

Thanks Cheng.

I have used AvroParquetOutputFormat and it works fine.
My requirement now is to handle writing to multiple folders at the same
time. Basically, I want to write the JavaPairRDD to multiple folders based
on the final Hive partitions where this RDD will land. Have you used
multiple output formats in Spark?




On Fri, Nov 13, 2015 at 3:56 PM, Cheng Lian > wrote:


Oh I see. Then parquet-avro should probably be more useful. AFAIK,
parquet-hive is only used internally in Hive. I don't see anyone
using it directly.

In general, you can first parse your text data, assemble them into
Avro records, and then write these records to Parquet.

BTW, Spark 1.2 also provides Parquet support. Since you're trying
to convert text data, I guess you probably don't have any nested
data. In that case, Spark 1.2 should be enough. it's not that
Spark 1.2 can't deal with nested data, it's about interoperability
with Hive. Because in the early days, Parquet spec itself didn't
specify how to write nested data. You may refer to this link for
more details:
http://spark.apache.org/docs/1.2.1/sql-programming-guide.html#parquet-files

Cheng


On 11/13/15 6:11 PM, Shushant Arora wrote:

No, I don't have the data loaded in text form into Hive - that was for
understanding the internals of which approach Hive takes.

I want direct writing to parquet file from MR job. For that Hive
Parquet datamodel vs Avro Parquet data model which approach is
better?


On Fri, Nov 13, 2015 at 3:24 PM, Cheng Lian
> wrote:

If you are already able to load the text data into Hive, then
using Hive itself to convert the data is obviously the
easiest and most compatible way. For example:

CREATE TABLE text_table (key INT, value STRING);
LOAD DATA LOCAL INPATH '/tmp/data.txt' INTO TABLE text_table;

CREATE TABLE parquet_table
STORED AS PARQUET
AS SELECT * FROM text_table;

Cheng


On 11/13/15 5:13 PM, Shushant Arora wrote:

Thanks !
so which one is better for dumping text data to hive using
custom MR/spark job - Hive Parquet datamodel using
hivewritable or Avro Parquet datamodel using avro object?

On Fri, Nov 13, 2015 at 12:45 PM, Cheng Lian
> wrote:

ParquetOutputFormat is not a data model. A data model
provides a WriteSupport to ParquetOutputFormat to tell
Parquet how to convert upper level domain objects (Hive
Writables, Avro records, etc.) to Parquet records. So
all other data models uses it for writing Parquet files.

Hive does have a Parquet data model. If you create a
Parquet table in Hive like "CREATE TABLE t (key INT,
value STRING) STORED AS PARQUET", it invokes the Hive
Parquet data model when reading/write table t. In the
case you mentioned, records in the text table are
firstly extracted out by Hive into Hive Writables, and
then the Hive Parquet data model converts those
Writables into Parquet records.

Cheng


On 11/13/15 2:37 PM, Shushant Arora wrote:

Thanks Cheng.

I have spark version 1.2 deployed on my cluster so for
the time being I cannot use direct spark sql functionality.
I will try with AvroParquetOutputFormat. Just want to
know how AvroParquetOutputFormat is better than direct
ParquetOutputFormat ? And also is there any hive object
model - I mean when I create a parquet table in hive
and insert data in that table using text table which
object model does hive uses internally?

Thanks
Shushant

On Fri, Nov 13, 2015 at 9:14 AM, Cheng Lian
>
wrote:

If I understand your question correctly, you are
trying to write Parquet files using a specific
Parquet data model, and expect to load them into
Hive, right?

  

ClassNotFound for exception class in Spark 1.5.x

2015-11-19 Thread Zsolt Tóth
Hi,

I try to throw an exception of my own exception class (MyException extends
SparkException) on one of the executors. This works fine on Spark 1.3.x,
1.4.x but throws a deserialization/ClassNotFound exception on Spark 1.5.x.
This happens only when I throw it on an executor, on the driver it
succeeds. I'm using Spark in yarn-cluster mode.

Is this a known issue? Is there any workaround for it?

StackTrace:

15/11/18 15:00:17 WARN spark.ThrowableSerializationWrapper: Task
exception could not be deserialized
java.lang.ClassNotFoundException: org.example.MyException
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:270)
at 
org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:67)
at 
java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at 
org.apache.spark.ThrowableSerializationWrapper.readObject(TaskEndReason.scala:163)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at 
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:72)
at 
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:98)
at 
org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$2.apply$mcV$sp(TaskResultGetter.scala:108)
at 
org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$2.apply(TaskResultGetter.scala:105)
at 
org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$2.apply(TaskResultGetter.scala:105)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
at 
org.apache.spark.scheduler.TaskResultGetter$$anon$3.run(TaskResultGetter.scala:105)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
15/11/18 15:00:17 ERROR scheduler.TaskResultGetter: Could not
deserialize TaskEndReason: ClassNotFound with classloader
org.apache.spark.util.MutableURLClassLoader@7578da02
15/11/18 15:00:17 WARN scheduler.TaskSetManager: Lost task 0.0 in
stage 1.0 (TID 30, hadoop2.local.dmlab.hu): UnknownReason

Regards,
Zsolt


Re: WARN LoadSnappy: Snappy native library not loaded

2015-11-19 Thread David Rosenstrauch
I ran into this recently.  Turned out we had an old 
org-xerial-snappy.properties file in one of our conf directories that 
had the setting:


# Disables loading Snappy-Java native library bundled in the
# snappy-java-*.jar file forcing to load the Snappy-Java native
# library from the java.library.path.
#
org.xerial.snappy.disable.bundled.libs=true

When I switched that to false, it made the problem go away.

May or may not be your problem of course, but worth a look.

HTH,

DR

On 11/17/2015 05:22 PM, Andy Davidson wrote:

I started a Spark POC. I created an EC2 cluster on AWS using spark-ec2. I have
3 slaves. In general I am running into trouble even with small workloads. I
am using IPython notebooks running on my spark cluster. Everything is
painfully slow. I am using the standAlone cluster manager. I noticed that I
am getting the following warning on my driver console. Any idea what the
problem might be?



15/11/17 22:01:59 WARN MetricsSystem: Using default name DAGScheduler for
source because spark.app.id is not set.

15/11/17 22:03:05 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable

15/11/17 22:03:05 WARN LoadSnappy: Snappy native library not loaded



Here is an overview of my POS app. I have a file on hdfs containing about
5000 twitter status strings.

tweetStrings = sc.textFile(dataURL)
jTweets = (tweetStrings.map(lambda x: json.loads(x)).take(10))

Generated the following error: "error occurred while calling o78.partitions.:
java.lang.OutOfMemoryError: GC overhead limit exceeded"

Any idea what we need to do to improve new Spark users' out of the box
experience?

Kind regards

Andy

export PYSPARK_PYTHON=python3.4

export PYSPARK_DRIVER_PYTHON=python3.4

export IPYTHON_OPTS="notebook --no-browser --port=7000 --log-level=WARN"

MASTER_URL=spark://ec2-55-218-207-122.us-west-1.compute.amazonaws.com:7077


numCores=2

$SPARK_ROOT/bin/pyspark --master $MASTER_URL --total-executor-cores
$numCores $*








-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: PySpark Lost Executors

2015-11-19 Thread Ross.Cramblit
Thank you Ted and Sandy for getting me pointed in the right direction. From the 
logs:

WARN yarn.YarnAllocator: Container killed by YARN for exceeding memory limits. 
25.4 GB of 25.3 GB physical memory used. Consider boosting 
spark.yarn.executor.memoryOverhead.
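
(For reference, a minimal example of boosting it; the value is in megabytes and
purely illustrative:)

spark-submit --conf spark.yarn.executor.memoryOverhead=4096 ...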


On Nov 19, 2015, at 12:20 PM, Ted Yu 
> wrote:

Here are the parameters related to log aggregation :


<property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
</property>

<property>
  <name>yarn.log-aggregation.retain-seconds</name>
  <value>2592000</value>
</property>

<property>
  <name>yarn.nodemanager.log-aggregation.compression-type</name>
  <value>gz</value>
</property>

<property>
  <name>yarn.nodemanager.log-aggregation.debug-enabled</name>
  <value>false</value>
</property>

<property>
  <name>yarn.nodemanager.log-aggregation.num-log-files-per-app</name>
  <value>30</value>
</property>

<property>
  <name>yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds</name>
  <value>-1</value>
</property>


On Thu, Nov 19, 2015 at 8:14 AM, 
> 
wrote:
Hmm I guess I do not - I get 'application_1445957755572_0176 does not have any 
log files.’ Where can I enable log aggregation?
On Nov 19, 2015, at 11:07 AM, Ted Yu 
> wrote:

Do you have YARN log aggregation enabled ?

You can try retrieving log for the container using the following command:

yarn logs -applicationId application_1445957755572_0176 -containerId 
container_1445957755572_0176_01_03

Cheers

On Thu, Nov 19, 2015 at 8:02 AM, 
> 
wrote:
I am running Spark 1.5.2 on Yarn. My job consists of a number of SparkSQL 
transforms on a JSON data set that I load into a data frame. The data set is 
not large (~100GB) and most stages execute without any issues. However, some 
more complex stages tend to lose executors/nodes regularly. What would cause 
this to happen? The logs don’t give too much information -

15/11/19 15:53:43 ERROR YarnScheduler: Lost executor 2 on 
ip-10-0-0-136.ec2.internal: Yarn deallocated the executor 2 (container 
container_1445957755572_0176_01_03)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 142.0 in stage 33.0 (TID 8331, 
ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 133.0 in stage 33.0 (TID 8322, 
ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 79.0 in stage 33.0 (TID 8268, 
ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 141.0 in stage 33.0 (TID 8330, 
ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 123.0 in stage 33.0 (TID 8312, 
ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 162.0 in stage 33.0 (TID 8351, 
ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 153.0 in stage 33.0 (TID 8342, 
ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 120.0 in stage 33.0 (TID 8309, 
ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 149.0 in stage 33.0 (TID 8338, 
ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 134.0 in stage 33.0 (TID 8323, 
ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
[Stage 33:===> (117 + 50) / 
200]15/11/19 15:53:46 WARN ReliableDeliverySupervisor: Association with remote 
system [akka.tcp://sparkExecutor@ip-10-0-0-136.ec2.internal:60275] has failed, 
address is now gated for [5000] ms. Reason: [Disassociated]

 - Followed by a list of lost tasks on each executor.






External Table not getting updated from parquet files written by spark streaming

2015-11-19 Thread Abhishek Anand
Hi ,

I am using spark streaming to write the aggregated output as parquet files
to the hdfs using SaveMode.Append. I have an external table created like :


CREATE TABLE if not exists rolluptable
USING org.apache.spark.sql.parquet
OPTIONS (
  path "hdfs:"
);

I was under the impression that, for an external table, queries should also fetch
the data from newly added parquet files. But it seems like the newly
written files are not being picked up.

Dropping and recreating the table every time works fine but not a solution.


Please suggest how can my table have the data from newer files also.



Thanks !!
Abhi
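
(One thing that may help here, though untested in this setup: the Parquet data
source caches table metadata, so the table may need an explicit refresh after new
files are appended. A minimal sketch, assuming a HiveContext named sqlContext:)

    sqlContext.refreshTable("rolluptable")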


Re: Invocation of StreamingContext.stop() hangs in 1.5

2015-11-19 Thread jiten
Hi,

Thanks to Ted Yu and Nilanjan. Stopping the streaming context
asynchronously did the trick!
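
(For the archive, a minimal sketch of what "asynchronously" means here, assuming
the stop is triggered from a separate thread rather than from the streaming job
itself; adjust the flags to your needs:)

    new Thread("streaming-context-stopper") {
      override def run(): Unit = {
        ssc.stop(stopSparkContext = true, stopGracefully = true)
      }
    }.start()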

Thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Invocation-of-StreamingContext-stop-hangs-in-1-5-tp25402p25434.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Shuffle performance tuning. How to tune netty?

2015-11-19 Thread t3l
I am facing a very tricky issue here. I have a treeReduce task. The
reduce-function returns a very large object. In fact it is a Map[Int,
Array[Double]]. Each reduce task inserts and/or updates values into the map
or updates the array. My problem is, that this Map can become very large.
Currently, I am looking at about 500 MB (serialized size). The performance
of the entire reduce task is incredibly slow. While my reduce function takes
only about 10 seconds to execute, the shuffle-subsystem of Spark takes very
long. My task returns after about 100-300 seconds. This even happens if I
just have 2 nodes with 2 worker cores. So the only thing spark would have to
do is to send the 500 MB over the network (both machines are connected via
Gigabit Ethernet) which should take a couple of seconds.

It is also interesting to note that if I choose "nio" as the block transport
manager, the speed is very good - only a couple of seconds, as expected. But I
just discovered that "nio" support is discontinued. So, how can I get good
performance for such a usage scenario (large objects, treeReduce, not very
many nodes) with netty?
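
(Two things that may be worth experimenting with, offered only as suggestions:
register the large Map for Kryo serialization to shrink the serialized payload,
and increase the treeReduce depth so more partial merging happens before results
are shipped. A minimal sketch; the reduce function is a stand-in for the existing one:)

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

    // mergeMaps stands in for the existing reduce function; the depth value is illustrative
    def mergeMaps(a: Map[Int, Array[Double]],
                  b: Map[Int, Array[Double]]): Map[Int, Array[Double]] = ???
    val merged = rdd.treeReduce(mergeMaps, depth = 3)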



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Shuffle-performance-tuning-How-to-tune-netty-tp25433.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Reading from RabbitMq via Apache Spark Streaming

2015-11-19 Thread Sabarish Sasidharan
The stack trace is clear enough:

Caused by: com.rabbitmq.client.ShutdownSignalException: channel error;
protocol method: #method(reply-code=406,
reply-text=PRECONDITION_FAILED - inequivalent arg 'durable' for queue
'hello1' in vhost '/': received 'true' but current is 'false', class-id=50,
method-id=10)

Regards
Sab
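
(In other words, the receiver is declaring the queue with durable=true while the
producer declared it with durable=false. A minimal sketch of one fix, shown here in
Scala against the same RabbitMQ Java client: declare the queue as durable on the
producer side too, or delete and re-declare it so both sides agree.)

    // queueDeclare(queue, durable, exclusive, autoDelete, arguments)
    channel.queueDeclare("hello1", true, false, false, null)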
On 19-Nov-2015 2:32 pm, "D"  wrote:

> I am trying to write a simple "Hello World" kind of application using
> spark streaming and RabbitMq, in which Apache Spark Streaming will read
> message from RabbitMq via the RabbitMqReceiver
>  and print it in the
> console. But some how I am not able to print the string read from Rabbit Mq
> into console. The spark streaming code is printing the message below:-
>
> Value Received BlockRDD[1] at ReceiverInputDStream at 
> RabbitMQInputDStream.scala:33
> Value Received BlockRDD[2] at ReceiverInputDStream at 
> RabbitMQInputDStream.scala:33
>
>
> The message is sent to the rabbitmq via the simple code below:-
>
> package helloWorld;
>
> import com.rabbitmq.client.Channel;
> import com.rabbitmq.client.Connection;
> import com.rabbitmq.client.ConnectionFactory;
>
> public class Send {
>
>   private final static String QUEUE_NAME = "hello1";
>
>   public static void main(String[] argv) throws Exception {
> ConnectionFactory factory = new ConnectionFactory();
> factory.setHost("localhost");
> Connection connection = factory.newConnection();
> Channel channel = connection.createChannel();
>
> channel.queueDeclare(QUEUE_NAME, false, false, false, null);
> String message = "Hello World! is a code. Hi Hello World!";
> channel.basicPublish("", QUEUE_NAME, null, message.getBytes("UTF-8"));
> System.out.println(" [x] Sent '" + message + "'");
>
> channel.close();
> connection.close();
>   }
> }
>
>
> I am trying to read messages via Apache Streaming as shown below:-
>
> package rabbitmq.example;
>
> import java.util.*;
>
> import org.apache.spark.SparkConf;
> import org.apache.spark.api.java.JavaRDD;
> import org.apache.spark.api.java.function.Function;
> import org.apache.spark.streaming.Durations;
> import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
> import org.apache.spark.streaming.api.java.JavaStreamingContext;
>
> import com.stratio.receiver.RabbitMQUtils;
>
> public class RabbitMqEx {
>
> public static void main(String[] args) {
> System.out.println("CreatingSpark   Configuration");
> SparkConf conf = new SparkConf();
> conf.setAppName("RabbitMq Receiver Example");
> conf.setMaster("local[2]");
>
> System.out.println("Retreiving  Streaming   Context fromSpark   
> Conf");
> JavaStreamingContext streamCtx = new JavaStreamingContext(conf,
> Durations.seconds(2));
>
> Map<String, String> rabbitMqConParams = new HashMap<String, String>();
> rabbitMqConParams.put("host", "localhost");
> rabbitMqConParams.put("queueName", "hello1");
> System.out.println("Trying to connect to RabbitMq");
> JavaReceiverInputDStream<String> receiverStream = 
> RabbitMQUtils.createJavaStream(streamCtx, rabbitMqConParams);
> receiverStream.foreachRDD(new Function<JavaRDD<String>, Void>() {
> @Override
> public Void call(JavaRDD<String> arg0) throws Exception {
> System.out.println("Value Received " + arg0.toString());
> return null;
> }
> } );
> streamCtx.start();
> streamCtx.awaitTermination();
> }
> }
>
> The output console only has message like the following:-
>
> CreatingSpark   Configuration
> Retreiving  Streaming   Context fromSpark   Conf
> Trying to connect to RabbitMq
> Value Received BlockRDD[1] at ReceiverInputDStream at 
> RabbitMQInputDStream.scala:33
> Value Received BlockRDD[2] at ReceiverInputDStream at 
> RabbitMQInputDStream.scala:33
>
>
> In the logs I see the following:-
>
> 15/11/18 13:20:45 INFO SparkContext: Running Spark version 1.5.2
> 15/11/18 13:20:45 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 15/11/18 13:20:45 WARN Utils: Your hostname, jabong1143 resolves to a 
> loopback address: 127.0.1.1; using 192.168.1.3 instead (on interface wlan0)
> 15/11/18 13:20:45 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to 
> another address
> 15/11/18 13:20:45 INFO SecurityManager: Changing view acls to: jabong
> 15/11/18 13:20:45 INFO SecurityManager: Changing modify acls to: jabong
> 15/11/18 13:20:45 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(jabong); users 
> with modify permissions: Set(jabong)
> 15/11/18 13:20:46 INFO Slf4jLogger: Slf4jLogger started
> 15/11/18 13:20:46 INFO Remoting: Starting remoting
> 15/11/18 13:20:46 INFO Remoting: Remoting started; listening on 

Spark Tasks on second node never return in Yarn when I have more than 1 task node

2015-11-19 Thread Shuai Zheng
Hi All,

 

I am facing a very weird case. I have already simplified the scenario as much as
possible so everyone can replay it.

 

My env:

 

AWS EMR 4.1.0, Spark1.5

 

My code runs without any problem in local mode, and it also runs fine on an
EMR cluster with one master and one task node.

 

But when I try to run on a multi-node cluster (more than 1 task node, i.e. a
3-node cluster), the tasks on one of the nodes never return.

 

The log as below:

 

15/11/19 21:19:07 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID
0, ip-10-165-121-188.ec2.internal, PROCESS_LOCAL, 2241 bytes)

15/11/19 21:19:07 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID
1, ip-10-155-160-147.ec2.internal, PROCESS_LOCAL, 2241 bytes)

15/11/19 21:19:07 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID
2, ip-10-165-121-188.ec2.internal, PROCESS_LOCAL, 2241 bytes)

15/11/19 21:19:07 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID
3, ip-10-155-160-147.ec2.internal, PROCESS_LOCAL, 2241 bytes)

 

So you can see the tasks are submitted alternately to two instances, one
is ip-10-165-121-188 and the other is ip-10-155-160-147.

Later, only the tasks that run on ip-10-165-121-188.ec2 finish; the ones on
ip-10-155-160-147.ec2 just wait there and never return.

 

The data and code have been tested in local mode and in single-node Spark cluster
mode, so it should not be an issue with the logic or the data.

 

And I have attached my test case here (I believe it is simple enough, and no
business logic is involved):

 

   public void createSiteGridExposure2() {
      JavaSparkContext ctx = this.createSparkContextTest("Test");
      ctx.textFile(siteEncodeLocation).flatMapToPair(new PairFlatMapFunction<String, String, String>() {
         @Override
         public Iterable<Tuple2<String, String>> call(String line) throws Exception {
            List<Tuple2<String, String>> res = new ArrayList<Tuple2<String, String>>();
            return res;
         }
      }).collectAsMap();
      ctx.stop();
   }

   protected JavaSparkContext createSparkContextTest(String appName) {
      SparkConf sparkConf = new SparkConf().setAppName(appName);
      JavaSparkContext ctx = new JavaSparkContext(sparkConf);
      Configuration hadoopConf = ctx.hadoopConfiguration();
      if (awsAccessKeyId != null) {
         hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem");
         hadoopConf.set("fs.s3.awsAccessKeyId", awsAccessKeyId);
         hadoopConf.set("fs.s3.awsSecretAccessKey", awsSecretAccessKey);
         hadoopConf.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem");
         hadoopConf.set("fs.s3n.awsAccessKeyId", awsAccessKeyId);
         hadoopConf.set("fs.s3n.awsSecretAccessKey", awsSecretAccessKey);
      }
      return ctx;
   }

 

 

Does anyone have any idea why this happens? I am a bit lost because the code
works in local mode and on a 2-node (1 master, 1 task) cluster, but when it moves
to a cluster with multiple task nodes I have this issue. No error, no exception,
not even a timeout (I waited more than an hour and there was no timeout
either).

 

Regards,

 

Shuai



Re: spark-submit is throwing NPE when trying to submit a random forest model

2015-11-19 Thread Joseph Bradley
Hi,
Could you please submit this via JIRA as a bug report?  It will be very
helpful if you include the Spark version, system details, and other info
too.
Thanks!
Joseph

On Thu, Nov 19, 2015 at 1:21 PM, Rachana Srivastava <
rachana.srivast...@markmonitor.com> wrote:

> *Issue:*
>
> I have a random forest model that I am trying to load during streaming using the
> following code.  The code works fine when I run it from Eclipse, but I get an NPE
> when running the code using spark-submit.
>
>
>
> JavaStreamingContext jssc = new JavaStreamingContext(jsc, Durations.seconds(duration));
>
> System.out.println("& trying to get the context &&& ");
>
> final RandomForestModel model = RandomForestModel.load(jssc.sparkContext().sc(),
> MODEL_DIRECTORY); // line 116 causing the issue
>
> System.out.println("& model debug &&& " + model.toDebugString());
>
>
>
>
>
> *Exception Details:*
>
> INFO : org.apache.spark.scheduler.TaskSchedulerImpl - Removed TaskSet 2.0,
> whose tasks have all completed, from pool
>
> Exception in thread "main" java.lang.NullPointerException
>
> at
> org.apache.spark.mllib.tree.model.DecisionTreeModel$SaveLoadV1_0$SplitData.toSplit(DecisionTreeModel.scala:144)
>
> at
> org.apache.spark.mllib.tree.model.DecisionTreeModel$SaveLoadV1_0$$anonfun$16.apply(DecisionTreeModel.scala:291)
>
> at
> org.apache.spark.mllib.tree.model.DecisionTreeModel$SaveLoadV1_0$$anonfun$16.apply(DecisionTreeModel.scala:291)
>
> at scala.Option.map(Option.scala:145)
>
> at
> org.apache.spark.mllib.tree.model.DecisionTreeModel$SaveLoadV1_0$.constructNode(DecisionTreeModel.scala:291)
>
> at
> org.apache.spark.mllib.tree.model.DecisionTreeModel$SaveLoadV1_0$.constructNode(DecisionTreeModel.scala:286)
>
> at
> org.apache.spark.mllib.tree.model.DecisionTreeModel$SaveLoadV1_0$.constructNode(DecisionTreeModel.scala:287)
>
> at
> org.apache.spark.mllib.tree.model.DecisionTreeModel$SaveLoadV1_0$.constructNode(DecisionTreeModel.scala:286)
>
> at
> org.apache.spark.mllib.tree.model.DecisionTreeModel$SaveLoadV1_0$.constructTree(DecisionTreeModel.scala:268)
>
> at
> org.apache.spark.mllib.tree.model.DecisionTreeModel$SaveLoadV1_0$$anonfun$12.apply(DecisionTreeModel.scala:251)
>
> at
> org.apache.spark.mllib.tree.model.DecisionTreeModel$SaveLoadV1_0$$anonfun$12.apply(DecisionTreeModel.scala:250)
>
> at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>
> at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>
> at
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>
> at
> scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>
> at
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>
> at
> scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
>
> at
> org.apache.spark.mllib.tree.model.DecisionTreeModel$SaveLoadV1_0$.constructTrees(DecisionTreeModel.scala:250)
>
> at
> org.apache.spark.mllib.tree.model.TreeEnsembleModel$SaveLoadV1_0$.loadTrees(treeEnsembleModels.scala:340)
>
> at
> org.apache.spark.mllib.tree.model.RandomForestModel$.load(treeEnsembleModels.scala:72)
>
> at
> org.apache.spark.mllib.tree.model.RandomForestModel.load(treeEnsembleModels.scala)
>
> at
> com.markmonitor.antifraud.ce.KafkaURLStreaming.main(KafkaURLStreaming.java:116)
>
> at
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>
> at java.lang.reflect.Method.invoke(Method.java:606)
>
> at
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
>
> at
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
>
> at
> org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
>
> at
> org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
>
> at
> org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>
> Nov 19, 2015 1:10:56 PM WARNING: parquet.hadoop.ParquetRecordReader: Can
> not initialize counter due 

Re: create a table for csv files

2015-11-19 Thread Andrew Or
There's not an easy way. The closest thing you can do is:

import org.apache.spark.sql.functions._

val df = ...
df.withColumn("id", monotonicallyIncreasingId())

-Andrew
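
For completeness, a minimal sketch that reads the CSV with the spark-csv package
(added via --packages) and appends the column; note that the generated ids are
unique and increasing but not consecutive:

    import org.apache.spark.sql.functions._

    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .load("hdfs:///path/to/csv")            // hypothetical path
    val withId = df.withColumn("id", monotonicallyIncreasingId())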

2015-11-19 8:23 GMT-08:00 xiaohe lan :

> Hi,
>
> I have some csv files in HDFS with headers like col1, col2, col3. I want to
> add a column named id, so that a record would be 
>
> How can I do this using Spark SQL ? Can id be auto increment ?
>
> Thanks,
> Xiaohe
>


Re: Spark Tasks on second node never return in Yarn when I have more than 1 task node

2015-11-19 Thread Jonathan Kelly
I don't know if this actually has anything to do with why your job is
hanging, but since you are using EMR you should probably not set those
fs.s3 properties but rather let it use EMRFS, EMR's optimized Hadoop
FileSystem implementation for interacting with S3. One benefit is that it
will automatically pick up your AWS credentials from your EC2 instance role
rather than you having to configure them manually (since doing so is
insecure because you have to get the secret access key onto your instance).

If simply making that change does not fix the issue, a jstack of the hung
process would help you figure out what it is doing. You should also look at
the YARN container logs (which automatically get uploaded to your S3 logs
bucket if you have this enabled).

~ Jonathan
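
(A minimal sketch of grabbing that jstack on the stuck node; the process name and
user are assumptions for a YARN executor:)

    # on the node whose tasks never return
    jps -lm | grep CoarseGrainedExecutorBackend   # find the executor's pid
    sudo -u yarn jstack <pid>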

On Thu, Nov 19, 2015 at 1:32 PM, Shuai Zheng  wrote:

> Hi All,
>
>
>
> I face a very weird case. I have already simplify the scenario to the most
> so everyone can replay the scenario.
>
>
>
> My env:
>
>
>
> AWS EMR 4.1.0, Spark1.5
>
>
>
> My code can run without any problem when I run it in a local mode, and it
> has no problem when it run on a EMR cluster with one master and one task
> node.
>
>
>
> But when I try to run a multiple node (more than 1 task node, which means
> 3 nodes cluster), the tasks will never return from one of it.
>
>
>
> The log as below:
>
>
>
> 15/11/19 21:19:07 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID
> 0, ip-10-165-121-188.ec2.internal, PROCESS_LOCAL, 2241 bytes)
>
> 15/11/19 21:19:07 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID
> 1, ip-10-155-160-147.ec2.internal, PROCESS_LOCAL, 2241 bytes)
>
> 15/11/19 21:19:07 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID
> 2, ip-10-165-121-188.ec2.internal, PROCESS_LOCAL, 2241 bytes)
>
> 15/11/19 21:19:07 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID
> 3, ip-10-155-160-147.ec2.internal, PROCESS_LOCAL, 2241 bytes)
>
>
>
> So you can see the task will alternatively submitted to two instances, one
> is ip-10-165-121-188 and another is ip-10-155-160-147.
>
> And later only the tasks runs on the ip-10-165-121-188.ec2 will finish
> will always just wait there, ip-10-155-160-147.ec2 never return.
>
>
>
> The data and code has been tested in local mode, single spark cluster
> mode, so it should not be an issue on logic or data.
>
>
>
> And I have attached my test case here (I believe it is simple enough and
> no any business logic is involved):
>
>
>
>*public* *void* createSiteGridExposure2() {
>
>   JavaSparkContext ctx = *this*.createSparkContextTest("Test"
> );
>
>   ctx.textFile(siteEncodeLocation).flatMapToPair(*new* 
> *PairFlatMapFunction String, String>()* {
>
>  @Override
>
>  *public* Iterable>
> call(String line) *throws* Exception {
>
>List> res = *new*
> ArrayList>();
>
>*return* res;
>
>  }
>
>   }).collectAsMap();
>
>   ctx.stop();
>
>}
>
>
>
> *protected* JavaSparkContext createSparkContextTest(String appName) {
>
>   SparkConf sparkConf = *new* SparkConf().setAppName(appName);
>
>
>
>   JavaSparkContext ctx = *new* JavaSparkContext(sparkConf);
>
>   Configuration hadoopConf = ctx.hadoopConfiguration();
>
>   *if* (awsAccessKeyId != *null*) {
>
>
>
>  hadoopConf.set("fs.s3.impl",
> "org.apache.hadoop.fs.s3native.NativeS3FileSystem");
>
>  hadoopConf.set("fs.s3.awsAccessKeyId", awsAccessKeyId
> );
>
>  hadoopConf.set("fs.s3.awsSecretAccessKey",
> awsSecretAccessKey);
>
>
>
>  hadoopConf.set("fs.s3n.impl",
> "org.apache.hadoop.fs.s3native.NativeS3FileSystem");
>
>  hadoopConf.set("fs.s3n.awsAccessKeyId",
> awsAccessKeyId);
>
>  hadoopConf.set("fs.s3n.awsSecretAccessKey",
> awsSecretAccessKey);
>
>   }
>
>   *return* ctx;
>
>}
>
>
>
>
>
> Anyone has any idea why this happened? I am a bit lost because the code
> works in local mode and 2 node (1 master 1 task) clusters, but when it move
> a multiple task nodes cluster, I have this issue. No error no exception,
> not even timeout (because I wait more than 1 hours and there is no timeout
> also).
>
>
>
> Regards,
>
>
>
> Shuai
>


newbie: unable to use all my cores and memory

2015-11-19 Thread Andy Davidson
I am having a heck of a time figuring out how to utilize my cluster
effectively. I am using the stand alone cluster manager. I have a master
and 3 slaves. Each machine has 2 cores.

I am trying to run a streaming app in cluster mode and pyspark at the same
time.

t1) On my console I see

* Alive Workers: 3
* Cores in use: 6 Total, 0 Used
* Memory in use: 18.8 GB Total, 0.0 B Used
* Applications: 0 Running, 15 Completed
* Drivers: 0 Running, 2 Completed
* Status: ALIVE

t2) I start my streaming app

$SPARK_ROOT/bin/spark-submit \
--class "com.pws.spark.streaming.IngestDriver" \
--master $MASTER_URL \
--total-executor-cores 2 \
--deploy-mode cluster \
$jarPath --clusterMode  $*

t3) on my console I see

* Alive Workers: 3
* Cores in use: 6 Total, 3 Used
* Memory in use: 18.8 GB Total, 13.0 GB Used
* Applications: 1 Running, 15 Completed
* Drivers: 1 Running, 2 Completed
* Status: ALIVE

Looks like pyspark should be able to use the 3 remaining cores and 5.8 GB
of memory

t4) I start pyspark

export PYSPARK_PYTHON=python3.4
export PYSPARK_DRIVER_PYTHON=python3.4
export IPYTHON_OPTS="notebook --no-browser --port=7000 --log-level=WARN"

$SPARK_ROOT/bin/pyspark --master $MASTER_URL --total-executor-cores 3
--executor-memory 2g

t5) on my console I see

* Alive Workers: 3
* Cores in use: 6 Total, 4 Used
* Memory in use: 18.8 GB Total, 15.0 GB Used
* Applications: 2 Running, 18 Completed
* Drivers: 1 Running, 2 Completed
* Status: ALIVE


I have 2 unused cores and a lot of memory left over. My pyspark
application is only getting 1 core. If the streaming app is not running,
pyspark is assigned 2 cores, each on a different worker. I have tried
various combinations of --executor-cores and --total-executor-cores.
Any idea how to get pyspark to use more cores and memory?
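
(One combination that may be worth trying, since a standalone executor by default
grabs all of a worker's free cores unless --executor-cores is set; the values below
are purely illustrative:)

$SPARK_ROOT/bin/pyspark --master $MASTER_URL \
  --total-executor-cores 3 --executor-cores 1 --executor-memory 1g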


Kind regards

Andy

P.S. Using different values I have wound up with pyspark status ==
"waiting". I think this is because there are not enough cores available?




-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Blocked REPL commands

2015-11-19 Thread Jacek Laskowski
Hi,

Dunno the answer, but :reset should be blocked, too, for obvious reasons.

➜  spark git:(master) ✗ ./bin/spark-shell
...
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.0-SNAPSHOT
  /_/

Using Scala version 2.11.7 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_66)
Type in expressions to have them evaluated.
Type :help for more information.

scala> :reset
Resetting interpreter state.
Forgetting this session history:


 @transient val sc = {
   val _sc = org.apache.spark.repl.Main.createSparkContext()
   println("Spark context available as sc.")
   _sc
 }


 @transient val sqlContext = {
   val _sqlContext = org.apache.spark.repl.Main.createSQLContext()
   println("SQL context available as sqlContext.")
   _sqlContext
 }

import org.apache.spark.SparkContext._
import sqlContext.implicits._
import sqlContext.sql
import org.apache.spark.sql.functions._
...

scala> import org.apache.spark._
import org.apache.spark._

scala> val sc = new SparkContext("local[*]", "shell", new SparkConf)
...
org.apache.spark.SparkException: Only one SparkContext may be running
in this JVM (see SPARK-2243). To ignore this error, set
spark.driver.allowMultipleContexts = true. The currently running
SparkContext was created at:
org.apache.spark.SparkContext.<init>(SparkContext.scala:82)
...

Guess I should file an issue?

Pozdrawiam,
Jacek

--
Jacek Laskowski | https://medium.com/@jaceklaskowski/ |
http://blog.jaceklaskowski.pl
Mastering Apache Spark
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/
Follow me at https://twitter.com/jaceklaskowski
Upvote at http://stackoverflow.com/users/1305344/jacek-laskowski


On Thu, Nov 19, 2015 at 8:44 PM, Jakob Odersky  wrote:
> I was just going through the spark shell code and saw this:
>
> private val blockedCommands = Set("implicits", "javap", "power", "type",
> "kind")
>
> What is the reason as to why these commands are blocked?
>
> thanks,
> --Jakob

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



spark-submit is throwing NPE when trying to submit a random forest model

2015-11-19 Thread Rachana Srivastava
Issue:
I have a random forest model that I am trying to load during streaming using the
following code. The code works fine when I run it from Eclipse, but I get an NPE
when running it with spark-submit.

JavaStreamingContext jssc = new JavaStreamingContext(jsc, 
Durations.seconds(duration));
System.out.println("& trying to get the context 
&&& " );
final RandomForestModel model = 
RandomForestModel.load(jssc.sparkContext().sc(), MODEL_DIRECTORY);//line 116 
causing the issue.
System.out.println("& model debug &&& " 
+ model.toDebugString());


Exception Details:
INFO : org.apache.spark.scheduler.TaskSchedulerImpl - Removed TaskSet 2.0, 
whose tasks have all completed, from pool
Exception in thread "main" java.lang.NullPointerException
at 
org.apache.spark.mllib.tree.model.DecisionTreeModel$SaveLoadV1_0$SplitData.toSplit(DecisionTreeModel.scala:144)
at 
org.apache.spark.mllib.tree.model.DecisionTreeModel$SaveLoadV1_0$$anonfun$16.apply(DecisionTreeModel.scala:291)
at 
org.apache.spark.mllib.tree.model.DecisionTreeModel$SaveLoadV1_0$$anonfun$16.apply(DecisionTreeModel.scala:291)
at scala.Option.map(Option.scala:145)
at 
org.apache.spark.mllib.tree.model.DecisionTreeModel$SaveLoadV1_0$.constructNode(DecisionTreeModel.scala:291)
at 
org.apache.spark.mllib.tree.model.DecisionTreeModel$SaveLoadV1_0$.constructNode(DecisionTreeModel.scala:286)
at 
org.apache.spark.mllib.tree.model.DecisionTreeModel$SaveLoadV1_0$.constructNode(DecisionTreeModel.scala:287)
at 
org.apache.spark.mllib.tree.model.DecisionTreeModel$SaveLoadV1_0$.constructNode(DecisionTreeModel.scala:286)
at 
org.apache.spark.mllib.tree.model.DecisionTreeModel$SaveLoadV1_0$.constructTree(DecisionTreeModel.scala:268)
at 
org.apache.spark.mllib.tree.model.DecisionTreeModel$SaveLoadV1_0$$anonfun$12.apply(DecisionTreeModel.scala:251)
at 
org.apache.spark.mllib.tree.model.DecisionTreeModel$SaveLoadV1_0$$anonfun$12.apply(DecisionTreeModel.scala:250)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at 
scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at 
scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at 
scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at 
org.apache.spark.mllib.tree.model.DecisionTreeModel$SaveLoadV1_0$.constructTrees(DecisionTreeModel.scala:250)
at 
org.apache.spark.mllib.tree.model.TreeEnsembleModel$SaveLoadV1_0$.loadTrees(treeEnsembleModels.scala:340)
at 
org.apache.spark.mllib.tree.model.RandomForestModel$.load(treeEnsembleModels.scala:72)
at 
org.apache.spark.mllib.tree.model.RandomForestModel.load(treeEnsembleModels.scala)
at 
com.markmonitor.antifraud.ce.KafkaURLStreaming.main(KafkaURLStreaming.java:116)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native 
Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
at 
org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
at 
org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
at 
org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Nov 19, 2015 1:10:56 PM WARNING: parquet.hadoop.ParquetRecordReader: Can not 
initialize counter due to context is not a instance of TaskInputOutputContext, 
but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl

Spark Source Code:
case class PredictData(predict: Double, prob: Double) {
  def toPredict: Predict = new Predict(predict, prob)
}

Thanks,

Rachana
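
For context, the following is a minimal, self-contained sketch of the MLlib 1.5
save/load round trip that line 116 exercises, outside of streaming. The paths,
parameters and object name are placeholders, not part of the original report;
if this stand-alone version also fails under spark-submit, the problem lies with
the persisted model directory rather than with the streaming setup.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.mllib.util.MLUtils

object RandomForestSaveLoadCheck {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rf-save-load-check"))

    // Train a small model on LibSVM-formatted data (placeholder path), save it, load it back.
    val data = MLUtils.loadLibSVMFile(sc, "hdfs:///tmp/sample_libsvm_data.txt")
    val model = RandomForest.trainClassifier(
      data, numClasses = 2, categoricalFeaturesInfo = Map[Int, Int](),
      numTrees = 10, featureSubsetStrategy = "auto",
      impurity = "gini", maxDepth = 4, maxBins = 32, seed = 42)

    val modelDir = "hdfs:///tmp/rf-model"
    model.save(sc, modelDir)

    // The same call as line 116 of the streaming job above.
    val loaded = RandomForestModel.load(sc, modelDir)
    println(loaded.toDebugString)

    sc.stop()
  }
}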




Re: Error not found value sqlContext

2015-11-19 Thread Michael Armbrust
http://spark.apache.org/docs/latest/sql-programming-guide.html#upgrading-from-spark-sql-10-12-to-13

On Thu, Nov 19, 2015 at 4:19 AM, satish chandra j 
wrote:

> HI All,
> We have recently migrated from Spark 1.2.1 to Spark 1.4.0. I am fetching
> data from an RDBMS using JDBCRDD and registering it as a temp table to
> perform SQL queries.
>
> Below approach is working fine in Spark 1.2.1:
>
> JDBCRDD --> apply map using Case Class --> apply createSchemaRDD -->
> registerTempTable --> perform SQL Query
>
> but now as createSchemaRDD is not supported in Spark 1.4.0
>
> JDBCRDD --> apply map using Case Class with* .toDF()* -->
> registerTempTable --> perform SQL query on temptable
>
>
> JDBCRDD --> apply map using Case Class --> RDD*.toDF()*.registerTempTable
> --> perform SQL query on temptable
>
> The only solution I find everywhere is to use "import sqlContext.implicits._"
> after val SQLContext = new org.apache.spark.sql.SQLContext(sc)
>
> But it fails with the following two generic errors:
>
> *1. error: not found: value sqlContext*
>
> *2. value toDF is not a member of org.apache.spark.rdd.RDD*
>
>
>
>
>
>
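
One detail worth double-checking (an observation from the snippet quoted above,
not a confirmed diagnosis): the val is named SQLContext while the import refers
to sqlContext, and import sqlContext.implicits._ only resolves against a value
with exactly that name. Below is a minimal spark-shell sketch of the toDF()
route; the case class and sample RDD are hypothetical stand-ins for the JDBCRDD
result.

import org.apache.spark.sql.SQLContext

// Hypothetical stand-ins for the JDBC rows.
case class Person(id: Int, name: String)
val personRdd = sc.parallelize(Seq((1, "alice"), (2, "bob")))

val sqlContext = new SQLContext(sc)   // lower-case val name, matching the import below
import sqlContext.implicits._         // provides .toDF() on RDDs of case classes

val df = personRdd.map { case (id, name) => Person(id, name) }.toDF()
df.registerTempTable("people")
sqlContext.sql("SELECT name FROM people WHERE id > 1").show()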


Configuring Log4J (Spark 1.5 on EMR 4.1)

2015-11-19 Thread Afshartous, Nick

Hi,

On Spark 1.5 on EMR 4.1 the message below appears in stderr in the Yarn UI.

  ERROR StatusLogger No log4j2 configuration file found. Using default 
configuration: logging only errors to the console.

I do see that there is

   /usr/lib/spark/conf/log4j.properties

Can someone please advise on how to setup log4j properly.

Thanks,
--
  Nick

Notice: This communication is for the intended recipient(s) only and may 
contain confidential, proprietary, legally protected or privileged information 
of Turbine, Inc. If you are not the intended recipient(s), please notify the 
sender at once and delete this communication. Unauthorized use of the 
information in this communication is strictly prohibited and may be unlawful. 
For those recipients under contract with Turbine, Inc., the information in this 
communication is subject to the terms and conditions of any applicable 
contracts or agreements.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Drop multiple columns in the DataFrame API

2015-11-19 Thread Benjamin Fradet
Hi everyone,

I was wondering if there is a better way to drop multiple columns from a
DataFrame, or why there is no drop(cols: Column*) method in the DataFrame
API.

Indeed, I tend to write code like this:

val filteredDF = df.drop("colA")
   .drop("colB")
   .drop("colC")
//etc

which is a bit lengthy, or:

val colsToRemove = Seq("colA", "colB", "colC", etc)
val filteredDF = df.select(df.columns
  .filter(colName => !colsToRemove.contains(colName))
  .map(colName => new Column(colName)): _*)

which is, I think, a bit ugly.

Thanks,

-- 
Ben Fradet.
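
A middle ground between the chained drops and the select-based filtering is to
fold the single-column drop over a sequence of names. A small sketch of such a
helper (not an official API; df is the DataFrame from the example above):

import org.apache.spark.sql.DataFrame

// Drop every column in `cols` by folding the single-column drop over the DataFrame.
def dropColumns(df: DataFrame, cols: Seq[String]): DataFrame =
  cols.foldLeft(df)((acc, colName) => acc.drop(colName))

val filteredDF = dropColumns(df, Seq("colA", "colB", "colC"))

Since drop(colName) is a no-op when the schema does not contain the column, the
helper is safe to call with a superset of the actual columns.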


Re: Blocked REPL commands

2015-11-19 Thread Jakob Odersky
that definitely looks like a bug, go ahead with filing an issue
I'll check the scala repl source code to see what, if any, other commands
there are that should be disabled

On 19 November 2015 at 12:54, Jacek Laskowski  wrote:

> Hi,
>
> Dunno the answer, but :reset should be blocked, too, for obvious reasons.
>
> ➜  spark git:(master) ✗ ./bin/spark-shell
> ...
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 1.6.0-SNAPSHOT
>   /_/
>
> Using Scala version 2.11.7 (Java HotSpot(TM) 64-Bit Server VM, Java
> 1.8.0_66)
> Type in expressions to have them evaluated.
> Type :help for more information.
>
> scala> :reset
> Resetting interpreter state.
> Forgetting this session history:
>
>
>  @transient val sc = {
>val _sc = org.apache.spark.repl.Main.createSparkContext()
>println("Spark context available as sc.")
>_sc
>  }
>
>
>  @transient val sqlContext = {
>val _sqlContext = org.apache.spark.repl.Main.createSQLContext()
>println("SQL context available as sqlContext.")
>_sqlContext
>  }
>
> import org.apache.spark.SparkContext._
> import sqlContext.implicits._
> import sqlContext.sql
> import org.apache.spark.sql.functions._
> ...
>
> scala> import org.apache.spark._
> import org.apache.spark._
>
> scala> val sc = new SparkContext("local[*]", "shell", new SparkConf)
> ...
> org.apache.spark.SparkException: Only one SparkContext may be running
> in this JVM (see SPARK-2243). To ignore this error, set
> spark.driver.allowMultipleContexts = true. The currently running
> SparkContext was created at:
> org.apache.spark.SparkContext.(SparkContext.scala:82)
> ...
>
> Guess I should file an issue?
>
> Pozdrawiam,
> Jacek
>
> --
> Jacek Laskowski | https://medium.com/@jaceklaskowski/ |
> http://blog.jaceklaskowski.pl
> Mastering Apache Spark
> https://jaceklaskowski.gitbooks.io/mastering-apache-spark/
> Follow me at https://twitter.com/jaceklaskowski
> Upvote at http://stackoverflow.com/users/1305344/jacek-laskowski
>
>
> On Thu, Nov 19, 2015 at 8:44 PM, Jakob Odersky  wrote:
> > I was just going through the spark shell code and saw this:
> >
> > private val blockedCommands = Set("implicits", "javap", "power",
> "type",
> > "kind")
> >
> > What is the reason as to why these commands are blocked?
> >
> > thanks,
> > --Jakob
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
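
To make the suggestion concrete, blocking :reset would presumably be a one-line
change to the set Jakob quoted; a hypothetical sketch, not the actual Spark
source:

private val blockedCommands = Set("implicits", "javap", "power", "type", "kind", "reset")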


Re: Streaming Job gives error after changing to version 1.5.2

2015-11-19 Thread swetha kasireddy
That was actually an issue with our Mesos.

On Wed, Nov 18, 2015 at 5:29 PM, Tathagata Das  wrote:

> If possible, could you give us the root cause and solution for future
> readers of this thread.
>
> On Wed, Nov 18, 2015 at 6:37 AM, swetha kasireddy <
> swethakasire...@gmail.com> wrote:
>
>> It works fine after some changes.
>>
>> -Thanks,
>> Swetha
>>
>> On Tue, Nov 17, 2015 at 10:22 PM, Tathagata Das 
>> wrote:
>>
>>> Can you verify that the cluster is running the correct version of Spark.
>>> 1.5.2.
>>>
>>> On Tue, Nov 17, 2015 at 7:23 PM, swetha kasireddy <
>>> swethakasire...@gmail.com> wrote:
>>>
 Sorry, compile scope makes it work locally. But the cluster
 still seems to have issues with provided scope. Basically it
 does not seem to process any records; no data is shown in any of the tabs
 of the Streaming UI except the Streaming tab. Executors, Storage, Stages
 etc. show empty RDDs.

 On Tue, Nov 17, 2015 at 7:19 PM, swetha kasireddy <
 swethakasire...@gmail.com> wrote:

> Hi TD,
>
> Basically, I see two issues. With the provided scope the job
> does not start locally. It does start in the cluster, but it seems no data is
> getting processed.
>
> Thanks,
> Swetha
>
> On Tue, Nov 17, 2015 at 7:04 PM, Tim Barthram  > wrote:
>
>> If you are running a local context, could it be that you should use:
>>
>>
>>
>> provided
>>
>>
>>
>> ?
>>
>>
>>
>> Thanks,
>>
>> Tim
>>
>>
>>
>> *From:* swetha kasireddy [mailto:swethakasire...@gmail.com]
>> *Sent:* Wednesday, 18 November 2015 2:01 PM
>> *To:* Tathagata Das
>> *Cc:* user
>> *Subject:* Re: Streaming Job gives error after changing to version
>> 1.5.2
>>
>>
>>
>> This error I see locally.
>>
>>
>>
>> On Tue, Nov 17, 2015 at 5:44 PM, Tathagata Das 
>> wrote:
>>
>> Are you running 1.5.2-compiled jar on a Spark 1.5.2 cluster?
>>
>>
>>
>> On Tue, Nov 17, 2015 at 5:34 PM, swetha 
>> wrote:
>>
>>
>>
>> Hi,
>>
>> I see  java.lang.NoClassDefFoundError after changing the Streaming job
>> version to 1.5.2. Any idea as to why this is happening? Following are
>> my
>> dependencies and the error that I get.
>>
>> <dependency>
>>   <groupId>org.apache.spark</groupId>
>>   <artifactId>spark-core_2.10</artifactId>
>>   <version>${sparkVersion}</version>
>>   <scope>provided</scope>
>> </dependency>
>>
>> <dependency>
>>   <groupId>org.apache.spark</groupId>
>>   <artifactId>spark-streaming_2.10</artifactId>
>>   <version>${sparkVersion}</version>
>>   <scope>provided</scope>
>> </dependency>
>>
>> <dependency>
>>   <groupId>org.apache.spark</groupId>
>>   <artifactId>spark-sql_2.10</artifactId>
>>   <version>${sparkVersion}</version>
>>   <scope>provided</scope>
>> </dependency>
>>
>> <dependency>
>>   <groupId>org.apache.spark</groupId>
>>   <artifactId>spark-hive_2.10</artifactId>
>>   <version>${sparkVersion}</version>
>>   <scope>provided</scope>
>> </dependency>
>>
>> <dependency>
>>   <groupId>org.apache.spark</groupId>
>>   <artifactId>spark-streaming-kafka_2.10</artifactId>
>>   <version>${sparkVersion}</version>
>> </dependency>
>>
>>
>> Exception in thread "main" java.lang.NoClassDefFoundError:
>> org/apache/spark/streaming/StreamingContext
>> at java.lang.Class.getDeclaredMethods0(Native Method)
>> at java.lang.Class.privateGetDeclaredMethods(Class.java:2693)
>> at java.lang.Class.privateGetMethodRecursive(Class.java:3040)
>> at java.lang.Class.getMethod0(Class.java:3010)
>> at java.lang.Class.getMethod(Class.java:1776)
>> at
>> com.intellij.rt.execution.application.AppMain.main(AppMain.java:125)
>> Caused by: java.lang.ClassNotFoundException:
>> org.apache.spark.streaming.StreamingContext
>> at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
>> at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
>> at java.security.AccessController.doPrivileged(Native Method)
>> at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>> at
>> sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Streaming-Job-gives-error-after-changing-to-version-1-5-2-tp25406.html
>> Sent from the Apache Spark User List mailing list archive at
>> Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional 

Re: ClassNotFound for exception class in Spark 1.5.x

2015-11-19 Thread Zsolt Tóth
Hi Tamás,

the exception class is in the application jar, I'm using the spark-submit
script.

2015-11-19 11:54 GMT+01:00 Tamas Szuromi :

> Hi Zsolt,
>
> How you load the jar and how you prepend it to the classpath?
>
> Tamas
>
>
>
>
> On 19 November 2015 at 11:02, Zsolt Tóth  wrote:
>
>> Hi,
>>
>> I try to throw an exception of my own exception class (MyException
>> extends SparkException) on one of the executors. This works fine on Spark
>> 1.3.x, 1.4.x but throws a deserialization/ClassNotFound exception on Spark
>> 1.5.x. This happens only when I throw it on an executor, on the driver it
>> succeeds. I'm using Spark in yarn-cluster mode.
>>
>> Is this a known issue? Is there any workaround for it?
>>
>> StackTrace:
>>
>> 15/11/18 15:00:17 WARN spark.ThrowableSerializationWrapper: Task exception 
>> could not be deserialized
>> java.lang.ClassNotFoundException: org.example.MyException
>>  at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>>  at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>>  at java.security.AccessController.doPrivileged(Native Method)
>>  at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>>  at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>>  at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>>  at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>>  at java.lang.Class.forName0(Native Method)
>>  at java.lang.Class.forName(Class.java:270)
>>  at 
>> org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:67)
>>  at 
>> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
>>  at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
>>  at 
>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
>>  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>>  at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>>  at 
>> org.apache.spark.ThrowableSerializationWrapper.readObject(TaskEndReason.scala:163)
>>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>  at 
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>  at 
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>  at java.lang.reflect.Method.invoke(Method.java:606)
>>  at 
>> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>>  at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>>  at 
>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>>  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>>  at 
>> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>>  at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>>  at 
>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>>  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>>  at 
>> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>>  at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>>  at 
>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>>  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>>  at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>>  at 
>> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:72)
>>  at 
>> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:98)
>>  at 
>> org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$2.apply$mcV$sp(TaskResultGetter.scala:108)
>>  at 
>> org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$2.apply(TaskResultGetter.scala:105)
>>  at 
>> org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$2.apply(TaskResultGetter.scala:105)
>>  at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
>>  at 
>> org.apache.spark.scheduler.TaskResultGetter$$anon$3.run(TaskResultGetter.scala:105)
>>  at 
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>  at 
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>  at java.lang.Thread.run(Thread.java:745)
>> 15/11/18 15:00:17 ERROR scheduler.TaskResultGetter: Could not deserialize 
>> TaskEndReason: ClassNotFound with classloader 
>> org.apache.spark.util.MutableURLClassLoader@7578da02
>> 15/11/18 15:00:17 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 1.0 
>> (TID 30, hadoop2.local.dmlab.hu): UnknownReason
>>
>> Regards,
>> Zsolt
>>
>
>


Moving avg in spark streaming

2015-11-19 Thread anshu shukla
Is there any formal way to do a moving average over a fixed window duration?

I calculated a simple moving average by creating a count stream and a sum
stream; then joined them and finally calculated the mean. This was not per
time window since time periods were part of the tuples.

-- 
Thanks & Regards,
Anshu Shukla
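
For a moving average tied to processing-time windows (rather than the
tuple-timestamp approach described above), a minimal sketch over a hypothetical
DStream[Double] named values, using a 60-second window that slides every 10
seconds:

import org.apache.spark.streaming.Seconds

// Keep running (sum, count) pairs per window, then divide to get the mean.
val sumAndCount = values.map(v => (v, 1L))
  .reduceByWindow(
    (a, b) => (a._1 + b._1, a._2 + b._2), // combine partial sums and counts
    Seconds(60),                          // window duration
    Seconds(10))                          // slide duration

val movingAvg = sumAndCount.map { case (sum, count) => sum / count }
movingAvg.print()

There is also an overload of reduceByWindow that takes an inverse reduce
function, which avoids recomputing the whole window on each slide at the cost
of requiring checkpointing.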


Re: How to clear the temp files that gets created by shuffle in Spark Streaming

2015-11-19 Thread swetha kasireddy
OK. We have a long-running streaming job. I was thinking that maybe we
should have a cron to clear files that are older than 2 days. What would be
an appropriate way to do that?

On Wed, Nov 18, 2015 at 7:43 PM, Ted Yu  wrote:

> Have you seen SPARK-5836 ?
> Note TD's comment at the end.
>
> Cheers
>
> On Wed, Nov 18, 2015 at 7:28 PM, swetha  wrote:
>
>> Hi,
>>
>> We have a lot of temp files that get created due to shuffles caused by
>> group by. How do we clear the files that get created by the intermediate
>> operations in group by?
>>
>>
>> Thanks,
>> Swetha
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-clear-the-temp-files-that-gets-created-by-shuffle-in-Spark-Streaming-tp25425.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


FastUtil DataStructures in Spark

2015-11-19 Thread swetha
Hi,

Has anybody used the FastUtil equivalent of a HashSet for Strings in Spark? Any
example would be of great help.

Thanks,
Swetha



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/FastUtil-DataStructures-in-Spark-tp25429.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
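
For what it is worth, a minimal sketch of using fastutil's ObjectOpenHashSet
inside mapPartitions to de-duplicate strings per partition, assuming the
fastutil jar is on the executor classpath:

import it.unimi.dsi.fastutil.objects.ObjectOpenHashSet

val rdd = sc.parallelize(Seq("a", "b", "a", "c"))

// Per-partition de-duplication with fastutil's open-addressing hash set.
val deduped = rdd.mapPartitions { iter =>
  val seen = new ObjectOpenHashSet[String]()
  iter.filter(s => seen.add(s)) // add() returns false when the string is already present
}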



Re: Re: driver ClassNotFoundException when MySQL JDBC exceptions are thrown on executor

2015-11-19 Thread Jeff Zhang
Seems your jdbc url is not correct. Should be jdbc:mysql://
192.168.41.229:3306

On Thu, Nov 19, 2015 at 6:03 PM,  wrote:

> hi guy,
>
>I also found  --driver-class-path and spark.driver.extraClassPath
> is not working when I'm accessing mysql driver in my spark APP.
>
>the attache is my config for my APP.
>
>
> here are my command and the  logs of the failure i encountted.
>
>
> [root@h170 spark]#  bin/spark-shell --master spark://h170:7077
> --driver-class-path lib/mysql-connector-java-5.1.32-bin.jar --jars
> lib/mysql-connector-java-5.1.32-bin.jar
>
> log4j:WARN No appenders could be found for logger
> (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
>
> log4j:WARN Please initialize the log4j system properly.
>
> log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for
> more info.
>
> Using Spark's repl log4j profile:
> org/apache/spark/log4j-defaults-repl.properties
>
> To adjust logging level use sc.setLogLevel("INFO")
>
> Welcome to
>
>     __
>
>  / __/__  ___ _/ /__
>
> _\ \/ _ \/ _ `/ __/  '_/
>
>/___/ .__/\_,_/_/ /_/\_\   version 1.5.2
>
>   /_/
>
>
> Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java
> 1.7.0_79)
>
> Type in expressions to have them evaluated.
>
> Type :help for more information.
>
> 15/11/19 17:51:33 WARN MetricsSystem: Using default name DAGScheduler for
> source because spark.app.id is not set.
>
> Spark context available as sc.
>
> 15/11/19 17:51:35 WARN Connection: BoneCP specified but not present in
> CLASSPATH (or one of dependencies)
>
> 15/11/19 17:51:35 WARN Connection: BoneCP specified but not present in
> CLASSPATH (or one of dependencies)
>
> 15/11/19 17:51:48 WARN ObjectStore: Version information not found in
> metastore. hive.metastore.schema.verification is not enabled so recording
> the schema version 1.2.0
>
> 15/11/19 17:51:48 WARN ObjectStore: Failed to get database default,
> returning NoSuchObjectException
>
> 15/11/19 17:51:50 WARN NativeCodeLoader: Unable to load native-hadoop
> library for your platform... using builtin-java classes where applicable
>
> 15/11/19 17:51:50 WARN Connection: BoneCP specified but not present in
> CLASSPATH (or one of dependencies)
>
> 15/11/19 17:51:50 WARN Connection: BoneCP specified but not present in
> CLASSPATH (or one of dependencies)
>
> SQL context available as sqlContext.
>
>
> scala> val jdbcDF = sqlContext.read.format("jdbc").options(Map("url" ->
> "mysql://192.168.41.229:3306",  "dbtable" -> "sqoop.SQOOP_ROOT")).load()
>
> java.sql.SQLException: No suitable driver found for mysql://
> 192.168.41.229:3306
>
> at java.sql.DriverManager.getConnection(DriverManager.java:596)
>
> at java.sql.DriverManager.getConnection(DriverManager.java:187)
>
> at
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anonfun$getConnector$1.apply(JDBCRDD.scala:188)
>
> at
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anonfun$getConnector$1.apply(JDBCRDD.scala:181)
>
> at
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:121)
>
> at
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.(JDBCRelation.scala:91)
>
> at
> org.apache.spark.sql.execution.datasources.jdbc.DefaultSource.createRelation(DefaultSource.scala:60)
>
> at
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:125)
>
> at
> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:114)
>
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:19)
>
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:24)
>
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:26)
>
> at $iwC$$iwC$$iwC$$iwC$$iwC.(:28)
>
> at $iwC$$iwC$$iwC$$iwC.(:30)
>
> at $iwC$$iwC$$iwC.(:32)
>
> at $iwC$$iwC.(:34)
>
> at $iwC.(:36)
>
> at (:38)
>
> at .(:42)
>
> at .()
>
> at .(:7)
>
> at .()
>
> at $print()
>
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>
> at java.lang.reflect.Method.invoke(Method.java:606)
>
> at
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
>
> at
> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1340)
>
> at
> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
>
> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
>
> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
>
> at
> org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
>
> at
> 
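
Putting the corrected URL prefix together with the rest of the options, a
sketch of the corrected read, assuming the MySQL connector jar is shipped with
--jars as in the command above:

val jdbcDF = sqlContext.read
  .format("jdbc")
  .options(Map(
    "url"     -> "jdbc:mysql://192.168.41.229:3306", // note the jdbc:mysql:// prefix
    "driver"  -> "com.mysql.jdbc.Driver",
    "dbtable" -> "sqoop.SQOOP_ROOT"))
  .load()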

Why Spark Streaming keeps all batches in memory after processing?

2015-11-19 Thread Artem Moskvin
Hello there!

I wonder why Spark Streaming keeps all processed batches in memory. It
leads to out-of-memory errors on executors, but I really don't need the batches
after processing. Can it be configured somewhere so that batches are not
kept in memory after processing?

Respectfully,
Artem Moskvin
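
Two knobs that influence how long generated RDDs are kept around, shown as a
sketch; whether they help depends on why the batches are being retained (for
example slow processing, or references held by the application itself):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("bounded-retention-sketch")
  .set("spark.streaming.unpersist", "true") // let Spark unpersist processed batch RDDs

val ssc = new StreamingContext(conf, Seconds(10))
ssc.remember(Minutes(2)) // keep generated RDDs for at most roughly two minutes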


Re: ClassNotFound for exception class in Spark 1.5.x

2015-11-19 Thread Tamas Szuromi
Hi Zsolt,

How you load the jar and how you prepend it to the classpath?

Tamas




On 19 November 2015 at 11:02, Zsolt Tóth  wrote:

> Hi,
>
> I try to throw an exception of my own exception class (MyException extends
> SparkException) on one of the executors. This works fine on Spark 1.3.x,
> 1.4.x but throws a deserialization/ClassNotFound exception on Spark 1.5.x.
> This happens only when I throw it on an executor, on the driver it
> succeeds. I'm using Spark in yarn-cluster mode.
>
> Is this a known issue? Is there any workaround for it?
>
> StackTrace:
>
> 15/11/18 15:00:17 WARN spark.ThrowableSerializationWrapper: Task exception 
> could not be deserialized
> java.lang.ClassNotFoundException: org.example.MyException
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:270)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:67)
>   at 
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
>   at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> org.apache.spark.ThrowableSerializationWrapper.readObject(TaskEndReason.scala:163)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:72)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:98)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$2.apply$mcV$sp(TaskResultGetter.scala:108)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$2.apply(TaskResultGetter.scala:105)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$2.apply(TaskResultGetter.scala:105)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$3.run(TaskResultGetter.scala:105)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> 15/11/18 15:00:17 ERROR scheduler.TaskResultGetter: Could not deserialize 
> TaskEndReason: ClassNotFound with classloader 
> org.apache.spark.util.MutableURLClassLoader@7578da02
> 15/11/18 15:00:17 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 1.0 
> (TID 30, hadoop2.local.dmlab.hu): UnknownReason
>
> Regards,
> Zsolt
>


spark streaming problem saveAsTextFiles() does not write valid JSON to HDFS

2015-11-19 Thread Andy Davidson
I am working on a simple POS. I am running into a really strange problem. I
wrote a Java streaming app that collects tweets using the spark twitter
package and stores them to disk in JSON format. I noticed that when I run the
code on my Mac, the files are written to the local file system as I expect,
i.e. in valid JSON format. The key names are double quoted. Boolean values
are the words true or false in lower case.



When I run in my cluster the only difference is that I call
data.saveAsTextFiles() using an hdfs: URI instead of file:///. When
the files are written to HDFS the JSON is not valid, e.g. key names are
single quoted, not double quoted, and boolean values are the strings False or
True (notice they start with upper case). I suspect there will be other
problems. Any idea what I am doing wrong?



I am using spark-1.5.1-bin-hadoop2.6



import twitter4j.Status;

import com.fasterxml.jackson.databind.ObjectMapper;




   private static ObjectMapper mapper = new ObjectMapper();

static {

mapper.setVisibility(PropertyAccessor.FIELD,
JsonAutoDetect.Visibility.ANY);

mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES,
false);

}


   JavaDStream<String> jsonTweets = tweets.map(mapStatusToJson);

DStream<String> data = jsonTweets.dstream();

data.saveAsTextFiles(outputUri, null);



class MapStatusToJson implements Function<Status, String> {

private static final long serialVersionUID = -2882921625604816575L;



@Override

public String call(Status status) throws Exception {

return mapper.writeValueAsString(status);



}



I have been using pyspark to explore the data.



dataURL = "hdfs:///smallSample"

tweetStrings = sc.textFile(dataURL) # I looked at source it decoded as UTF-8

def str2JSON(statusStr):

"""If string is not valid JSON return 'none' and creates a log.warn()"""

try:

ret = json.loads(statusStr)

return ret

except Exception as exc :

logging.warning("bad JSON")

logging.warning(exc)

logging.warning(statusStr)

numJSONExceptions.add(1)

return None #null value





#jTweets = tweetStrings.map(lambda x: json.loads(x)).take(10)

jTweets = tweetStrings.map(str2JSON).take(10)



If I call print tweetStrings.take(1)

I would get back the following string (it's really long; I only provided part
of it):


{'favorited': False, 'inReplyToStatusId': -1, 'inReplyToScreenName': None,
'urlentities': [{'end': 126, 'expandedURL': 'http://panth.rs/pZ9Cvv',


If I copy one of the hdfs part files locally I would see something similar.
So I think the problem has something to do with DStream.saveAsTextFiles().



I do not know if this is the problem or not; however, it looks like the
system might depend on several versions of jackson and fasterxml.jackson.



Has anyone else run into this problem?



Kind regards



Andy



provided

+--- org.apache.spark:spark-streaming_2.10:1.5.1

|+--- org.apache.spark:spark-core_2.10:1.5.1

||+--- org.apache.avro:avro-mapred:1.7.7

|||+--- org.apache.avro:avro-ipc:1.7.7

||||+--- org.apache.avro:avro:1.7.7

|||||+--- org.codehaus.jackson:jackson-core-asl:1.9.13

|||||+--- org.codehaus.jackson:jackson-mapper-asl:1.9.13

||||||\---
org.codehaus.jackson:jackson-core-asp:1.9.13





||+--- org.apache.hadoop:hadoop-client:2.2.0

|||+--- org.apache.hadoop:hadoop-common:2.2.0

||||+--- org.slf4j:slf4j-api:1.7.5 -> 1.7.10

||||+--- org.codehaus.jackson:jackson-core-asl:1.8.8 ->
1.9.13





|||\--- com.fasterxml.jackson.core:jackson-databind:2.3.1 ->
2.4.4

||| +---
com.fasterxml.jackson.core:jackson-annotations:2.4.0 -> 2.4.4

||| \--- com.fasterxml.jackson.core:jackson-core:2.4.4

||+--- com.sun.jersey:jersey-server:1.9 (*)

||+--- com.sun.jersey:jersey-core:1.9

||+--- org.apache.mesos:mesos:0.21.1

||+--- io.netty:netty-all:4.0.29.Final

||+--- com.clearspring.analytics:stream:2.7.0

||+--- io.dropwizard.metrics:metrics-core:3.1.2

|||\--- org.slf4j:slf4j-api:1.7.7 -> 1.7.10

||+--- io.dropwizard.metrics:metrics-jvm:3.1.2

|||+--- io.dropwizard.metrics:metrics-core:3.1.2 (*)

|||\--- org.slf4j:slf4j-api:1.7.7 -> 1.7.10

||+--- io.dropwizard.metrics:metrics-json:3.1.2

|||+--- io.dropwizard.metrics:metrics-core:3.1.2 (*)

|||+--- com.fasterxml.jackson.core:jackson-databind:2.4.2 ->
2.4.4 (*)

|||\--- org.slf4j:slf4j-api:1.7.7 -> 1.7.10

||+--- io.dropwizard.metrics:metrics-graphite:3.1.2

|||+--- io.dropwizard.metrics:metrics-core:3.1.2 (*)

|||\--- org.slf4j:slf4j-api:1.7.7 -> 1.7.10

||+--- com.fasterxml.jackson.core:jackson-databind:2.4.4 (*)

||+--- 

Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.2 when compiling spark-1.5.2

2015-11-19 Thread ck...@126.com
Hey everyone

when compiling spark-1.5.2 using IntelliJ IDEA
there is an error like this:

[ERROR] Failed to execute goal 
net.alchim31.maven:scala-maven-plugin:3.2.2:compile (scala-compile-first) on 
project spark-core_2.10: wrap: java.lang.ClassNotFoundException: 
xsbt.CompilerInterface: invalid LOC header (bad signature) -> [Help 1]

and I don't know how to fix it.

Any help is appreciated.

Thanks,





scala-maven-plugin[INFO] Building Spark Project Core 1.6.0-SNAPSHOT
[INFO] 
[INFO] 
[INFO] --- maven-enforcer-plugin:1.4:enforce (enforce-versions) @ 
spark-core_2.10 ---
[INFO] 
[INFO] --- scala-maven-plugin:3.2.2:add-source (eclipse-add-source) @ 
spark-core_2.10 ---
[INFO] Add Source directory: D:\spark\spark\core\src\main\scala
[INFO] Add Test Source directory: D:\spark\spark\core\src\test\scala
[INFO] 
[INFO] --- maven-remote-resources-plugin:1.5:process (default) @ 
spark-core_2.10 ---
[INFO] 
[INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ 
spark-core_2.10 ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] Copying 21 resources
[INFO] Copying 3 resources
[INFO] 
[INFO] --- scala-maven-plugin:3.2.2:compile (scala-compile-first) @ 
spark-core_2.10 ---
[WARNING] Zinc server is not available at port 3030 - reverting to normal 
incremental compile
[INFO] Using incremental compilation
[INFO] Compiling 484 Scala sources and 59 Java sources to 
D:\spark\spark\core\target\scala-2.10\classes...
[INFO] 
[INFO] Reactor Summary:
[INFO] 
[INFO] Spark Project Parent POM ... SUCCESS [  3.278 s]
[INFO] Spark Project Test Tags  SUCCESS [  2.841 s]
[INFO] Spark Project Launcher . SUCCESS [  8.002 s]
[INFO] Spark Project Networking ... SUCCESS [  3.163 s]
[INFO] Spark Project Shuffle Streaming Service  SUCCESS [  2.563 s]
[INFO] Spark Project Unsafe ... SUCCESS [  2.391 s]
[INFO] Spark Project Core . FAILURE [  6.500 s]
[INFO] Spark Project Bagel  SKIPPED
[INFO] Spark Project GraphX ... SKIPPED
[INFO] Spark Project Streaming  SKIPPED
[INFO] Spark Project Catalyst . SKIPPED
[INFO] Spark Project SQL .. SKIPPED
[INFO] Spark Project ML Library ... SKIPPED
[INFO] Spark Project Tools  SKIPPED
[INFO] Spark Project Hive . SKIPPED
[INFO] Spark Project REPL . SKIPPED
[INFO] Spark Project Assembly . SKIPPED
[INFO] Spark Project External Twitter . SKIPPED
[INFO] Spark Project External Flume Sink .. SKIPPED
[INFO] Spark Project External Flume ... SKIPPED
[INFO] Spark Project External Flume Assembly .. SKIPPED
[INFO] Spark Project External MQTT  SKIPPED
[INFO] Spark Project External MQTT Assembly ... SKIPPED
[INFO] Spark Project External ZeroMQ .. SKIPPED
[INFO] Spark Project External Kafka ... SKIPPED
[INFO] Spark Project Examples . SKIPPED
[INFO] Spark Project External Kafka Assembly .. SKIPPED
[INFO] 
[INFO] BUILD FAILURE
[INFO] 
[INFO] Total time: 30.546 s
[INFO] Finished at: 2015-11-20T10:57:32+08:00
[INFO] Final Memory: 65M/999M
[INFO] 
[ERROR] Failed to execute goal 
net.alchim31.maven:scala-maven-plugin:3.2.2:compile (scala-compile-first) on 
project spark-core_2.10: wrap: java.lang.ClassNotFoundException: 
xsbt.CompilerInterface: invalid LOC header (bad signature) -> [Help 1]
[ERROR] 
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e 
switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please 
read the following articles:
[ERROR] [Help 1] 
http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
[ERROR] 
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn  -rf :spark-core_2.10




陈凯
Mobile: 15700162786
Email: ck...@126.com
QQ: 941831413



RE: Configuring Log4J (Spark 1.5 on EMR 4.1)

2015-11-19 Thread Afshartous, Nick

< log4j.properties file only exists on the master and not the slave nodes, so 
you are probably running into 
https://issues.apache.org/jira/browse/SPARK-11105, which has already been fixed 
in the not-yet-released Spark 1.6.0. EMR will upgrade to Spark 1.6.0 once it is 
released.

Thanks for the info, though this is a single-node cluster so that can't be the 
cause of the error (which is in the driver log).
--
  Nick

From: Jonathan Kelly [jonathaka...@gmail.com]
Sent: Thursday, November 19, 2015 6:45 PM
To: Afshartous, Nick
Cc: user@spark.apache.org
Subject: Re: Configuring Log4J (Spark 1.5 on EMR 4.1)

This file only exists on the master and not the slave nodes, so you are 
probably running into https://issues.apache.org/jira/browse/SPARK-11105, which 
has already been fixed in the not-yet-released Spark 1.6.0. EMR will upgrade to 
Spark 1.6.0 once it is released.

~ Jonathan

On Thu, Nov 19, 2015 at 1:30 PM, Afshartous, Nick 
> wrote:

Hi,

On Spark 1.5 on EMR 4.1 the message below appears in stderr in the Yarn UI.

  ERROR StatusLogger No log4j2 configuration file found. Using default 
configuration: logging only errors to the console.

I do see that there is

   /usr/lib/spark/conf/log4j.properties

Can someone please advise on how to setup log4j properly.

Thanks,
--
  Nick

Notice: This communication is for the intended recipient(s) only and may 
contain confidential, proprietary, legally protected or privileged information 
of Turbine, Inc. If you are not the intended recipient(s), please notify the 
sender at once and delete this communication. Unauthorized use of the 
information in this communication is strictly prohibited and may be unlawful. 
For those recipients under contract with Turbine, Inc., the information in this 
communication is subject to the terms and conditions of any applicable 
contracts or agreements.

-
To unsubscribe, e-mail: 
user-unsubscr...@spark.apache.org
For additional commands, e-mail: 
user-h...@spark.apache.org



Notice: This communication is for the intended recipient(s) only and may 
contain confidential, proprietary, legally protected or privileged information 
of Turbine, Inc. If you are not the intended recipient(s), please notify the 
sender at once and delete this communication. Unauthorized use of the 
information in this communication is strictly prohibited and may be unlawful. 
For those recipients under contract with Turbine, Inc., the information in this 
communication is subject to the terms and conditions of any applicable 
contracts or agreements.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: PySpark Lost Executors

2015-11-19 Thread Ross.Cramblit
Hmm I guess I do not - I get 'application_1445957755572_0176 does not have any 
log files.' Where can I enable log aggregation?
On Nov 19, 2015, at 11:07 AM, Ted Yu 
> wrote:

Do you have YARN log aggregation enabled ?

You can try retrieving log for the container using the following command:

yarn logs -applicationId application_1445957755572_0176 -containerId 
container_1445957755572_0176_01_03

Cheers

On Thu, Nov 19, 2015 at 8:02 AM, 
> 
wrote:
I am running Spark 1.5.2 on Yarn. My job consists of a number of SparkSQL 
transforms on a JSON data set that I load into a data frame. The data set is 
not large (~100GB) and most stages execute without any issues. However, some 
more complex stages tend to lose executors/nodes regularly. What would cause 
this to happen? The logs don’t give too much information -

15/11/19 15:53:43 ERROR YarnScheduler: Lost executor 2 on 
ip-10-0-0-136.ec2.internal: Yarn deallocated the executor 2 (container 
container_1445957755572_0176_01_03)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 142.0 in stage 33.0 (TID 8331, 
ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 133.0 in stage 33.0 (TID 8322, 
ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 79.0 in stage 33.0 (TID 8268, 
ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 141.0 in stage 33.0 (TID 8330, 
ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 123.0 in stage 33.0 (TID 8312, 
ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 162.0 in stage 33.0 (TID 8351, 
ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 153.0 in stage 33.0 (TID 8342, 
ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 120.0 in stage 33.0 (TID 8309, 
ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 149.0 in stage 33.0 (TID 8338, 
ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 134.0 in stage 33.0 (TID 8323, 
ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
[Stage 33:===> (117 + 50) / 
200]15/11/19 15:53:46 WARN ReliableDeliverySupervisor: Association with remote 
system [akka.tcp://sparkExecutor@ip-10-0-0-136.ec2.internal:60275] has failed, 
address is now gated for [5000] ms. Reason: [Disassociated]

 - Followed by a list of lost tasks on each executor.




has any spark write orc document

2015-11-19 Thread zhangjp
Hi,
Is there any document on writing ORC with Spark, like the parquet document?
 http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files


Thanks

Re: has any spark write orc document

2015-11-19 Thread Jeff Zhang
It should be very similar to parquet from the API perspective. Please refer to
this doc:

http://hortonworks.com/hadoop-tutorial/using-hive-with-orc-from-apache-spark/


On Fri, Nov 20, 2015 at 2:59 PM, zhangjp <592426...@qq.com> wrote:

> Hi,
> Is there any document on writing ORC with Spark, like the parquet document?
>
> http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
>
> Thanks
>



-- 
Best Regards

Jeff Zhang
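
For Spark 1.5 the DataFrame reader/writer calls mirror the parquet examples; a
short sketch, assuming a HiveContext (the ORC data source ships in the
spark-hive module) and placeholder paths:

import org.apache.spark.sql.hive.HiveContext

val sqlContext = new HiveContext(sc)
val df = sqlContext.range(0, 10) // any DataFrame will do

// Write and read back ORC, mirroring the parquet section of the SQL guide.
df.write.format("orc").save("hdfs:///tmp/orc-demo")
val orcDF = sqlContext.read.format("orc").load("hdfs:///tmp/orc-demo")
orcDF.printSchema()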


Re: Reading from RabbitMq via Apache Spark Streaming

2015-11-19 Thread Daniel Carroza
Hi,

Just answered on Github:
https://github.com/Stratio/rabbitmq-receiver/issues/20

Regards

Daniel Carroza Santana

Vía de las Dos Castillas, 33, Ática 4, 3ª Planta.
28224 Pozuelo de Alarcón. Madrid.
Tel: +34 91 828 64 73 // *@stratiobd *

2015-11-19 10:02 GMT+01:00 D :

> I am trying to write a simple "Hello World" kind of application using
> spark streaming and RabbitMq, in which Apache Spark Streaming will read
> messages from RabbitMq via the RabbitMqReceiver
>  and print them in the
> console. But somehow I am not able to print the string read from RabbitMq
> into the console. The spark streaming code is printing the messages below:
>
> Value Received BlockRDD[1] at ReceiverInputDStream at 
> RabbitMQInputDStream.scala:33
> Value Received BlockRDD[2] at ReceiverInputDStream at 
> RabbitMQInputDStream.scala:33
>
>
> The message is sent to the rabbitmq via the simple code below:-
>
> package helloWorld;
>
> import com.rabbitmq.client.Channel;
> import com.rabbitmq.client.Connection;
> import com.rabbitmq.client.ConnectionFactory;
>
> public class Send {
>
>   private final static String QUEUE_NAME = "hello1";
>
>   public static void main(String[] argv) throws Exception {
> ConnectionFactory factory = new ConnectionFactory();
> factory.setHost("localhost");
> Connection connection = factory.newConnection();
> Channel channel = connection.createChannel();
>
> channel.queueDeclare(QUEUE_NAME, false, false, false, null);
> String message = "Hello World! is a code. Hi Hello World!";
> channel.basicPublish("", QUEUE_NAME, null, message.getBytes("UTF-8"));
> System.out.println(" [x] Sent '" + message + "'");
>
> channel.close();
> connection.close();
>   }
> }
>
>
> I am trying to read messages via Apache Streaming as shown below:-
>
> package rabbitmq.example;
>
> import java.util.*;
>
> import org.apache.spark.SparkConf;
> import org.apache.spark.api.java.JavaRDD;
> import org.apache.spark.api.java.function.Function;
> import org.apache.spark.streaming.Durations;
> import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
> import org.apache.spark.streaming.api.java.JavaStreamingContext;
>
> import com.stratio.receiver.RabbitMQUtils;
>
> public class RabbitMqEx {
>
> public static void main(String[] args) {
> System.out.println("CreatingSpark   Configuration");
> SparkConf conf = new SparkConf();
> conf.setAppName("RabbitMq Receiver Example");
> conf.setMaster("local[2]");
>
> System.out.println("Retreiving  Streaming   Context fromSpark   
> Conf");
> JavaStreamingContext streamCtx = new JavaStreamingContext(conf,
> Durations.seconds(2));
>
> Map<String, String> rabbitMqConParams = new HashMap<String, String>();
> rabbitMqConParams.put("host", "localhost");
> rabbitMqConParams.put("queueName", "hello1");
> System.out.println("Trying to connect to RabbitMq");
> JavaReceiverInputDStream<String> receiverStream = 
> RabbitMQUtils.createJavaStream(streamCtx, rabbitMqConParams);
> receiverStream.foreachRDD(new Function<JavaRDD<String>, Void>() {
> @Override
> public Void call(JavaRDD<String> arg0) throws Exception {
> System.out.println("Value Received " + arg0.toString());
> return null;
> }
> } );
> streamCtx.start();
> streamCtx.awaitTermination();
> }
> }
>
> The output console only has message like the following:-
>
> CreatingSpark   Configuration
> Retreiving  Streaming   Context fromSpark   Conf
> Trying to connect to RabbitMq
> Value Received BlockRDD[1] at ReceiverInputDStream at 
> RabbitMQInputDStream.scala:33
> Value Received BlockRDD[2] at ReceiverInputDStream at 
> RabbitMQInputDStream.scala:33
>
>
> In the logs I see the following:-
>
> 15/11/18 13:20:45 INFO SparkContext: Running Spark version 1.5.2
> 15/11/18 13:20:45 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 15/11/18 13:20:45 WARN Utils: Your hostname, jabong1143 resolves to a 
> loopback address: 127.0.1.1; using 192.168.1.3 instead (on interface wlan0)
> 15/11/18 13:20:45 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to 
> another address
> 15/11/18 13:20:45 INFO SecurityManager: Changing view acls to: jabong
> 15/11/18 13:20:45 INFO SecurityManager: Changing modify acls to: jabong
> 15/11/18 13:20:45 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(jabong); users 
> with modify permissions: Set(jabong)
> 15/11/18 13:20:46 INFO Slf4jLogger: Slf4jLogger started
> 15/11/18 13:20:46 INFO Remoting: Starting remoting
> 15/11/18 13:20:46 INFO Remoting: Remoting started; listening on addresses 
> :[akka.tcp://sparkDriver@192.168.1.3:42978]
> 

Re: PySpark Lost Executors

2015-11-19 Thread Ted Yu
Do you have YARN log aggregation enabled ?

You can try retrieving log for the container using the following command:

yarn logs -applicationId application_1445957755572_0176
 -containerId container_1445957755572_0176_01_03

Cheers

On Thu, Nov 19, 2015 at 8:02 AM,  wrote:

> I am running Spark 1.5.2 on Yarn. My job consists of a number of SparkSQL
> transforms on a JSON data set that I load into a data frame. The data set
> is not large (~100GB) and most stages execute without any issues. However,
> some more complex stages tend to lose executors/nodes regularly. What would
> cause this to happen? The logs don’t give too much information -
>
> 15/11/19 15:53:43 ERROR YarnScheduler: Lost executor 2 on
> ip-10-0-0-136.ec2.internal: Yarn deallocated the executor 2 (container
> container_1445957755572_0176_01_03)
> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 142.0 in stage 33.0 (TID
> 8331, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 133.0 in stage 33.0 (TID
> 8322, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 79.0 in stage 33.0 (TID
> 8268, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 141.0 in stage 33.0 (TID
> 8330, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 123.0 in stage 33.0 (TID
> 8312, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 162.0 in stage 33.0 (TID
> 8351, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 153.0 in stage 33.0 (TID
> 8342, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 120.0 in stage 33.0 (TID
> 8309, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 149.0 in stage 33.0 (TID
> 8338, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 134.0 in stage 33.0 (TID
> 8323, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
> [Stage 33:===> (117 + 50)
> / 200]15/11/19 15:53:46 WARN ReliableDeliverySupervisor: Association with
> remote system [akka.tcp://sparkExecutor@ip-10-0-0-136.ec2.internal:60275]
> has failed, address is now gated for [5000] ms. Reason: [Disassociated]
>
>  - Followed by a list of lost tasks on each executor.


Spark 1.5.3 release

2015-11-19 Thread Madabhattula Rajesh Kumar
Hi,

Please let me know Spark 1.5.3 release date details

Regards,
Rajesh


PySpark Lost Executors

2015-11-19 Thread Ross.Cramblit
I am running Spark 1.5.2 on Yarn. My job consists of a number of SparkSQL 
transforms on a JSON data set that I load into a data frame. The data set is 
not large (~100GB) and most stages execute without any issues. However, some 
more complex stages tend to lose executors/nodes regularly. What would cause 
this to happen? The logs don’t give too much information - 

15/11/19 15:53:43 ERROR YarnScheduler: Lost executor 2 on 
ip-10-0-0-136.ec2.internal: Yarn deallocated the executor 2 (container 
container_1445957755572_0176_01_03)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 142.0 in stage 33.0 (TID 8331, 
ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 133.0 in stage 33.0 (TID 8322, 
ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 79.0 in stage 33.0 (TID 8268, 
ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 141.0 in stage 33.0 (TID 8330, 
ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 123.0 in stage 33.0 (TID 8312, 
ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 162.0 in stage 33.0 (TID 8351, 
ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 153.0 in stage 33.0 (TID 8342, 
ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 120.0 in stage 33.0 (TID 8309, 
ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 149.0 in stage 33.0 (TID 8338, 
ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 134.0 in stage 33.0 (TID 8323, 
ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
[Stage 33:===> (117 + 50) / 
200]15/11/19 15:53:46 WARN ReliableDeliverySupervisor: Association with remote 
system [akka.tcp://sparkExecutor@ip-10-0-0-136.ec2.internal:60275] has failed, 
address is now gated for [5000] ms. Reason: [Disassociated]

 - Followed by a list of lost tasks on each executor.

Re: Spark streaming and custom partitioning

2015-11-19 Thread Cody Koeninger
Not sure what you mean by "no documentation regarding ways to achieve
effective communication between the 2", but the docs on integrating with
kafka are at
http://spark.apache.org/docs/latest/streaming-kafka-integration.html

As far as custom partitioners go, Learning Spark from O'Reilly has a good
section on partitioning, or you can grep the spark code for "extends
Partitioner" to see examples

On Thu, Nov 19, 2015 at 8:58 AM, Sachin Mousli  wrote:

> Hi,
>
> I would like to implement a custom partitioning algorithm in a streaming
> environment, preferably in a generic manner using a single pass. The
> only sources I could find mention Apache Kafka which adds a complexity I
> would like to avoid since there seem to be no documentation regarding
> ways to achieve effective communication between the 2.
>
> Could you give me pointers in regards to implementing a custom
> partitioner with streaming support or redirect me to related sources?
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
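
To make the "extends Partitioner" pointer concrete, a minimal sketch of a
custom partitioner; the routing logic is illustrative only, and a real
implementation would encode whatever key-to-partition mapping the algorithm
needs:

import org.apache.spark.Partitioner

// Route keys to a fixed number of partitions; null keys go to partition 0.
class ExamplePartitioner(partitions: Int) extends Partitioner {
  require(partitions > 0, "number of partitions must be positive")

  override def numPartitions: Int = partitions

  override def getPartition(key: Any): Int = key match {
    case null => 0
    case k =>
      val raw = k.hashCode % numPartitions
      if (raw < 0) raw + numPartitions else raw
  }
}

// Usage on a pair RDD, e.g. inside DStream.transform for streaming:
// val repartitioned = pairRdd.partitionBy(new ExamplePartitioner(8))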


RE: SequenceFile and object reuse

2015-11-19 Thread jeff saremi
Sandy, Ryan, Andrew
Thanks very much. I think i now understand it better.
Jeff

From: ryan.blake.willi...@gmail.com
Date: Thu, 19 Nov 2015 06:00:30 +
Subject: Re: SequenceFile and object reuse
To: sandy.r...@cloudera.com; jeffsar...@hotmail.com
CC: user@spark.apache.org

Hey Jeff, in addition to what Sandy said, there are two more reasons that this 
might not be as bad as it seems; I may be incorrect in my understanding though.
First, the "additional step" you're referring to is not likely to be adding any 
overhead; the "extra map" is really just materializing the data once (as 
opposed to zero times), which is what you want (assuming your access pattern 
couldn't be reformulated in the way Sandy described, i.e. where all the objects 
in a partition don't need to be in memory at the same time).
Secondly, even if this was an "extra" map step, it wouldn't add any extra 
stages to a given pipeline, being a "narrow" dependency, so it would likely be 
low-cost anyway.
Let me know if any of the above seems incorrect, thanks!
On Thu, Nov 19, 2015 at 12:41 AM Sandy Ryza  wrote:
Hi Jeff,
Many access patterns simply take the result of hadoopFile and use it to create 
some other object, and thus have no need for each input record to refer to a 
different object.  In those cases, the current API is more performant than an 
alternative that would create an object for each record, because it avoids the 
unnecessary overhead of creating Java objects.  As you've pointed out, this is 
at the expense of making the code more verbose when caching.
-Sandy
On Fri, Nov 13, 2015 at 10:29 AM, jeff saremi  wrote:



So we tried reading a SequenceFile in Spark and realized that all our records 
ended up being the same.
Then one of us found this:

Note: Because Hadoop's RecordReader class re-uses the same Writable object for 
each record, directly caching the returned RDD or directly passing it to an 
aggregation or shuffle operation will create many references to the same 
object. If you plan to directly cache, sort, or aggregate Hadoop writable 
objects, you should first copy them using a map function.
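A minimal sketch of that copy step (the path and Writable types below are
assumptions, not from this thread):

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("seqfile-copy"))

// Hadoop's RecordReader hands back the same Writable instances for every record...
val raw = sc.sequenceFile("hdfs:///path/to/data.seq", classOf[LongWritable], classOf[Text])

// ...so materialize immutable copies before caching, sorting or aggregating.
val copied = raw.map { case (k, v) => (k.get, v.toString) }
copied.cache()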

Can anyone shed some light on this bizarre behavior and the decisions behind it?
I would also like to know whether anyone has been able to read a binary file 
without incurring the additional map() suggested above. What format did you use?

Thanks,
Jeff


  

Receiver stage fails but Spark application stands RUNNING

2015-11-19 Thread Pierre Van Ingelandt


Hi,


I currently have two Spark Streaming applications (Spark 1.3.1), one using
a custom JMS Receiver and the other one a Kafka Receiver.


Most of the time when a job fails (i.e. something like "Job aborted due to stage
failure: Task 3 in stage 2.0 failed 4 times"), the application ends up in either

state=FINISHED and finalStatus=FAILED
or state=FAILED & finalStatus=FAILED


But in some cases when the receiver stage fails, the application stays in the
RUNNING state, without processing any data (since there is no receiver left),
instead of being marked FAILED.


Is there any explanation for that, and is there any way to avoid this
behaviour?


Thanks in advance for any insight,


Pierre

Spark streaming and custom partitioning

2015-11-19 Thread Sachin Mousli
Hi,

I would like to implement a custom partitioning algorithm in a streaming
environment, preferably in a generic manner using a single pass. The
only sources I could find mention Apache Kafka, which adds a complexity I
would like to avoid, since there seems to be no documentation regarding
ways to achieve effective communication between the two.

Could you give me pointers on implementing a custom partitioner with
streaming support, or point me to related sources?


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: PySpark Lost Executors

2015-11-19 Thread Sandy Ryza
Hi Ross,

This is most likely occurring because YARN is killing containers for
exceeding physical memory limits.  You can make this less likely to happen
by bumping spark.yarn.executor.memoryOverhead to something higher than 10%
of your spark.executor.memory.
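These settings are normally passed to spark-submit via --conf; as a rough sketch
in Scala (the PySpark SparkConf API is analogous, and the numbers below are
purely illustrative, not a recommendation for this workload):

import org.apache.spark.{SparkConf, SparkContext}

// With 8g executors the default overhead is max(384 MB, 10%) ~= 819 MB;
// raising it to 1536 MB gives the YARN container more headroom for off-heap memory.
val conf = new SparkConf()
  .setAppName("json-transforms")
  .set("spark.executor.memory", "8g")
  .set("spark.yarn.executor.memoryOverhead", "1536")
val sc = new SparkContext(conf)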

-Sandy

On Thu, Nov 19, 2015 at 8:14 AM,  wrote:

> Hmm I guess I do not - I get 'application_1445957755572_0176 does not
> have any log files.’ Where can I enable log aggregation?
>
> On Nov 19, 2015, at 11:07 AM, Ted Yu  wrote:
>
> Do you have YARN log aggregation enabled ?
>
> You can try retrieving log for the container using the following command:
>
> yarn logs -applicationId application_1445957755572_0176
>  -containerId container_1445957755572_0176_01_03
>
> Cheers
>
> On Thu, Nov 19, 2015 at 8:02 AM,  wrote:
>
>> I am running Spark 1.5.2 on Yarn. My job consists of a number of SparkSQL
>> transforms on a JSON data set that I load into a data frame. The data set
>> is not large (~100GB) and most stages execute without any issues. However,
>> some more complex stages tend to lose executors/nodes regularly. What would
>> cause this to happen? The logs don’t give too much information -
>>
>> 15/11/19 15:53:43 ERROR YarnScheduler: Lost executor 2 on
>> ip-10-0-0-136.ec2.internal: Yarn deallocated the executor 2 (container
>> container_1445957755572_0176_01_03)
>> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 142.0 in stage 33.0 (TID
>> 8331, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
>> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 133.0 in stage 33.0 (TID
>> 8322, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
>> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 79.0 in stage 33.0 (TID
>> 8268, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
>> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 141.0 in stage 33.0 (TID
>> 8330, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
>> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 123.0 in stage 33.0 (TID
>> 8312, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
>> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 162.0 in stage 33.0 (TID
>> 8351, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
>> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 153.0 in stage 33.0 (TID
>> 8342, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
>> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 120.0 in stage 33.0 (TID
>> 8309, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
>> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 149.0 in stage 33.0 (TID
>> 8338, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
>> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 134.0 in stage 33.0 (TID
>> 8323, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
>> [Stage 33:===> (117 + 50)
>> / 200]15/11/19 15:53:46 WARN ReliableDeliverySupervisor: Association with
>> remote system [akka.tcp://sparkExecutor@ip-10-0-0-136.ec2.internal:60275]
>> has failed, address is now gated for [5000] ms. Reason: [Disassociated]
>>
>>  - Followed by a list of lost tasks on each executor.
>
>
>
>


Re:

2015-11-19 Thread Dean Wampler
If you mean retaining data from past jobs, try running the history server,
documented here:

http://spark.apache.org/docs/latest/monitoring.html
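For a finished application to show up there, event logging has to be enabled
while it runs; a minimal sketch (the log directory is an assumption) is:

import org.apache.spark.{SparkConf, SparkContext}

// Write event logs that the history server (started with sbin/start-history-server.sh,
// with spark.history.fs.logDirectory pointing at the same location) can replay later.
val conf = new SparkConf()
  .setAppName("my-app")
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.dir", "hdfs:///spark-event-logs")
val sc = new SparkContext(conf)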

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
 (O'Reilly)
Typesafe 
@deanwampler 
http://polyglotprogramming.com

On Thu, Nov 19, 2015 at 3:10 AM, 金国栋  wrote:

> I don't really know what you mean by historical data, but if you have
> configured logging, you can always get historical data about jobs,
> stages, tasks etc., unless you delete the logs.
>
> Best,
> Jelly
>
> 2015-11-19 16:53 GMT+08:00 aman solanki :
>
>> Hi All,
>>
>> I want to know how one can get the historical data (jobs, stages, tasks
>> etc.) of a running Spark application.
>>
>> Please share the information regarding the same.
>>
>> Thanks,
>> Aman Solanki
>>
>
>


Re: spark with breeze error of NoClassDefFoundError

2015-11-19 Thread Ted Yu
I don't have Spark 1.4 source code on hand.

You can use the following command:

mvn dependency:tree

to find out the answer to your question.

Cheers
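For reference, the scope distinction discussed below can be sketched in the pom
like this (versions are illustrative only; confirm the Breeze version Spark
actually pulls in with mvn dependency:tree):

<!-- Illustrative only; take the real versions from `mvn dependency:tree`. -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-mllib_2.10</artifactId>
  <version>1.4.1</version>
  <!-- provided: the cluster supplies these classes at runtime, so they are not bundled -->
  <scope>provided</scope>
</dependency>
<dependency>
  <groupId>org.scalanlp</groupId>
  <artifactId>breeze_2.10</artifactId>
  <version>0.11.2</version>
  <!-- default ("compile") scope: bundled into the application jar -->
</dependency>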

On Wed, Nov 18, 2015 at 10:18 PM, Jack Yang  wrote:

> Back to my question. If I use “provided”, the jar file expects those
> libraries to be provided by the system at runtime.
>
> However, “compile” is the default scope, which means the third-party
> library will be included inside the jar file after compiling.
>
> So when I use “provided”, the error is that the class cannot be found,
> but with “compile” the error is IncompatibleClassChangeError.
>
>
>
> Ok, so can someone tell me which version of breeze and breeze-math are
> used in spark 1.4?
>
>
>
> *From:* Zhiliang Zhu [mailto:zchl.j...@yahoo.com]
> *Sent:* Thursday, 19 November 2015 5:10 PM
> *To:* Ted Yu
> *Cc:* Jack Yang; Fengdong Yu; user@spark.apache.org
>
> *Subject:* Re: spark with breeze error of NoClassDefFoundError
>
>
>
> Dear Ted,
>
> I just looked at the link you provided, it is great!
>
>
>
> As I understand it, I could also directly use other parts of Breeze (besides
> the Spark MLlib linalg package) in a Spark (Scala or Java) program after
> importing the Breeze package,
>
> is that right?
>
>
>
> Thanks a lot in advance again!
>
> Zhiliang
>
>
>
>
>
>
>
> On Thursday, November 19, 2015 1:46 PM, Ted Yu 
> wrote:
>
>
>
> Have you looked at
>
> https://github.com/scalanlp/breeze/wiki
>
>
>
> Cheers
>
>
> On Nov 18, 2015, at 9:34 PM, Zhiliang Zhu  wrote:
>
> Dear Jack,
>
>
>
> As is known, Breeze is a numerical computation package written in Scala;
> Spark MLlib also uses it as the underlying package for linear algebra.
>
> Here I am also preparing to use Breeze for nonlinear equation
> optimization; however, I could not find the exact docs or API for Breeze
> apart from the Spark linalg package...
>
>
>
> Could you point me to the official docs or API website for Breeze?
>
> Thank you in advance!
>
>
>
> Zhiliang
>
>
>
>
>
>
>
> On Thursday, November 19, 2015 7:32 AM, Jack Yang  wrote:
>
>
>
> When I tried to change “provided” to “compile”, the error changed to:
>
>
>
> Exception in thread "main" java.lang.IncompatibleClassChangeError:
> Implementing class
>
> at java.lang.ClassLoader.defineClass1(Native Method)
>
> at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
>
> at
> java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
>
> at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
>
> at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
>
> at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
>
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>
> at java.security.AccessController.doPrivileged(Native Method)
>
> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>
> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>
> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>
> at
> smartapp.smart.sparkwithscala.textMingApp.main(textMingApp.scala)
>
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>
> at java.lang.reflect.Method.invoke(Method.java:606)
>
> at
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:664)
>
> at
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169)
>
> at
> org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192)
>
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)
>
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>
> 15/11/19 10:28:29 INFO util.Utils: Shutdown hook called
>
>
>
> Meanwhile, I would prefer to use Maven to build the jar file rather than
> sbt, although sbt is indeed another option.
>
>
>
> Best regards,
>
> Jack
>
>
>
>
>
>
>
> *From:* Fengdong Yu [mailto:fengdo...@everstring.com
> ]
> *Sent:* Wednesday, 18 November 2015 7:30 PM
> *To:* Jack Yang
> *Cc:* Ted Yu; user@spark.apache.org
> *Subject:* Re: spark with breeze error of NoClassDefFoundError
>
>
>
> The simplest way is to remove all “provided” scopes in your pom.
>
> Then run ‘sbt assembly’ to build your final package, and get rid of ‘--jars’
> because the assembly already includes all dependencies.
>
>
>
>
>
>
>
>
>
>
>
>
>
> On Nov 18, 2015, at 2:15 PM, Jack Yang  wrote:
>
>
>
> So weird. Is there anything wrong with the way I made the pom file (I
> labelled them as *provided*)?
>
>
>
> Is there a missing jar I forgot to add in “--jars”?
>
>
>
> See the trace below:
>
>
>
>
>
>
>
> Exception in thread "main" 

Re: PySpark Lost Executors

2015-11-19 Thread Ted Yu
Here are the yarn-site.xml parameters related to log aggregation:


<property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
</property>

<property>
  <name>yarn.log-aggregation.retain-seconds</name>
  <value>2592000</value>
</property>

<property>
  <name>yarn.nodemanager.log-aggregation.compression-type</name>
  <value>gz</value>
</property>

<property>
  <name>yarn.nodemanager.log-aggregation.debug-enabled</name>
  <value>false</value>
</property>

<property>
  <name>yarn.nodemanager.log-aggregation.num-log-files-per-app</name>
  <value>30</value>
</property>

<property>
  <name>yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds</name>
  <value>-1</value>
</property>


On Thu, Nov 19, 2015 at 8:14 AM,  wrote:

> Hmm I guess I do not - I get 'application_1445957755572_0176 does not
> have any log files.’ Where can I enable log aggregation?
>
> On Nov 19, 2015, at 11:07 AM, Ted Yu  wrote:
>
> Do you have YARN log aggregation enabled ?
>
> You can try retrieving log for the container using the following command:
>
> yarn logs -applicationId application_1445957755572_0176
>  -containerId container_1445957755572_0176_01_03
>
> Cheers
>
> On Thu, Nov 19, 2015 at 8:02 AM,  wrote:
>
>> I am running Spark 1.5.2 on Yarn. My job consists of a number of SparkSQL
>> transforms on a JSON data set that I load into a data frame. The data set
>> is not large (~100GB) and most stages execute without any issues. However,
>> some more complex stages tend to lose executors/nodes regularly. What would
>> cause this to happen? The logs don’t give too much information -
>>
>> 15/11/19 15:53:43 ERROR YarnScheduler: Lost executor 2 on
>> ip-10-0-0-136.ec2.internal: Yarn deallocated the executor 2 (container
>> container_1445957755572_0176_01_03)
>> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 142.0 in stage 33.0 (TID
>> 8331, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
>> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 133.0 in stage 33.0 (TID
>> 8322, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
>> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 79.0 in stage 33.0 (TID
>> 8268, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
>> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 141.0 in stage 33.0 (TID
>> 8330, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
>> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 123.0 in stage 33.0 (TID
>> 8312, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
>> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 162.0 in stage 33.0 (TID
>> 8351, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
>> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 153.0 in stage 33.0 (TID
>> 8342, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
>> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 120.0 in stage 33.0 (TID
>> 8309, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
>> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 149.0 in stage 33.0 (TID
>> 8338, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
>> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 134.0 in stage 33.0 (TID
>> 8323, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
>> [Stage 33:===> (117 + 50)
>> / 200]15/11/19 15:53:46 WARN ReliableDeliverySupervisor: Association with
>> remote system [akka.tcp://sparkExecutor@ip-10-0-0-136.ec2.internal:60275]
>> has failed, address is now gated for [5000] ms. Reason: [Disassociated]
>>
>>  - Followed by a list of lost tasks on each executor.
>
>
>
>


what is algorithm to optimize function with nonlinear constraints

2015-11-19 Thread Zhiliang Zhu
Hi all,
I have an optimization problem; I have googled a lot but still did not find the 
exact algorithm or third-party open-source package to apply to it.
Its type is like this:

Objective function: f(x1, x2, ..., xn)   (n >= 100; f may be linear or non-linear)
Constraint functions:
  x1 + x2 + ... + xn = 1                                        (1)
  b1*x1 + b2*x2 + ... + bn*xn = b                               (2)
  c1*x1^2 + c2*x2^2 + ... + cn*xn^2 = c    <- nonlinear constraint   (3)
  x1, x2, ..., xn >= 0

The goal is to find the x that makes the objective function globally (or at 
least locally) the largest.

I was thinking about applying gradient descent or the Levenberg-Marquardt 
algorithm to solve it; however, both are meant for unconstrained problems. I 
also considered the Lagrange multiplier method, but the resulting system of 
gradient equations is nonlinear, which seems difficult to solve.
Which algorithm would be proper to apply here, and is there any open-source 
package like Breeze for it? Any comment or link will be helpful.
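For reference, the Lagrangian behind that remark (the non-negativity bounds
would additionally require KKT multipliers) can be written as:

\[
L(x, \lambda, \mu, \nu) = f(x) + \lambda \Big( \sum_{i=1}^{n} x_i - 1 \Big)
  + \mu \Big( \sum_{i=1}^{n} b_i x_i - b \Big)
  + \nu \Big( \sum_{i=1}^{n} c_i x_i^2 - c \Big),
\]

and the stationarity conditions
\[
\frac{\partial L}{\partial x_i} = \frac{\partial f}{\partial x_i}
  + \lambda + \mu b_i + 2 \nu c_i x_i = 0, \qquad i = 1, \dots, n,
\]
together with constraints (1)-(3) form exactly the nonlinear system mentioned above.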
Thanks a lot in advance!
Zhiliang