Re: spark-submit stuck and no output in console

2015-11-17 Thread Kayode Odeyemi
Sonal, SparkPi couldn't run either. It's stuck on the screen with no output.

hadoop-user@yks-hadoop-m01:/usr/local/spark$ ./bin/run-example SparkPi

On Tue, Nov 17, 2015 at 12:22 PM, Steve Loughran 
wrote:

> 48 hours is one of those kerberos warning times (as is 24h, 72h and 7
> days) 


Does this mean I need to restart the whole Hadoop YARN cluster to reset
Kerberos?




-- 
Odeyemi 'Kayode O.
http://ng.linkedin.com/in/kayodeodeyemi. t: @charyorde


Working with RDD from Java

2015-11-17 Thread frula00
Hi,
I'm working in Java with Spark 1.3.1. I am trying to extract data from the
RDD returned by
org.apache.spark.mllib.clustering.DistributedLDAModel.topicDistributions()
(return type is RDD>). How do I work with it from within Java? I can't seem
to cast it to JavaPairRDD or JavaRDD, and if I try to collect it, it simply
returns an Object.

Thank you for your help in advance!

Ivan



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Working-with-RDD-from-Java-tp25399.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Count of streams processed

2015-11-17 Thread Chandra Mohan, Ananda Vel Murugan
Hi,

Is it possible to have a running count of the number of Kafka messages processed
in a Spark Streaming application? Thanks

Regards,
Anand.C


Re: large, dense matrix multiplication

2015-11-17 Thread Eilidh Troup
Hi Burak,

That’s interesting. I’ll try and give it a go.

Eilidh

On 14 Nov 2015, at 04:19, Burak Yavuz  wrote:

> Hi,
> 
> The BlockMatrix multiplication should be much more efficient on the current 
> master (and will be available with Spark 1.6). Could you please give that a 
> try if you have the chance?
> 
> Thanks,
> Burak
> 
> On Fri, Nov 13, 2015 at 10:11 AM, Sabarish Sasidharan 
>  wrote:
> Hi Eilidh
> 
> Because you are multiplying with the transpose, you don't necessarily have to 
> build the right side of the matrix. I hope you see that. You can broadcast 
> blocks of the indexed row matrix to itself and achieve the multiplication.
> 
> But for similarity computation you might want to use some approach like 
> locality sensitive hashing first to identify a bunch of similar customers and 
> then apply cosine similarity on that narrowed down list. That would scale 
> much better than matrix multiplication. You could try the following options 
> for the same.
> 
> https://github.com/soundcloud/cosine-lsh-join-spark
> http://spark-packages.org/package/tdebatty/spark-knn-graphs
> https://github.com/marufaytekin/lsh-spark
> 
> Regards
> Sab
> 
> Hi Sab,
> 
> Thanks for your response. We’re thinking of trying a bigger cluster, because 
> we just started with 2 nodes. What we really want to know is whether the code 
> will scale up with larger matrices and more nodes. I’d be interested to hear 
> how large a matrix multiplication you managed to do?
> 
> Is there an alternative you’d recommend for calculating similarity over a 
> large dataset?
> 
> Thanks,
> Eilidh
> 
> On 13 Nov 2015, at 09:55, Sabarish Sasidharan 
>  wrote:
> 
>> We have done this by blocking but without using BlockMatrix. We used our own 
>> blocking mechanism because BlockMatrix didn't exist in Spark 1.2. What is 
>> the size of your block? How much memory are you giving to the executors? I 
>> assume you are running on YARN, if so you would want to make sure your yarn 
>> executor memory overhead is set to a higher value than default.
>> 
>> Just curious, could you also explain why you need matrix multiplication with 
>> transpose? Smells like similarity computation.
>> 
>> Regards
>> Sab
>> 
>> On Thu, Nov 12, 2015 at 7:27 PM, Eilidh Troup  wrote:
>> Hi,
>> 
>> I’m trying to multiply a large squarish matrix with its transpose. 
>> Eventually I’d like to work with matrices of size 200,000 by 500,000, but 
>> I’ve started off first with 100 by 100 which was fine, and then with 10,000 
>> by 10,000 which failed with an out of memory exception.
>> 
>> I used MLlib and BlockMatrix and tried various block sizes, and also tried 
>> switching disk serialisation on.
>> 
>> We are running on a small cluster, using a CSV file in HDFS as the input 
>> data.
>> 
>> Would anyone with experience of multiplying large, dense matrices in spark 
>> be able to comment on what to try to make this work?
>> 
>> Thanks,
>> Eilidh
>> 
>> 
>> --
>> The University of Edinburgh is a charitable body, registered in
>> Scotland, with registration number SC005336.
>> 
>> 
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>> 
>> 
>> 
>> 
>> -- 
>> 
>> Architect - Big Data
>> Ph: +91 99805 99458
>> 
>> Manthan Systems | Company of the year - Analytics (2014 Frost and Sullivan 
>> India ICT)
>> +++
> 
> 
> The University of Edinburgh is a charitable body, registered in
> Scotland, with registration number SC005336.
> 
> 

The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
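
For reference, a minimal Scala sketch of the approach Burak suggests above:
build a BlockMatrix from the CSV-backed rows and multiply it by its own
transpose. The path, block sizes and variable names are illustrative only.

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{BlockMatrix, IndexedRow, IndexedRowMatrix}

// Assumes an existing SparkContext `sc` and a CSV with one dense row per line.
val rows = sc.textFile("hdfs:///path/to/matrix.csv").zipWithIndex().map {
  case (line, idx) =>
    IndexedRow(idx, Vectors.dense(line.split(",").map(_.toDouble)))
}

// 1024x1024 blocks are just a starting point; tune against executor memory.
val A: BlockMatrix = new IndexedRowMatrix(rows).toBlockMatrix(1024, 1024).cache()

// A * A^T, the multiplication discussed in this thread.
val gram: BlockMatrix = A.multiply(A.transpose)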

Re: zeppelin (or spark-shell) with HBase fails on executor level

2015-11-17 Thread Ted Yu
I am a bit curious:
HBase depends on HDFS.
Has HDFS support for Mesos been fully implemented?

Last time I checked, there was still work to be done. 

Thanks

> On Nov 17, 2015, at 1:06 AM, 임정택  wrote:
> 
> Oh, one thing I missed is, I built Spark 1.4.1 Cluster with 6 nodes of Mesos 
> 0.22.1 H/A (via ZK) cluster.
> 
> 2015-11-17 18:01 GMT+09:00 임정택 :
>> Hi all,
>> 
>> I'm evaluating Zeppelin to run a driver which interacts with HBase.
>> I use a fat jar to include the HBase dependencies, and I see failures at the
>> executor level.
>> I thought it was a Zeppelin issue, but it fails on spark-shell, too.
>> 
>> I loaded fat jar via --jars option,
>>  
>> > ./bin/spark-shell --jars hbase-included-assembled.jar
>> 
>> and run the driver code using the provided SparkContext instance; failures 
>> show up in the spark-shell console and in the executor logs.
>> 
>> Below are the stack traces:
>> 
>> org.apache.spark.SparkException: Job aborted due to stage failure: Task 55 
>> in stage 0.0 failed 4 times, most recent failure: Lost task 55.3 in stage 
>> 0.0 (TID 281, ): java.lang.NoClassDefFoundError: Could not 
>> initialize class org.apache.hadoop.hbase.client.HConnectionManager
>> at org.apache.hadoop.hbase.client.HTable.(HTable.java:197)
>> at org.apache.hadoop.hbase.client.HTable.(HTable.java:159)
>> at 
>> org.apache.hadoop.hbase.mapreduce.TableInputFormat.setConf(TableInputFormat.java:101)
>> at 
>> org.apache.spark.rdd.NewHadoopRDD$$anon$1.(NewHadoopRDD.scala:128)
>> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:104)
>> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:66)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>> at 
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>> at 
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
>> at 
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>> at org.apache.spark.scheduler.Task.run(Task.scala:70)
>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>> at 
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> at 
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> at java.lang.Thread.run(Thread.java:745)
>> 
>> Driver stacktrace:
>> at 
>> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1273)
>> at 
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1264)
>> at 
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1263)
>> at 
>> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>> at 
>> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1263)
>> at 
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
>> at 
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
>> at scala.Option.foreach(Option.scala:236)
>> at 
>> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
>> at 
>> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1457)
>> at 
>> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1418)
>> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>> 
>> 15/11/16 18:59:57 ERROR Executor: Exception in task 14.0 in stage 0.0 (TID 
>> 14)
>> java.lang.ExceptionInInitializerError
>> at org.apache.hadoop.hbase.client.HTable.(HTable.java:197)
>> at org.apache.hadoop.hbase.client.HTable.(HTable.java:159)
>> at 
>> org.apache.hadoop.hbase.mapreduce.TableInputFormat.setConf(TableInputFormat.java:101)
>> at 
>> org.apache.spark.rdd.NewHadoopRDD$$anon$1.(NewHadoopRDD.scala:128)
>> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:104)
>> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:66)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>> at 
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>> at 
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
>> at 
>> 

Re: Mesos cluster dispatcher doesn't respect most args from the submit req

2015-11-17 Thread Iulian Dragoș
Hi Jo,

I agree that there's something fishy with the cluster dispatcher; I've seen
some issues like that.

I think it actually tries to send all properties as part of
`SPARK_EXECUTOR_OPTS`, which may not be everything that's needed:

https://github.com/jayv/spark/blob/mesos_cluster_params/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosClusterScheduler.scala#L375-L377

Can you please open a Jira ticket and also describe the symptoms? This
might be related, or the same issue: SPARK-11280
 and also SPARK-11327


thanks,
iulian




On Tue, Nov 17, 2015 at 2:46 AM, Jo Voordeckers 
wrote:

>
> Hi all,
>
> I'm running the mesos cluster dispatcher, however when I submit jobs with
> things like jvm args, classpath order and UI port aren't added to the
> commandline executed by the mesos scheduler. In fact it only cares about
> the class, jar and num cores/mem.
>
>
> https://github.com/jayv/spark/blob/mesos_cluster_params/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosClusterScheduler.scala#L412-L424
>
> I've made an attempt at adding a few of the args that I believe are useful
> to the MesosClusterScheduler class, which seems to solve my problem.
>
> Please have a look:
>
> https://github.com/apache/spark/pull/9752
>
> Thanks
>
> - Jo Voordeckers
>
>


-- 

--
Iulian Dragos

--
Reactive Apps on the JVM
www.typesafe.com


Re: Conf Settings in Mesos

2015-11-17 Thread Iulian Dragoș
Hi John,

I don't think this is specific to Mesos.

Note that `spark-defaults.conf` holds only defaults. Normally you'd pass your
specific options using `--conf`. Does that work?

iulian


On Thu, Nov 12, 2015 at 3:05 PM, John Omernik  wrote:

> Hey all,
>
> I noticed today that if I use a tgz as my URI for Mesos, I have to
> repackage it with my conf settings from wherever I execute, say, pyspark, for
> the executors to have the right configuration settings.
>
> That is...
>
> If I take a "stock" tgz from make-distribution.sh, unpack it, set the URI in
> spark-defaults to the unmodified tgz, change other settings in both
> spark-defaults.conf and spark-env.sh, and then run ./bin/pyspark from that
> unpacked directory, I would have thought that when the executor spun up, some
> sort of magic would happen where the conf directory or the conf settings
> would propagate out to the executors (thus making configuration changes
> easier to manage).
>
> For things to work, I had to unpack the tgz, change the conf settings, then
> repackage the tgz with all my conf settings and use that as the URI, then run
> it. Then it seemed to work.
>
> I have a workaround, but from a usability point of view it would be nice to
> have a tgz that is just binaries and that picks up the conf at run time. It
> would help with managing multiple configurations that use the same binaries
> (different models/apps etc.). Instead of having to repackage a tgz for each
> app, the conf would just propagate... am I looking at this wrong?
>
> John
>
>
>


-- 

--
Iulian Dragos

--
Reactive Apps on the JVM
www.typesafe.com


Re: spark-submit stuck and no output in console

2015-11-17 Thread Kayode Odeyemi
Our Hadoop NFS Gateway seems to have been malfunctioning.

I restarted it, and Spark jobs have now resumed successfully.

Problem solved.


Re: Parallelizing operations using Spark

2015-11-17 Thread PhuDuc Nguyen
You should try passing your solr writer into rdd.foreachPartition() for max
parallelism - each partition on each executor will execute the function
passed in.

HTH,
Duc

On Tue, Nov 17, 2015 at 7:36 AM, Susheel Kumar 
wrote:

> Any input/suggestions on parallelizing below operations using Spark over
> Java Thread pooling
> - reading of 100 thousands json files from local file system
> - processing each file content and submitting to Solr as Input document
>
> Thanks,
> Susheel
>
> On Mon, Nov 16, 2015 at 5:44 PM, Susheel Kumar 
> wrote:
>
>> Hello Spark Users,
>>
>> My first email to spark mailing list and looking forward. I have been
>> working on Solr and in the past have used Java thread pooling to
>> parallelize Solr indexing using SolrJ.
>>
>> Now i am again working on indexing data and this time from JSON files (in
>> 100 thousands) and before I try out parallelizing the operations using
>> Spark (reading each JSON file, post its content to Solr) I wanted to
>> confirm my understanding.
>>
>>
>> By reading json files using wholeTextFiles and then posting the content
>> to Solr
>>
>> - would be similar to what i will achieve using Java multi-threading /
>> thread pooling and using ExecutorFramework  and
>> - what additional other advantages i would get by using Spark (less
>> code...)
>> - How we can parallelize/batch this further? For e.g. In my Java
>> multi-threaded i not only parallelize the reading / data acquisition but
>> also posting in batches in parallel.
>>
>>
>> Below is the code snippet to give you an idea of what i am thinking to
>> start initially.  Please feel free to suggest/correct my understanding and
>> below code structure.
>>
>> SparkConf conf = new SparkConf().setAppName(appName).setMaster("local[8]"
>> );
>>
>> JavaSparkContext sc = new JavaSparkContext(conf);
>>
>> JavaPairRDD rdd = sc.wholeTextFiles("/../*.json");
>>
>> rdd.foreach(new VoidFunction>() {
>>
>>
>> @Override
>>
>> public void post(Tuple2 arg0) throws Exception {
>>
>> //post content to Solr
>>
>> arg0._2
>>
>> ...
>>
>> ...
>>
>> }
>>
>> });
>>
>>
>> Thanks,
>>
>> Susheel
>>
>
>
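
For what it's worth, a minimal Scala sketch of the foreachPartition pattern Duc
describes: one writer per partition, created on the executor, posting in
batches. SolrWriter here is a hypothetical stand-in for a real client such as
SolrJ, not an actual API.

import org.apache.spark.SparkContext

// Hypothetical stand-in for a real Solr client; replace the no-op bodies.
class SolrWriter(url: String) extends Serializable {
  def add(doc: String): Unit = { /* build a SolrInputDocument and buffer it */ }
  def commit(): Unit = { /* flush the buffered batch to Solr */ }
  def close(): Unit = { /* release the connection */ }
}

def indexAll(sc: SparkContext, path: String, solrUrl: String): Unit = {
  val files = sc.wholeTextFiles(path)          // (fileName, fileContent) pairs

  files.foreachPartition { iter =>
    val writer = new SolrWriter(solrUrl)       // one writer per partition, on the executor
    try {
      iter.grouped(500).foreach { batch =>     // post in batches, as in the Java version
        batch.foreach { case (_, content) => writer.add(content) }
        writer.commit()
      }
    } finally {
      writer.close()
    }
  }
}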


Re: ISDATE Function

2015-11-17 Thread Ted Yu
ISDATE() is currently not supported.
Since it is SQL Server specific, I guess it wouldn't be added to Spark.

On Mon, Nov 16, 2015 at 10:46 PM, Ravisankar Mani  wrote:

> Hi Everyone,
>
>
> In MS SQL Server, the "ISDATE()" function is used to find out whether
> column values are dates or not. Is it possible to achieve the same check on
> column values (date or not)?
>
>
> Regards,
> Ravi
>
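
Since there is no built-in ISDATE(), one workaround is to register a UDF that
checks whether a string parses as a date. This is a sketch only; the function
name and the expected date format are assumptions, not part of Spark.

import java.text.SimpleDateFormat
import org.apache.spark.sql.SQLContext

// Registers an "isdate" UDF on an existing SQLContext.
def registerIsDate(sqlContext: SQLContext): Unit = {
  sqlContext.udf.register("isdate", (s: String) => {
    if (s == null) false
    else {
      val fmt = new SimpleDateFormat("yyyy-MM-dd")   // assumed format
      fmt.setLenient(false)
      try { fmt.parse(s.trim); true }
      catch { case _: java.text.ParseException => false }
    }
  })
}

// Usage: sqlContext.sql("SELECT col, isdate(col) FROM mytable")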


Re: Spark Job is getting killed after certain hours

2015-11-17 Thread Nikhil Gs
Hello Everyone,

Firstly, thank you so much for the response. In our cluster we are using
Spark 1.3.0 and our cluster version is CDH 5.4.1. Yes, we are also using
Kerberos in our cluster, and the Kerberos version is 1.10.3.

The error "*GSS initiate failed [Caused by GSSException: No valid
credentials provided*" was occurring when we were trying to load data from a
Kafka topic to HBase using Spark classes and a spark-submit job.

My question is, we also have another project, named XXX, in our cluster
which is successfully deployed and running; the scenario for that project is
Flume + spark-submit + HBase table. That scenario works fine in our Kerberos
cluster, so why not Kafka topic + spark-submit + HBase table?

Are we doing anything wrong? We are not able to figure it out. Please advise.

Thanks in advance!

Regards,
Nik.

On Tue, Nov 17, 2015 at 4:03 AM, Steve Loughran 
wrote:

>
> On 17 Nov 2015, at 02:00, Nikhil Gs  wrote:
>
> Hello Team,
>
> Below is the error which we are facing in our cluster after 14 hours of
> starting the spark submit job. Not able to understand the issue and why its
> facing the below error after certain time.
>
> If any of you have faced the same scenario or if you have any idea then
> please guide us. To identify the issue, if you need any other info then
> please revert me back with the requirement.Thanks a lot in advance.
>
> *Log Error:  *
>
> 15/11/16 04:54:48 ERROR ipc.AbstractRpcClient: SASL authentication failed.
> The most likely cause is missing or invalid credentials. Consider 'kinit'.
>
> javax.security.sasl.SaslException: *GSS initiate failed [Caused by
> GSSException: No valid credentials provided (Mechanism level: Failed to
> find any Kerberos tgt)]*
>
>
> I keep my list of causes of error messages online:
> https://steveloughran.gitbooks.io/kerberos_and_hadoop/content/sections/errors.html
>
> Spark only support long-lived work on a kerberos cluster from 1.5+, with a
> keytab being supplied to the job. Without this, the yarn client grabs some
> tickets at launch time and hangs on until they expire, which for you is 14
> hours
>
> (For anyone using ticket-at-launch auth, know that Spark 1.5.0-1.5.2
> doesn't talk to Hive on a kerberized cluster; some reflection-related issues
> weren't picked up during testing. 1.5.3 will fix this.)
>


Re: Count of streams processed

2015-11-17 Thread Cody Koeninger
Sure, just call count on each RDD and track it in your driver however you
want.

If count is called directly on a KafkaRDD (e.g. createDirectStream, then
foreachRDD before doing any other transformations), it should just use
the beginning and ending offsets rather than doing any real work.

On Tue, Nov 17, 2015 at 4:48 AM, Chandra Mohan, Ananda Vel Murugan <
ananda.muru...@honeywell.com> wrote:

> HI,
>
>
>
> Is it possible to have a running count of number of kafka messages
> processed in a spark streaming application? Thanks
>
>
>
> Regards,
>
> Anand.C
>
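
A minimal Scala sketch of what Cody describes, using the direct stream and a
driver-side counter; broker and topic names are placeholders.

import java.util.concurrent.atomic.AtomicLong
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// Assumes an existing StreamingContext `ssc`.
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("mytopic"))

// Running total, kept on the driver. count() on a KafkaRDD is computed from the
// batch's beginning and ending offsets, so it is cheap.
val total = new AtomicLong(0L)

stream.foreachRDD { rdd =>
  val batchCount = rdd.count()
  println(s"$batchCount messages in this batch, ${total.addAndGet(batchCount)} total")
}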


Re: Parallelizing operations using Spark

2015-11-17 Thread Susheel Kumar
Any input/suggestions on parallelizing the below operations using Spark rather
than Java thread pooling?
- reading 100 thousand JSON files from the local file system
- processing each file's content and submitting it to Solr as an input document

Thanks,
Susheel

On Mon, Nov 16, 2015 at 5:44 PM, Susheel Kumar 
wrote:

> Hello Spark Users,
>
> My first email to spark mailing list and looking forward. I have been
> working on Solr and in the past have used Java thread pooling to
> parallelize Solr indexing using SolrJ.
>
> Now i am again working on indexing data and this time from JSON files (in
> 100 thousands) and before I try out parallelizing the operations using
> Spark (reading each JSON file, post its content to Solr) I wanted to
> confirm my understanding.
>
>
> By reading json files using wholeTextFiles and then posting the content to
> Solr
>
> - would be similar to what i will achieve using Java multi-threading /
> thread pooling and using ExecutorFramework  and
> - what additional other advantages i would get by using Spark (less
> code...)
> - How we can parallelize/batch this further? For e.g. In my Java
> multi-threaded i not only parallelize the reading / data acquisition but
> also posting in batches in parallel.
>
>
> Below is the code snippet to give you an idea of what i am thinking to
> start initially.  Please feel free to suggest/correct my understanding and
> below code structure.
>
> SparkConf conf = new SparkConf().setAppName(appName).setMaster("local[8]"
> );
>
> JavaSparkContext sc = new JavaSparkContext(conf);
>
> JavaPairRDD rdd = sc.wholeTextFiles("/../*.json");
>
> rdd.foreach(new VoidFunction>() {
>
>
> @Override
>
> public void post(Tuple2 arg0) throws Exception {
>
> //post content to Solr
>
> arg0._2
>
> ...
>
> ...
>
> }
>
> });
>
>
> Thanks,
>
> Susheel
>


Re: synchronizing streams of different kafka topics

2015-11-17 Thread Cody Koeninger
Are you using the direct stream? Each batch should contain all of the
unprocessed messages for each topic, unless you're doing some kind of rate
limiting.

On Tue, Nov 17, 2015 at 3:07 AM, Antony Mayi 
wrote:

> Hi,
>
> I have two streams coming from two different Kafka topics. The two topics
> contain time-related events but are quite asymmetric in volume. I obviously
> need to process them in sync to get the time-related events together, but
> with the same processing rate, if the heavier stream starts backlogging, the
> events from the smaller stream will arrive ahead of the relevant events that
> are still in the backlog of the heavy stream.
>
> Is there any way to get the smaller stream processed at a slower rate, so
> that the relevant events come through together with the heavy stream?
>
> Thanks,
> Antony.
>


Re: How to create nested structure from RDD

2015-11-17 Thread Dean Wampler
Crap. Hit send accidentally...

In pseudocode, assuming comma-separated input data:

scala> case class Address(street: String, city: String)
scala> case class User (name: String, address: Address)

scala> val df = sc.textFile("/path/to/stuff").
  map { line =>
val array = line.split(",")   // assume: "name,street,city"
User(array(0), Address(array(1), array(2)))
  }.toDF()

scala> df.printSchema
root
 |-- name: string (nullable = true)
 |-- address: struct (nullable = true)
 ||-- street: string (nullable = true)
 ||-- city: string (nullable = true)


Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
 (O'Reilly)
Typesafe 
@deanwampler 
http://polyglotprogramming.com

On Tue, Nov 17, 2015 at 11:16 AM, Dean Wampler 
wrote:

> One way to do it in the Scala API is to use a tuple or case class
> with nested tuples or case classes and/or primitives. It works fine if you
> convert to a DataFrame, too; you can reference nested elements using dot
> notation. I think it would work similarly in Python.
>
> In pseudocode, assuming comma-separated input data:
>
> case class Address(street: String, city: String)
> case class User (name: String, address: Address)
>
> sc.textFile("/path/to/stuff").
>   map { line =>
> line.split(0)
> dean
>
> Dean Wampler, Ph.D.
> Author: Programming Scala, 2nd Edition
>  (O'Reilly)
> Typesafe 
> @deanwampler 
> http://polyglotprogramming.com
>
> On Tue, Nov 17, 2015 at 11:06 AM, fhussonnois 
> wrote:
>
>> Hi,
>>
>> I need to convert an rdd of RDD[User] to a DataFrame containing a single
>> column named "user". The column "user" should be a nested struct with all
>> User properties.
>>
>> How can I implement this efficiently ?
>>
>> Thank you in advance
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-create-nested-structure-from-RDD-tp25401.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>
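
A small variation on the sketch above that yields the single nested column
named "user" asked about in the original question; it assumes an existing
SparkContext `sc` and SQLContext `sqlContext`, with illustrative data.

case class Address(street: String, city: String)
case class User(name: String, address: Address)
case class Record(user: User)        // the wrapper gives the single top-level column

import sqlContext.implicits._

val users = sc.parallelize(Seq(User("Ivan", Address("Main St", "Springfield"))))
val df = users.map(u => Record(u)).toDF()

df.printSchema()
// root
//  |-- user: struct (nullable = true), with nested name and address fields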


Re: how can evenly distribute my records in all partition

2015-11-17 Thread prateek arora
Hi
Thanks
I am new to Spark development, so could you provide some help with writing a
custom partitioner to achieve this?
If you have a link or an example of writing a custom partitioner, please
share it with me.

On Mon, Nov 16, 2015 at 6:13 PM, Sabarish Sasidharan <
sabarish.sasidha...@manthan.com> wrote:

> You can write your own custom partitioner to achieve this
>
> Regards
> Sab
> On 17-Nov-2015 1:11 am, "prateek arora" 
> wrote:
>
>> Hi
>>
>> I have a RDD with 30 record ( Key/value pair ) and running 30 executor . i
>> want to reparation this RDD in to 30 partition so every partition  get one
>> record and assigned to one executor .
>>
>> when i used rdd.repartition(30) its repartition my rdd in 30 partition but
>> some partition get 2 record , some get 1 record and some not getting any
>> record .
>>
>> is there any way in spark so i can evenly distribute my record in all
>> partition .
>>
>> Regards
>> Prateek
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/how-can-evenly-distribute-my-records-in-all-partition-tp25394.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>


How to create nested structure from RDD

2015-11-17 Thread fhussonnois
Hi,

I need to convert an rdd of RDD[User] to a DataFrame containing a single
column named "user". The column "user" should be a nested struct with all
User properties.

How can I implement this efficiently ?

Thank you in advance



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-create-nested-structure-from-RDD-tp25401.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How to create nested structure from RDD

2015-11-17 Thread Dean Wampler
One way to do it in the Scala API is to use a tuple or case class
with nested tuples or case classes and/or primitives. It works fine if you
convert to a DataFrame, too; you can reference nested elements using dot
notation. I think it would work similarly in Python.

In pseudocode, assuming comma-separated input data:

case class Address(street: String, city: String)
case class User (name: String, address: Address)

sc.textFile("/path/to/stuff").
  map { line =>
line.split(0)
dean

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
 (O'Reilly)
Typesafe 
@deanwampler 
http://polyglotprogramming.com

On Tue, Nov 17, 2015 at 11:06 AM, fhussonnois  wrote:

> Hi,
>
> I need to convert an rdd of RDD[User] to a DataFrame containing a single
> column named "user". The column "user" should be a nested struct with all
> User properties.
>
> How can I implement this efficiently ?
>
> Thank you in advance
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-create-nested-structure-from-RDD-tp25401.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: how can evenly distribute my records in all partition

2015-11-17 Thread Ted Yu
Please take a look at the following for example:

./core/src/main/scala/org/apache/spark/api/python/PythonPartitioner.scala
./core/src/main/scala/org/apache/spark/Partitioner.scala

Cheers

On Tue, Nov 17, 2015 at 9:24 AM, prateek arora 
wrote:

> Hi
> Thanks
> I am new in spark development so can you provide some help to write a
> custom partitioner to achieve this.
> if you have and link or example to write custom partitioner please
> provide to me.
>
> On Mon, Nov 16, 2015 at 6:13 PM, Sabarish Sasidharan <
> sabarish.sasidha...@manthan.com> wrote:
>
>> You can write your own custom partitioner to achieve this
>>
>> Regards
>> Sab
>> On 17-Nov-2015 1:11 am, "prateek arora" 
>> wrote:
>>
>>> Hi
>>>
>>> I have a RDD with 30 record ( Key/value pair ) and running 30 executor .
>>> i
>>> want to reparation this RDD in to 30 partition so every partition  get
>>> one
>>> record and assigned to one executor .
>>>
>>> when i used rdd.repartition(30) its repartition my rdd in 30 partition
>>> but
>>> some partition get 2 record , some get 1 record and some not getting any
>>> record .
>>>
>>> is there any way in spark so i can evenly distribute my record in all
>>> partition .
>>>
>>> Regards
>>> Prateek
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/how-can-evenly-distribute-my-records-in-all-partition-tp25394.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>


Re: Spark Job is getting killed after certain hours

2015-11-17 Thread Steve Loughran

On 17 Nov 2015, at 15:39, Nikhil Gs 
> wrote:

Hello Everyone,

Firstly, thank you so much for the response. In our cluster, we are using Spark 
1.3.0 and our cluster version is CDH 5.4.1. Yes, we are also using Kerbros in 
our cluster and the kerberos version is 1.10.3.

The error "GSS initiate failed [Caused by GSSException: No valid credentials 
provided" was occurring when we are trying to load data from kafka  topic to 
hbase by using Spark classes and spark submit job.

My question is, we also have an other project named as XXX in our cluster which 
is successfully deployed and its running and the scenario for that project is, 
flume + Spark submit + Hbase table. For this scenario, it works fine in our 
Kerberos cluster and why not for kafkatopic + Spark Submit + Hbase table.

Are we doing anything wrong? Not able to figure out? Please suggest us.


you are probably into kerberos debug mode. That's not something anyone enjoys 
(*)

There are some options you can turn up for logging in the JVM codebase

https://steveloughran.gitbooks.io/kerberos_and_hadoop/content/sections/secrets.html

then turn up the org.apache.hadoop.security log to log at DEBUG in the HBase 
server as well as your client code.


(*) This is why I recommend deploying kerberos apps at 09:00 on a tuesday. All 
the big timeout events will happen on tuesday in the morning, afternoon or 
evening, the 24h timeout on wed am, 72h on friday, and the 7 day one the 
following week. You don't want to be fielding support calls on a saturday 
evening as the application — or indeed the entire HDFS filesystem - deployed on 
a friday is failing one node at a time


Distinct on key-value pair of JavaRDD

2015-11-17 Thread Ramkumar V
Hi,

I have a JavaRDD. I would like to do distinct only on the key, but
the normal distinct applies to both key and value. I want to apply it only on
the key. How can I do that?

Any help is appreciated.

*Thanks*,



Re: spark-submit stuck and no output in console

2015-11-17 Thread Sonal Goyal
I would suggest a few things to try:

A. Try running the example program with the master set to local[*]. See if
Spark can run locally or not.
B. Check the Spark master and worker logs.
C. Check whether normal Hadoop jobs can be run properly on the cluster.
D. Check the Spark master web UI and see the health of the cluster.
On Nov 17, 2015 4:16 PM, "Kayode Odeyemi"  wrote:

> Sonal, SparkPi couldn't run as well. Stuck to the screen with no output
>
> hadoop-user@yks-hadoop-m01:/usr/local/spark$ ./bin/run-example SparkPi
>
> On Tue, Nov 17, 2015 at 12:22 PM, Steve Loughran 
> wrote:
>
>> 48 hours is one of those kerberos warning times (as is 24h, 72h and 7
>> days) 
>
>
> Does this mean I need to restart the whole Hadoop YARN cluster to reset
> kerberos?
>
>
>
>
> --
> Odeyemi 'Kayode O.
> http://ng.linkedin.com/in/kayodeodeyemi. t: @charyorde
>


Re: Mesos cluster dispatcher doesn't respect most args from the submit req

2015-11-17 Thread Jo Voordeckers
On Tue, Nov 17, 2015 at 5:16 AM, Iulian Dragoș 
wrote:

> I think it actually tries to send all properties as part of
> `SPARK_EXECUTOR_OPTS`, which may not be everything that's needed:
>
>
> https://github.com/jayv/spark/blob/mesos_cluster_params/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosClusterScheduler.scala#L375-L377
>
>
Aha, that's interesting. I overlooked that line. I'll debug some more today,
because I know for sure that those options didn't make it onto the
command line when I was running it in my debugger.


> Can you please open a Jira ticket and describe also the symptoms? This
> might be related, or the same issue: SPARK-11280
>  and also SPARK-11327
> 
>

SPARK-11327  is exactly
my problem, but I don't run docker.

 - Jo

On Tue, Nov 17, 2015 at 2:46 AM, Jo Voordeckers 
> wrote:
>
>>
>> Hi all,
>>
>> I'm running the mesos cluster dispatcher, however when I submit jobs with
>> things like jvm args, classpath order and UI port aren't added to the
>> commandline executed by the mesos scheduler. In fact it only cares about
>> the class, jar and num cores/mem.
>>
>>
>> https://github.com/jayv/spark/blob/mesos_cluster_params/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosClusterScheduler.scala#L412-L424
>>
>> I've made an attempt at adding a few of the args that I believe are
>> useful to the MesosClusterScheduler class, which seems to solve my problem.
>>
>> Please have a look:
>>
>> https://github.com/apache/spark/pull/9752
>>
>> Thanks
>>
>> - Jo Voordeckers
>>
>>
>
>
> --
>
> --
> Iulian Dragos
>
> --
> Reactive Apps on the JVM
> www.typesafe.com
>
>


Issue while Spark Job fetching data from Cassandra DB

2015-11-17 Thread satish chandra j
Hi All,
I am getting an "*.UnauthorizedException: User  has no SELECT
permission on  or any of its parents*" error
while the Spark job is fetching data from Cassandra, but I am able to save data
into Cassandra without any issues.

Note: With the same user , I am able to access and query the
table in the CQL UI, and the code used in the Spark job has been tested in the
Spark shell and works fine.

Please find the below stack trace

WARN  2015-11-17 07:24:23 org.apache.spark.scheduler.DAGScheduler: Creating
new stage failed due to exception - job: 0

com.datastax.driver.core.exceptions.UnauthorizedException: User <UserIDXYZ>
has no SELECT permission on  or any of its parents

at
com.datastax.driver.core.exceptions.UnauthorizedException.copy(UnauthorizedException.java:36)
~[cassandra-driver-core-2.1.7.1.jar:na]

at
com.datastax.driver.core.DefaultResultSetFuture.extractCauseFromExecutionException(DefaultResultSetFuture.java:269)
~[cassandra-driver-core-2.1.7.1.jar:na]

at
com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:183)
~[cassandra-driver-core-2.1.7.1.jar:na]

at
com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:52)
~[cassandra-driver-core-2.1.7.1.jar:na]

at
com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:44)
~[cassandra-driver-core-2.1.7.1.jar:na]

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
~[na:1.8.0_51]

at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
~[na:1.8.0_51]

at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
~[na:1.8.0_51]

at java.lang.reflect.Method.invoke(Method.java:497) ~[na:1.8.0_51]

at
com.datastax.spark.connector.cql.SessionProxy.invoke(SessionProxy.scala:33)
~[spark-cassandra-connector_2.10-1.4.0.jar:1.4.0]

at com.sun.proxy.$Proxy10.execute(Unknown Source) ~[na:na]

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
~[na:1.8.0_51]

at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
~[na:1.8.0_51]

at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
~[na:1.8.0_51]

   at java.lang.reflect.Method.invoke(Method.java:497) ~[na:1.8.0_51]

at
com.datastax.spark.connector.cql.SessionProxy.invoke(SessionProxy.scala:33)
~[spark-cassandra-connector_2.10-1.4.0.jar:1.4.0]

at com.sun.proxy.$Proxy10.execute(Unknown Source) ~[na:na]

at
com.datastax.spark.connector.rdd.partitioner.DataSizeEstimates$$anonfun$tokenRanges$1.apply(DataSizeEstimates.scala:40)
~[spark-cassandra-connector_2.10-1.4.0.jar:1.4.0]

at
com.datastax.spark.connector.rdd.partitioner.DataSizeEstimates$$anonfun$tokenRanges$1.apply(DataSizeEstimates.scala:38)
~[spark-cassandra-connector_2.10-1.4.0.jar:1.4.0]

at
com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:110)
~[spark-cassandra-connector_2.10-1.4.0.jar:1.4.0]

at
com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:109)
~[spark-cassandra-connector_2.10-1.4.0.jar:1.4.0]

at
com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:139)
~[spark-cassandra-connector_2.10-1.4.0.jar:1.4.0]

at
com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:109)
~[spark-cassandra-connector_2.10-1.4.0.jar:1.4.0]

at
com.datastax.spark.connector.rdd.partitioner.DataSizeEstimates.tokenRanges$lzycompute(DataSizeEstimates.scala:38)
~[spark-cassandra-connector_2.10-1.4.0.jar:1.4.0]

at
com.datastax.spark.connector.rdd.partitioner.DataSizeEstimates.tokenRanges(DataSizeEstimates.scala:37)
~[spark-cassandra-connector_2.10-1.4.0.jar:1.4.0]

at
com.datastax.spark.connector.rdd.partitioner.DataSizeEstimates.dataSizeInBytes$lzycompute(DataSizeEstimates.scala:81)
~[spark-cassandra-connector_2.10-1.4.0.jar:1.4.0]

at
com.datastax.spark.connector.rdd.partitioner.DataSizeEstimates.dataSizeInBytes(DataSizeEstimates.scala:80)
~[spark-cassandra-connector_2.10-1.4.0.jar:1.4.0]

at
com.datastax.spark.connector.rdd.partitioner.CassandraRDDPartitioner.(CassandraRDDPartitioner.scala:39)
~[spark-cassandra-connector_2.10-1.4.0.jar:1.4.0]

at
com.datastax.spark.connector.rdd.partitioner.CassandraRDDPartitioner$.apply(CassandraRDDPartitioner.scala:176)
~[spark-cassandra-connector_2.10-1.4.0.jar:1.4.0]

at
com.datastax.spark.connector.rdd.CassandraTableScanRDD.getPartitions(CassandraTableScanRDD.scala:144)
~[spark-cassandra-connector_2.10-1.4.0.jar:1.4.0]

at
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
~[spark-core_2.10-1.4.1.1.jar:1.4.1.1]

at
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)

Re: [SPARK STREAMING] Questions regarding foreachPartition

2015-11-17 Thread Nipun Arora
Thanks Cody, that's what I thought.
Currently in the cases where I want global ordering, I am doing a collect()
call and going through everything in the client.
I wonder if there is a better way to do globally ordered execution across
micro-batches.


I am having some trouble with acquiring resources and letting them go after
the iterator in Java.
It might have to do with my resource allocator itself. I will investigate
further and get back to you.

Thanks
Nipun


On Mon, Nov 16, 2015 at 5:11 PM Cody Koeninger  wrote:

> Ordering would be on a per-partition basis, not global ordering.
>
> You typically want to acquire resources inside the foreachpartition
> closure, just before handling the iterator.
>
>
> http://spark.apache.org/docs/latest/streaming-programming-guide.html#design-patterns-for-using-foreachrdd
>
> On Mon, Nov 16, 2015 at 4:02 PM, Nipun Arora 
> wrote:
>
>> Hi,
>> I wanted to understand forEachPartition logic. In the code below, I am
>> assuming the iterator is executing in a distributed fashion.
>>
>> 1. Assuming I have a stream which has timestamp data which is sorted.
>> Will the stringiterator in foreachPartition process each line in order?
>>
>> 2. Assuming I have a static pool of Kafka connections, where should I get
>> a connection from a pool to be used to send data to Kafka?
>>
>> addMTSUnmatched.foreachRDD(
>> new Function() {
>> @Override
>> public Void call(JavaRDD stringJavaRDD) throws Exception 
>> {
>> stringJavaRDD.foreachPartition(
>>
>> new VoidFunction() {
>> @Override
>> public void call(Iterator 
>> stringIterator) throws Exception {
>> while(stringIterator.hasNext()){
>> String str = stringIterator.next();
>> if(OnlineUtils.ESFlag) {
>> OnlineUtils.printToFile(str, 1, 
>> type1_outputFile, OnlineUtils.client);
>> }else{
>> OnlineUtils.printToFile(str, 1, 
>> type1_outputFile);
>> }
>> }
>> }
>> }
>> );
>> return null;
>> }
>> }
>> );
>>
>>
>>
>> Thanks
>>
>> Nipun
>>
>>
>
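
A minimal Scala sketch of the pattern Cody describes: borrow a connection from
a pool inside foreachPartition, just before walking the iterator, and return it
afterwards. The Producer and ProducerPool types are hypothetical stand-ins for
your own pooling code.

import org.apache.spark.streaming.dstream.DStream

// Hypothetical stand-ins; replace with a real Kafka producer and pool.
class Producer { def send(topic: String, msg: String): Unit = () }
object ProducerPool {
  def borrow(): Producer = new Producer()
  def release(p: Producer): Unit = ()
}

def publish(stream: DStream[String], topic: String): Unit = {
  stream.foreachRDD { rdd =>
    rdd.foreachPartition { records =>
      val producer = ProducerPool.borrow()       // acquired on the executor, per partition
      try {
        records.foreach(producer.send(topic, _)) // ordering is per-partition only
      } finally {
        ProducerPool.release(producer)
      }
    }
  }
}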


Re: WARN LoadSnappy: Snappy native library not loaded

2015-11-17 Thread Andy Davidson

On my master

grep native /root/spark/conf/spark-env.sh

SPARK_SUBMIT_LIBRARY_PATH="$SPARK_SUBMIT_LIBRARY_PATH:/root/ephemeral-hdfs/l
ib/native/"



$ ls /root/ephemeral-hdfs/lib/native/

libhadoop.a   libhadoop.solibhadooputils.a  libsnappy.so
libsnappy.so.1.1.3  Linux-i386-32

libhadooppipes.a  libhadoop.so.1.0.0  libhdfs.a libsnappy.so.1
Linux-amd64-64


From:  Andrew Davidson 
Date:  Tuesday, November 17, 2015 at 2:29 PM
To:  "user @spark" 
Subject:  Re: WARN LoadSnappy: Snappy native library not loaded

> I forgot to mention. I am using spark-1.5.1-bin-hadoop2.6
> 
> From:  Andrew Davidson 
> Date:  Tuesday, November 17, 2015 at 2:26 PM
> To:  "user @spark" 
> Subject:  Re: WARN LoadSnappy: Snappy native library not loaded
> 
>> FYI
>> 
>> After 17 min. only 26112/228155 have succeeded
>> 
>> This seems very slow
>> 
>> Kind regards
>> 
>> Andy
>> 
>> 
>> 
>> From:  Andrew Davidson 
>> Date:  Tuesday, November 17, 2015 at 2:22 PM
>> To:  "user @spark" 
>> Subject:  WARN LoadSnappy: Snappy native library not loaded
>> 
>> 
>>> I started a spark POC. I created a ec2 cluster on AWS using spark-c2. I
>>> have 3 slaves. In general I am running into trouble even with small work
>>> loads. I am using IPython notebooks running on my spark cluster.
>>> Everything is painfully slow. I am using the standAlone cluster manager.
>>> I noticed that I am getting the following warning on my driver console.
>>> Any idea what the problem might be?
>>> 
>>> 
>>> 
>>> 15/11/17 22:01:59 WARN MetricsSystem: Using default name DAGScheduler for
>>> source because spark.app.id is not set.
>>> 15/11/17 22:03:05 WARN NativeCodeLoader: Unable to load native-hadoop
>>> library for your platform... using builtin-java classes where applicable
>>> 15/11/17 22:03:05 WARN LoadSnappy: Snappy native library not loaded
>>> 
>>> 
>>> 
>>> Here is an overview of my POC app. I have a file on HDFS containing about
>>> 5000 twitter status strings.
>>> 
>>> tweetStrings = sc.textFile(dataURL)
>>> 
>>> jTweets = (tweetStrings.map(lambda x: json.loads(x)).take(10))
>>> 
>>> 
>>> Generated the following error: "error occurred while calling
>>> o78.partitions.: java.lang.OutOfMemoryError: GC overhead limit exceeded"
>>> 
>>> Any idea what we need to do to improve new spark user's out of the box
>>> experience?
>>> 
>>> Kind regards
>>> 
>>> Andy
>>> 
>>> export PYSPARK_PYTHON=python3.4
>>> export PYSPARK_DRIVER_PYTHON=python3.4
>>> export IPYTHON_OPTS="notebook --no-browser --port=7000 --log-level=WARN"
>>> 
>>> MASTER_URL=spark://ec2-55-218-207-122.us-west-1.compute.amazonaws.com:7077
>>> 
>>> 
>>> numCores=2
>>> $SPARK_ROOT/bin/pyspark --master $MASTER_URL --total-executor-cores
>>> $numCores $*




pyspark ML pipeline with shared data

2015-11-17 Thread Dominik Dahlem
Hi all,

I'm working on a pipeline for collaborative filtering. Taking the movielens
example, I have a data frame with the columns 'userID', 'movieID', and
'rating'. I would like to transform the ratings before calling ALS and
denormalise after. I implemented two transformers to do this, but I'm
wondering whether there is a better way than using a global variable to
hold the rowMeans of the utility matrix to share the normalisation vector
across both transformers? The pipeline gets more complicated if the
normalisation is done over columns and clustering of the userIDs is used
prior to calling ALS. Any help is greatly appreciated.

Thank you,
Dominik


The code snippets are as follows below. I'm using the pyspark.ml APIs:

# global variable
rowMeans = None

# Transformers
class Normaliser(Transformer, HasInputCol, HasOutputCol):
    @keyword_only
    def __init__(self, inputCol='rating', outputCol='normalised'):
        super(Normaliser, self).__init__()
        kwargs = self.__init__._input_kwargs
        self.setParams(**kwargs)

    @keyword_only
    def setParams(self, inputCol=None, outputCol=None):
        kwargs = self.setParams._input_kwargs
        return self._set(**kwargs)

    def _transform(self, df):
        global rowMeans
        rowMeans = df.groupBy('userID') \
                     .agg(F.mean(self.inputCol).alias('mean'))
        return df.join(rowMeans, 'userID') \
                 .select(df['*'],
                         (df[self.inputCol] - rowMeans['mean']).alias(self.outputCol))


class DeNormaliser(Transformer, HasInputCol, HasOutputCol):
    @keyword_only
    def __init__(self, inputCol='normalised', outputCol='rating'):
        super(DeNormaliser, self).__init__()
        kwargs = self.__init__._input_kwargs
        self.setParams(**kwargs)

    @keyword_only
    def setParams(self, inputCol=None, outputCol=None):
        kwargs = self.setParams._input_kwargs
        return self._set(**kwargs)

    def _transform(self, df):
        return df.join(rowMeans, 'userID') \
                 .select(df['*'],
                         (df[self.getInputCol()] + rowMeans['mean']).alias(self.getOutputCol()))


# setting up the ML pipeline
rowNormaliser = Normaliser(inputCol='rating', outputCol='rowNorm')
als = ALS(userCol='userID', itemCol='movieID', ratingCol='rowNorm')
rowDeNormaliser = DeNormaliser(inputCol='prediction',
                               outputCol='denormPrediction')

pipeline = Pipeline(stages=[rowNormaliser, als, rowDeNormaliser])
evaluator = RegressionEvaluator(predictionCol='denormPrediction',
                                labelCol='rating')


Re: how can evenly distribute my records in all partition

2015-11-17 Thread prateek arora
Hi
I am trying to implement a custom partitioner using this link:
http://stackoverflow.com/questions/23127329/how-to-define-custom-partitioner-for-spark-rdds-of-equally-sized-partition-where
(in the linked example the key values run from 0 to (noOfElement - 1))

but I am not able to understand how to implement a custom partitioner in my case:

my parent RDD has 4 partitions, the RDD key is a timestamp, and the value is a
JPEG byte array.


Regards
Prateek


On Tue, Nov 17, 2015 at 9:28 AM, Ted Yu  wrote:

> Please take a look at the following for example:
>
> ./core/src/main/scala/org/apache/spark/api/python/PythonPartitioner.scala
> ./core/src/main/scala/org/apache/spark/Partitioner.scala
>
> Cheers
>
> On Tue, Nov 17, 2015 at 9:24 AM, prateek arora  > wrote:
>
>> Hi
>> Thanks
>> I am new in spark development so can you provide some help to write a
>> custom partitioner to achieve this.
>> if you have and link or example to write custom partitioner please
>> provide to me.
>>
>> On Mon, Nov 16, 2015 at 6:13 PM, Sabarish Sasidharan <
>> sabarish.sasidha...@manthan.com> wrote:
>>
>>> You can write your own custom partitioner to achieve this
>>>
>>> Regards
>>> Sab
>>> On 17-Nov-2015 1:11 am, "prateek arora" 
>>> wrote:
>>>
 Hi

 I have a RDD with 30 record ( Key/value pair ) and running 30 executor
 . i
 want to reparation this RDD in to 30 partition so every partition  get
 one
 record and assigned to one executor .

 when i used rdd.repartition(30) its repartition my rdd in 30 partition
 but
 some partition get 2 record , some get 1 record and some not getting any
 record .

 is there any way in spark so i can evenly distribute my record in all
 partition .

 Regards
 Prateek



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/how-can-evenly-distribute-my-records-in-all-partition-tp25394.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


>>
>
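
One way to sketch this in Scala for arbitrary keys such as timestamps (an
illustration, not the only approach): key each record by its position from
zipWithIndex and partition on that index, so each of the 30 records lands in
its own partition. The names and types below mirror the question and are
assumptions.

import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD

// Sends key i to partition i; valid when keys are 0..numPartitions-1.
class ExactPartitioner(override val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int =
    (key.asInstanceOf[Long] % numPartitions).toInt
}

// `records` stands for the (timestamp, jpegBytes) RDD from the question.
def spreadEvenly(records: RDD[(Long, Array[Byte])], parts: Int): RDD[(Long, Array[Byte])] = {
  records
    .zipWithIndex()                        // ((timestamp, bytes), 0-based position)
    .map { case (kv, idx) => (idx, kv) }   // key by position instead of timestamp
    .partitionBy(new ExactPartitioner(parts))
    .values                                // back to (timestamp, bytes)
}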


Re: Mesos cluster dispatcher doesn't respect most args from the submit req

2015-11-17 Thread Jo Voordeckers
Hi Tim,

I've done more forensics on this bug, see my comment here:

https://issues.apache.org/jira/browse/SPARK-11327?focusedCommentId=15009843=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15009843


- Jo Voordeckers


On Tue, Nov 17, 2015 at 12:01 PM, Timothy Chen  wrote:

> Hi Jo,
>
> Thanks for the links, I would expected the properties to be in
> scheduler properties but I need to double check.
>
> I'll be looking into these problems this week.
>
> Tim
>
> On Tue, Nov 17, 2015 at 10:28 AM, Jo Voordeckers
>  wrote:
> > On Tue, Nov 17, 2015 at 5:16 AM, Iulian Dragoș <
> iulian.dra...@typesafe.com>
> > wrote:
> >>
> >> I think it actually tries to send all properties as part of
> >> `SPARK_EXECUTOR_OPTS`, which may not be everything that's needed:
> >>
> >>
> >>
> https://github.com/jayv/spark/blob/mesos_cluster_params/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosClusterScheduler.scala#L375-L377
> >>
> >
> > Aha that's interesting, I overlooked that line, I'll debug some more
> today
> > because I know for sure that those options don't make it onto the
> > commandline when I was running it in my debugger.
> >
> >>
> >> Can you please open a Jira ticket and describe also the symptoms? This
> >> might be related, or the same issue: SPARK-11280 and also SPARK-11327
> >
> >
> > SPARK-11327 is exactly my problem, but I don't run docker.
> >
> >  - Jo
> >
> >> On Tue, Nov 17, 2015 at 2:46 AM, Jo Voordeckers <
> jo.voordeck...@gmail.com>
> >> wrote:
> >>>
> >>>
> >>> Hi all,
> >>>
> >>> I'm running the mesos cluster dispatcher, however when I submit jobs
> with
> >>> things like jvm args, classpath order and UI port aren't added to the
> >>> commandline executed by the mesos scheduler. In fact it only cares
> about the
> >>> class, jar and num cores/mem.
> >>>
> >>>
> >>>
> https://github.com/jayv/spark/blob/mesos_cluster_params/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosClusterScheduler.scala#L412-L424
> >>>
> >>> I've made an attempt at adding a few of the args that I believe are
> >>> useful to the MesosClusterScheduler class, which seems to solve my
> problem.
> >>>
> >>> Please have a look:
> >>>
> >>> https://github.com/apache/spark/pull/9752
> >>>
> >>> Thanks
> >>>
> >>> - Jo Voordeckers
> >>>
> >>
> >>
> >>
> >> --
> >>
> >> --
> >> Iulian Dragos
> >>
> >> --
> >> Reactive Apps on the JVM
> >> www.typesafe.com
> >>
> >
>


Re: Working with RDD from Java

2015-11-17 Thread Sabarish Sasidharan
You can also do rdd.toJavaRDD(). Pls check the API docs

Regards
Sab
On 18-Nov-2015 3:12 am, "Bryan Cutler"  wrote:

> Hi Ivan,
>
> Since Spark 1.4.1 there is a Java-friendly function in LDAModel to get the
> topic distributions called javaTopicDistributions() that returns a
> JavaPairRDD.  If you aren't able to upgrade, you can check out the
> conversion used here
> https://github.com/apache/spark/blob/v1.4.1/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala#L350
>
> -bryan
>
> On Tue, Nov 17, 2015 at 3:06 AM, frula00 
> wrote:
>
>> Hi,
>> I'm working in Java, with Spark 1.3.1 - I am trying to extract data from
>> the
>> RDD returned by
>> org.apache.spark.mllib.clustering.DistributedLDAModel.topicDistributions()
>> (return type is RDD>). How do I work with it
>> from
>> within Java, I can't seem to cast it to JavaPairRDD nor JavaRDD and if I
>> try
>> to collect it it simply returns an Object?
>>
>> Thank you for your help in advance!
>>
>> Ivan
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Working-with-RDD-from-Java-tp25399.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>
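
For anyone who cannot upgrade, the conversion Bryan links to essentially wraps
the Scala RDD in a JavaPairRDD. A rough Scala sketch of the same idea; the
method name is ours, and the cast mirrors what javaTopicDistributions() does in
1.4.1+.

import org.apache.spark.api.java.JavaPairRDD
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// `dist` is the RDD[(Long, Vector)] returned by topicDistributions().
def toJavaTopicDistributions(dist: RDD[(Long, Vector)]): JavaPairRDD[java.lang.Long, Vector] =
  JavaPairRDD.fromRDD(dist.asInstanceOf[RDD[(java.lang.Long, Vector)]])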


RE: Any way to get raw score from MultilayerPerceptronClassificationModel ?

2015-11-17 Thread Ulanov, Alexander
Hi Robert,

Raw scores are not available through the public API. It would be great to add 
this feature; it seems that we overlooked it. 

The simple way to access the raw predictions currently would be to create a 
wrapper for mlpModel. This wrapper should be defined in the [ml] package. One 
needs to get the weights from MultilayerPerceptronClassificationModel to 
instantiate this wrapper. 

The better way would be to write a new implementation of the MLP model that 
extends Classifier (instead of Predictor). It would be a total copy of the 
current MultilayerPerceptronClassificationModel except that one would need to 
implement "predictRaw" and "raw2prediction". Eventually, it might replace the 
current implementation.

Best regards, Alexander

-Original Message-
From: Robert Dodier [mailto:robert.dod...@gmail.com] 
Sent: Tuesday, November 17, 2015 1:38 PM
To: user@spark.apache.org
Subject: Any way to get raw score from MultilayerPerceptronClassificationModel ?

Hi,

I'd like to get the raw prediction score from a
MultilayerPerceptronClassificationModel. It appears that the 'predict'
method only returns the argmax, i.e. the index of the largest score in the
output layer (line 200 in MultilayerPerceptronClassificationModel.scala in
Spark 1.5.2).

Is there any way to get the raw score? It is computed as
mlpModel.predict(features) in the source code. It appears that I can't access 
mlpModel since it is private in MultilayerPerceptronClassificationModel so I 
couldn't just grab mlpModel and call predict on it. Is there another way?

Thanks for any light you can shed on this question.

Robert Dodier

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional 
commands, e-mail: user-h...@spark.apache.org



Re: Any way to get raw score from MultilayerPerceptronClassificationModel ?

2015-11-17 Thread Robert Dodier
On Tue, Nov 17, 2015 at 2:36 PM, Ulanov, Alexander
 wrote:

> Raw scores are not available through the public API.
> It would be great to add this feature, it seems that we overlooked it.

OK, thanks for the info.

> The better way would be to write a new implementation of MLP
> model that will extend Classifier (instead of Predictor).

Actually I think it would be better still to extend ProbabilisticClassifier,
since MLP classifier outputs are posterior class probabilities.

best,

Robert Dodier

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark LogisticRegression returns scaled coefficients

2015-11-17 Thread njoshi
I am testing LogisticRegression performance on synthetically generated
data. The weights I have as input are

   w = [2, 3, 4]

with no intercept and three features. After training on 1000 synthetically
generated data points, assuming a random normal distribution for each, the
Spark LogisticRegression model I obtain has the weights

 [6.005520656096823,9.35980263762698,12.203400879214152]

I can see that each weight is scaled by a factor close to 3 w.r.t. the
original values. I am unable to guess the reason behind this. The code is
simple enough:


/*
 * Logistic Regression model
 */
val lr = new LogisticRegression()
  .setMaxIter(50)
  .setRegParam(0.001)
  .setElasticNetParam(0.95)
  .setFitIntercept(false)

val lrModel = lr.fit(trainingData)


println(s"${lrModel.weights}")



I would greatly appreciate if someone could shed some light on what's fishy
here.

with kind regards, Nikhil




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-LogisticRegression-returns-scaled-coefficients-tp25405.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Is there a way to delete task history besides using a ttl?

2015-11-17 Thread Jonathan Coveney
So I have the following...

broadcast some stuff
cache an RDD
do a bunch of stuff, eventually calling actions which reduce it to an
acceptable size

I'm getting an OOM on the driver (well, GC is getting out of control),
largely because I have a lot of partitions and it looks like the job
history is getting too large. The ttl is an option, but the downside is that
it will also delete the RDD... this isn't really what I want. What I want is
to keep my in-memory data structures (the RDD, broadcast variable, etc.) but
get rid of the old metadata that I don't need anymore (i.e. tasks that have
executed).

Is there a way to achieve this?


Re: kafka streaminf 1.5.2 - ClassNotFoundException: org.apache.spark.streaming.kafka.KafkaReceiver

2015-11-17 Thread Tathagata Das
Are you creating a fat assembly jar with spark-streaming-kafka included and
using that to run your code? If yes, I am not sure why it is not finding
it. If not, then make sure that your framework places the
spark-streaming-kafka jar in the runtime classpath.
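For reference, a minimal sbt sketch of the packaging described above (artifact names assume
Spark 1.5.2 and Scala 2.10; adapt to your own build tool): spark-streaming-kafka is not bundled
in the cluster's Spark assembly, so it must go into the application's fat jar, while spark-core
and spark-streaming can stay provided.

// build.sbt sketch (assumes sbt-assembly): only spark-streaming-kafka is bundled,
// the rest is provided by the cluster at runtime.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"            % "1.5.2" % "provided",
  "org.apache.spark" %% "spark-streaming"       % "1.5.2" % "provided",
  "org.apache.spark" %% "spark-streaming-kafka" % "1.5.2"
)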

On Tue, Nov 17, 2015 at 6:04 PM, tim_b123  wrote:

>
> Hi,
>
> I have a spark kafka streaming application that works when I run with a
> local spark context, but not with a remote one.
>
> My code consists of:
> 1.  A spring-boot application that creates the context
> 2.  A shaded jar file containing all of my spark code
>
> On my pc (windows), I have a spark (1.5.2) master and worker running
> (spark-1.5.2-bin-hadoop2.6).
>
> The entry point for my application is the start() method.
>
> The code is:
>
>   @throws(classOf[Exception])
>   def start {
> val ssc: StreamingContext = createStreamingContext
>
> val messagesRDD = createKafkaDStream(ssc, "myTopic", 2)
>
> def datasRDD = messagesRDD.map((line : String) =>
> MapFunctions.lineToSparkEventData(line))
>
> def count = datasRDD.count()
>datasRDD.print(1)
>
> ssc.start
> ssc.awaitTermination
>   }
>
>
>   private def createStreamingContext: StreamingContext = {
> System.setProperty(HADOOP_HOME_DIR, configContainer.getHadoopHomeDir)
> System.setProperty("spark.streaming.concurrentJobs",
> String.valueOf(configContainer.getStreamingConcurrentJobs))
> def sparkConf = createSparkConf()
>
> val ssc = new StreamingContext(sparkConf,
> Durations.seconds(configContainer.getStreamingContextDurationSeconds))
> ssc.sparkContext.setJobGroup("main_streaming_job", "streaming context
> start")
> ssc.sparkContext.setLocalProperty("spark.scheduler.pool",
> "real_time_pool")
> ssc
>   }
>
>
>   private def createSparkConf() : SparkConf = {
>   def masterString = "spark://<>:7077"
> def conf = new
> SparkConf().setMaster(masterString).setAppName("devAppRem")  // This is not
> working
> //def conf = new
> SparkConf().setMaster("local[4]").setAppName("devAppLocal")  // This IS
> working
>
> conf.set("spark.scheduler.allocation.file",
> "D:\\valid_path_to\\fairscheduler.xml");
>
>
> val pathToShadedApplicationJar: String =
> configContainer.getApplicationJarPaths.get(0)
> val jars: Array[String] = Array[String](pathToShadedApplicationJar)
> conf.setJars(jars)
>
> conf.set("spark.scheduler.mode", "FAIR")
>
>   }
>
>
>   private def createKafkaDStream(ssc: StreamingContext, topics: String,
> numThreads: Int): DStream[String] = {
> val zkQuorum: String = configContainer.getZkQuorum
> val groupId: String = configContainer.getGroupId
> val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
> def lines = KafkaUtils.createStream(ssc, zkQuorum, groupId, topicMap,
> StorageLevel.MEMORY_AND_DISK_SER).map(_._2)
> lines
>   }
> }
>
>
> The Error that I get is:
>
> 2015-11-18 12:58:33.755  WARN 6132 --- [result-getter-2]
> o.apache.spark.scheduler.TaskSetManager  : Lost task 0.0 in stage 2.0 (TID
> 70, 169.254.95.56): java.io.IOException: java.lang.ClassNotFoundException:
> org.apache.spark.streaming.kafka.KafkaReceiver
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1163)
> at
>
> org.apache.spark.rdd.ParallelCollectionPartition.readObject(ParallelCollectionRDD.scala:70)
> at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
> at
>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
> at
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
> at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
> at
>
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:72)
> at
>
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:98)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194)
> at
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ClassNotFoundException:
> 

Spark build error

2015-11-17 Thread 金国栋
Hi!

I tried to build the Spark source code from GitHub, and I successfully built it
from the command line using `*sbt/sbt assembly*`. However, I encountered an error
when compiling the project in IntelliJ IDEA (v14.1.5).


The error log is below:
*Error:scala: *
* while compiling:
/Users/ray/Documents/P01_Project/Spark-Github/spark/sql/core/src/main/scala/org/apache/spark/sql/util/QueryExecutionListener.scala*
*during phase: jvm*
 library version: version 2.10.5
compiler version: version 2.10.5
  reconstructed args: -nobootcp -javabootclasspath : -deprecation -feature
-classpath

RE: TightVNC - Application Monitor (right pane)

2015-11-17 Thread Tim Barthram
Hi,

I have a spark kafka streaming application that works when I run with a local 
spark context, but not with a remote one.

My code consists of:
1.  A spring-boot application that creates the context
2.  A shaded jar file containing all of my spark code

On my pc (windows), I have a spark (1.5.1) master and worker running.

The entry point for my application is the start() method.

The code is:

  @throws(classOf[Exception])
  def start {
val ssc: StreamingContext = createStreamingContext

val messagesRDD = createKafkaDStream(ssc, "myTopic", 2)

def datasRDD = messagesRDD.map((line : String) => 
MapFunctions.lineToSparkEventData(line))

def count = datasRDD.count()
   datasRDD.print(1)

ssc.start
ssc.awaitTermination
  }


  private def createStreamingContext: StreamingContext = {
System.setProperty(HADOOP_HOME_DIR, configContainer.getHadoopHomeDir)
System.setProperty("spark.streaming.concurrentJobs", 
String.valueOf(configContainer.getStreamingConcurrentJobs))
def sparkConf = createSparkConf()

val ssc = new StreamingContext(sparkConf, 
Durations.seconds(configContainer.getStreamingContextDurationSeconds))
ssc.sparkContext.setJobGroup("main_streaming_job", "streaming context 
start")
ssc.sparkContext.setLocalProperty("spark.scheduler.pool", "real_time_pool")
ssc
  }


  private def createSparkConf() : SparkConf = {
  def masterString = "spark://<>:7077"
def conf = new SparkConf().setMaster(masterString).setAppName("devAppRem")  
// This is not working
//def conf = new 
SparkConf().setMaster("local[4]").setAppName("devAppLocal")  // This IS working

conf.set("spark.scheduler.allocation.file", 
"D:\\valid_path_to\\fairscheduler.xml");


val pathToShadedApplicationJar: String = 
configContainer.getApplicationJarPaths.get(0)
val jars: Array[String] = Array[String](pathToShadedApplicationJar)
conf.setJars(jars)

conf.set("spark.scheduler.mode", "FAIR")

  }


  private def createKafkaDStream(ssc: StreamingContext, topics: String, 
numThreads: Int): DStream[String] = {
val zkQuorum: String = configContainer.getZkQuorum
val groupId: String = configContainer.getGroupId
val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
def lines = KafkaUtils.createStream(ssc, zkQuorum, groupId, topicMap, 
StorageLevel.MEMORY_AND_DISK_SER).map(_._2)
lines
  }
}


The Error that I get is:

2015-11-18 10:41:19.191  WARN 3044 --- [result-getter-3] 
o.apache.spark.scheduler.TaskSetManager  : Lost task 0.0 in stage 2.0 (TID 70, 
169.254.95.56): java.io.IOException: java.lang.ClassNotFoundException: 
org.apache.spark.streaming.kafka.KafkaReceiver
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1163)
at 
org.apache.spark.rdd.ParallelCollectionPartition.readObject(ParallelCollectionRDD.scala:70)
at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at 
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:72)
at 
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:98)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: 
org.apache.spark.streaming.kafka.KafkaReceiver
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:274)
at 

Re: Spark LogisticRegression returns scaled coefficients

2015-11-17 Thread DB Tsai
How do you compute the probability given the weights? Also, given a
probability, you need to sample the positive and negative labels based on that
probability; how are you doing this? I'm pretty sure that LoR will
give you the correct weights. Please see
generateMultinomialLogisticInput in
https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/mllib/classification/LogisticRegressionSuite.scala
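To make the point concrete, here is a minimal sketch (not the Spark test helper itself) of
generating binary labels by sampling from the logistic probability instead of thresholding it;
deterministic labels make the data separable, which lets the fitted weights grow far beyond the
generating ones.

import scala.util.Random
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Sketch: labels are drawn as Bernoulli(sigmoid(w . x)), not set deterministically.
val w = Array(2.0, 3.0, 4.0)
val rnd = new Random(42)
val points = (1 to 1000).map { _ =>
  val x = Array.fill(w.length)(rnd.nextGaussian())
  val margin = w.zip(x).map { case (wi, xi) => wi * xi }.sum
  val p = 1.0 / (1.0 + math.exp(-margin))             // P(y = 1 | x)
  val y = if (rnd.nextDouble() < p) 1.0 else 0.0      // sample the label
  LabeledPoint(y, Vectors.dense(x))
}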

Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D


On Tue, Nov 17, 2015 at 4:11 PM, njoshi  wrote:
> I am testing the LogisticRegression performance on a synthetically generated
> data. The weights I have as input are
>
>w = [2, 3, 4]
>
> with no intercept and three features. After training on 1000 synthetically
> generated datapoint assuming random normal distribution for each, the Spark
> LogisticRegression model I obtain has weights as
>
>  [6.005520656096823,9.35980263762698,12.203400879214152]
>
> I can see that each weight is scaled by a factor close to '3' w.r.t. the
> original values. I am unable to guess the reason behind this. The code is
> simple enough as
>
>
> /*
>  * Logistic Regression model
>  */
> val lr = new LogisticRegression()
>   .setMaxIter(50)
>   .setRegParam(0.001)
>   .setElasticNetParam(0.95)
>   .setFitIntercept(false)
>
> val lrModel = lr.fit(trainingData)
>
>
> println(s"${lrModel.weights}")
>
>
>
> I would greatly appreciate if someone could shed some light on what's fishy
> here.
>
> with kind regards, Nikhil
>
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-LogisticRegression-returns-scaled-coefficients-tp25405.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Streaming Job gives error after changing to version 1.5.2

2015-11-17 Thread Tathagata Das
Are you running 1.5.2-compiled jar on a Spark 1.5.2 cluster?

On Tue, Nov 17, 2015 at 5:34 PM, swetha  wrote:

>
>
> Hi,
>
> I see  java.lang.NoClassDefFoundError after changing the Streaming job
> version to 1.5.2. Any idea as to why this is happening? Following are my
> dependencies and the error that I get.
>
> <dependency>
>     <groupId>org.apache.spark</groupId>
>     <artifactId>spark-core_2.10</artifactId>
>     <version>${sparkVersion}</version>
>     <scope>provided</scope>
> </dependency>
>
> <dependency>
>     <groupId>org.apache.spark</groupId>
>     <artifactId>spark-streaming_2.10</artifactId>
>     <version>${sparkVersion}</version>
>     <scope>provided</scope>
> </dependency>
>
> <dependency>
>     <groupId>org.apache.spark</groupId>
>     <artifactId>spark-sql_2.10</artifactId>
>     <version>${sparkVersion}</version>
>     <scope>provided</scope>
> </dependency>
>
> <dependency>
>     <groupId>org.apache.spark</groupId>
>     <artifactId>spark-hive_2.10</artifactId>
>     <version>${sparkVersion}</version>
>     <scope>provided</scope>
> </dependency>
>
> <dependency>
>     <groupId>org.apache.spark</groupId>
>     <artifactId>spark-streaming-kafka_2.10</artifactId>
>     <version>${sparkVersion}</version>
> </dependency>
>
>
> Exception in thread "main" java.lang.NoClassDefFoundError:
> org/apache/spark/streaming/StreamingContext
> at java.lang.Class.getDeclaredMethods0(Native Method)
> at java.lang.Class.privateGetDeclaredMethods(Class.java:2693)
> at java.lang.Class.privateGetMethodRecursive(Class.java:3040)
> at java.lang.Class.getMethod0(Class.java:3010)
> at java.lang.Class.getMethod(Class.java:1776)
> at
> com.intellij.rt.execution.application.AppMain.main(AppMain.java:125)
> Caused by: java.lang.ClassNotFoundException:
> org.apache.spark.streaming.StreamingContext
> at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Streaming-Job-gives-error-after-changing-to-version-1-5-2-tp25406.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Streaming Job gives error after changing to version 1.5.2

2015-11-17 Thread swetha


Hi,

I see  java.lang.NoClassDefFoundError after changing the Streaming job
version to 1.5.2. Any idea as to why this is happening? Following are my
dependencies and the error that I get.

  
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>${sparkVersion}</version>
    <scope>provided</scope>
</dependency>

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.10</artifactId>
    <version>${sparkVersion}</version>
    <scope>provided</scope>
</dependency>

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.10</artifactId>
    <version>${sparkVersion}</version>
    <scope>provided</scope>
</dependency>

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-hive_2.10</artifactId>
    <version>${sparkVersion}</version>
    <scope>provided</scope>
</dependency>

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka_2.10</artifactId>
    <version>${sparkVersion}</version>
</dependency>


Exception in thread "main" java.lang.NoClassDefFoundError:
org/apache/spark/streaming/StreamingContext
at java.lang.Class.getDeclaredMethods0(Native Method)
at java.lang.Class.privateGetDeclaredMethods(Class.java:2693)
at java.lang.Class.privateGetMethodRecursive(Class.java:3040)
at java.lang.Class.getMethod0(Class.java:3010)
at java.lang.Class.getMethod(Class.java:1776)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:125)
Caused by: java.lang.ClassNotFoundException:
org.apache.spark.streaming.StreamingContext
at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Streaming-Job-gives-error-after-changing-to-version-1-5-2-tp25406.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



kafka streaminf 1.5.2 - ClassNotFoundException: org.apache.spark.streaming.kafka.KafkaReceiver

2015-11-17 Thread tim_b123

Hi,
 
I have a spark kafka streaming application that works when I run with a
local spark context, but not with a remote one.
 
My code consists of: 
1.  A spring-boot application that creates the context
2.  A shaded jar file containing all of my spark code
 
On my pc (windows), I have a spark (1.5.2) master and worker running
(spark-1.5.2-bin-hadoop2.6).
 
The entry point for my application is the start() method.
 
The code is:
 
  @throws(classOf[Exception])
  def start {
val ssc: StreamingContext = createStreamingContext
 
val messagesRDD = createKafkaDStream(ssc, "myTopic", 2)
 
def datasRDD = messagesRDD.map((line : String) =>
MapFunctions.lineToSparkEventData(line))
 
def count = datasRDD.count()
   datasRDD.print(1)
 
ssc.start
ssc.awaitTermination
  }
 
 
  private def createStreamingContext: StreamingContext = {
System.setProperty(HADOOP_HOME_DIR, configContainer.getHadoopHomeDir)
System.setProperty("spark.streaming.concurrentJobs",
String.valueOf(configContainer.getStreamingConcurrentJobs))
def sparkConf = createSparkConf()
 
val ssc = new StreamingContext(sparkConf,
Durations.seconds(configContainer.getStreamingContextDurationSeconds))
ssc.sparkContext.setJobGroup("main_streaming_job", "streaming context
start")
ssc.sparkContext.setLocalProperty("spark.scheduler.pool",
"real_time_pool")
ssc
  }
 
 
  private def createSparkConf() : SparkConf = {
  def masterString = "spark://<>:7077"
def conf = new
SparkConf().setMaster(masterString).setAppName("devAppRem")  // This is not
working
//def conf = new
SparkConf().setMaster("local[4]").setAppName("devAppLocal")  // This IS
working
 
conf.set("spark.scheduler.allocation.file",
"D:\\valid_path_to\\fairscheduler.xml");
 
 
val pathToShadedApplicationJar: String =
configContainer.getApplicationJarPaths.get(0)
val jars: Array[String] = Array[String](pathToShadedApplicationJar)
conf.setJars(jars)
 
conf.set("spark.scheduler.mode", "FAIR")
 
  }
 
 
  private def createKafkaDStream(ssc: StreamingContext, topics: String,
numThreads: Int): DStream[String] = {
val zkQuorum: String = configContainer.getZkQuorum
val groupId: String = configContainer.getGroupId
val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
def lines = KafkaUtils.createStream(ssc, zkQuorum, groupId, topicMap,
StorageLevel.MEMORY_AND_DISK_SER).map(_._2)
lines
  }
}
 
 
The Error that I get is:
 
2015-11-18 12:58:33.755  WARN 6132 --- [result-getter-2]
o.apache.spark.scheduler.TaskSetManager  : Lost task 0.0 in stage 2.0 (TID
70, 169.254.95.56): java.io.IOException: java.lang.ClassNotFoundException:
org.apache.spark.streaming.kafka.KafkaReceiver
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1163)
at
org.apache.spark.rdd.ParallelCollectionPartition.readObject(ParallelCollectionRDD.scala:70)
at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:72)
at
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:98)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException:
org.apache.spark.streaming.kafka.KafkaReceiver
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:274)
at

Re: Calculating Timeseries Aggregation

2015-11-17 Thread Tathagata Das
For this sort of long term aggregations you should use a dedicated data
storage systems. Like a database, or a key-value store. Spark Streaming
would just aggregate and push the necessary data to the data store.
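As a rough illustration of that split (a sketch only: events is assumed to be a DStream of
records carrying a timestamp, and minuteBucket / saveToStore are placeholders for your own
time-bucketing logic and storage client, not real APIs):

// Sketch: Spark Streaming computes per-batch aggregates; the external store holds
// the long-lived per-second/minute/hour/day rollups.
val perMinute = events
  .map(e => (minuteBucket(e.timestamp), 1L))   // bucket each event into its minute
  .reduceByKey(_ + _)                          // count per bucket within this batch

perMinute.foreachRDD { rdd =>
  rdd.foreachPartition { iter =>
    iter.foreach { case (bucket, count) =>
      saveToStore(bucket, count)               // e.g. increment a counter in the DB
    }
  }
}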

TD

On Sat, Nov 14, 2015 at 9:32 PM, Sandip Mehta 
wrote:

> Hi,
>
> I am working on requirement of calculating real time metrics and building
> prototype  on Spark streaming. I need to build aggregate at Seconds,
> Minutes, Hours and Day level.
>
> I am not sure whether I should calculate all these aggregates as
> different Windowed function on input DStream or shall I use
> updateStateByKey function for the same. If I have to use updateStateByKey
> for these time series aggregation, how can I remove keys from the state
> after different time lapsed?
>
> Please suggest.
>
> Regards
> SM
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
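On the "remove keys from the state" part of the quoted question: with updateStateByKey,
returning None for a key drops it from the state. A minimal sketch, where the keyedCounts
DStream, the (runningSum, lastSeenMs) state shape, and the 24-hour cutoff are all illustrative
assumptions:

// Sketch: expire keys that have received no updates for maxAgeMs by returning None.
val maxAgeMs = 24 * 60 * 60 * 1000L

val state = keyedCounts.updateStateByKey[(Long, Long)] {
  (newValues: Seq[Long], old: Option[(Long, Long)]) =>
    val now = System.currentTimeMillis()
    val (prevSum, lastSeen) = old.getOrElse((0L, now))
    if (newValues.isEmpty && now - lastSeen > maxAgeMs) {
      None                                       // drop this key from the state
    } else {
      val seen = if (newValues.nonEmpty) now else lastSeen
      Some((prevSum + newValues.sum, seen))
    }
}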


Re: spark with breeze error of NoClassDefFoundError

2015-11-17 Thread Ted Yu
Looking in local maven repo, breeze_2.10-0.7.jar contains DefaultArrayValue
:

jar tvf
/Users/tyu/.m2/repository//org/scalanlp/breeze_2.10/0.7/breeze_2.10-0.7.jar
| grep !$
jar tvf
/Users/tyu/.m2/repository//org/scalanlp/breeze_2.10/0.7/breeze_2.10-0.7.jar
| grep DefaultArrayValue
   369 Wed Mar 19 11:18:32 PDT 2014
breeze/storage/DefaultArrayValue$mcZ$sp$class.class
   309 Wed Mar 19 11:18:32 PDT 2014
breeze/storage/DefaultArrayValue$mcJ$sp.class
  2233 Wed Mar 19 11:18:32 PDT 2014
breeze/storage/DefaultArrayValue$DoubleDefaultArrayValue$.class

Can you show the complete stack trace ?

FYI

On Tue, Nov 17, 2015 at 8:33 PM, Jack Yang  wrote:

> Hi all,
>
> I am using spark 1.4.0, and building my codes using maven.
>
> So in one of my scala, I used:
>
>
>
> import breeze.linalg._
>
> val v1 = new breeze.linalg.SparseVector(commonVector.indices,
> commonVector.values, commonVector.size)
>
> val v2 = new breeze.linalg.SparseVector(commonVector2.indices,
> commonVector2.values, commonVector2.size)
>
> println (v1.dot(v2) / (norm(v1) * norm(v2)) )
>
>
>
>
>
>
>
> in my pom.xml file, I used:
>
> <dependency>
>     <groupId>org.scalanlp</groupId>
>     <artifactId>breeze-math_2.10</artifactId>
>     <version>0.4</version>
>     <scope>provided</scope>
> </dependency>
>
> <dependency>
>     <groupId>org.scalanlp</groupId>
>     <artifactId>breeze_2.10</artifactId>
>     <version>0.11.2</version>
>     <scope>provided</scope>
> </dependency>
>
>
>
>
>
> When submit, I included breeze jars (breeze_2.10-0.11.2.jar
> breeze-math_2.10-0.4.jar breeze-natives_2.10-0.11.2.jar
> breeze-process_2.10-0.3.jar) using “--jar” arguments, although I doubt it
> is necessary to do that.
>
>
>
> however, the error is
>
>
>
> Exception in thread "main" java.lang.NoClassDefFoundError:
> breeze/storage/DefaultArrayValue
>
>
>
> Any thoughts?
>
>
>
>
>
>
>
> Best regards,
>
> Jack
>
>
>


Additional Master daemon classpath

2015-11-17 Thread Michal Klos
Hi,

We are running a Spark Standalone cluster on EMR (note: not using YARN) and
are trying to use S3 w/ EmrFS as our event logging directory.

We are having difficulties with a ClassNotFoundException on EmrFileSystem
when we navigate to the event log screen. This is to be expected as the
EmrFs jars are not on the classpath.

But -- I have not been able to figure out a way to add additional classpath
jars to the start-up of the Master daemon. SPARK_CLASSPATH has been
deprecated, and looking around at spark-class, etc.. everything seems to be
pretty locked down.

Do I have to shove everything into the assembly jar?

Am I missing a simple way to add classpath to the daemons?

thanks,
Michal


Re: How to properly read the first number lines of file into a RDD

2015-11-17 Thread Zhiliang Zhu
Thanks a lot for your reply. I have also worked it out in some other ways. In
fact, at first I was thinking about using filter to do it, but that failed.


 On Monday, November 9, 2015 9:52 PM, Akhil Das 
 wrote:
   

 ​There's multiple way to achieve this:
1. Read the N lines from the driver and then do a sc.parallelize(nlines) to
create an RDD out of it.
2. Create an RDD from the N + M lines, do a take on N and then broadcast or
parallelize the returned list (see the sketch after this reply).
3. Something like this if the file is in HDFS:

    val n_f = (5, file_name)
    val n_lines = sc.parallelize(Array(n_f))
    val n_linesRDD = n_lines.map(n => {
      // Read and return 5 lines (n._1) from the file (n._2)
    })

Thanks
Best Regards
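A concrete sketch of option 2 above, assuming N is small enough to collect on the driver (the
path is illustrative):

// Sketch: take the first n lines on the driver, then parallelize them back out.
// For a single text file, take(n) returns the first n lines in file order.
val n = 5
val all = sc.textFile("hdfs:///path/to/file")
val firstN = sc.parallelize(all.take(n))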
On Thu, Oct 29, 2015 at 9:51 PM, Zhiliang Zhu  
wrote:

Hi All,

There is a file with N + M lines, and I need to read the first N lines into one RDD.

1. i) Read all N + M lines as one RDD, then ii) select the RDD's top N rows; that may
be one solution.
2. Introduce some broadcast variable set to N, and use it to decide, while mapping the
file RDD, to only map its first N rows; this may not work, however.

Is there some better solution?

Thank you,
Zhiliang




  

Re: Spark LogisticRegression returns scaled coefficients

2015-11-17 Thread Nikhil Joshi
Hi,

Wonderful. I was sampling the output, but with a bug. Your comment brought
the realization :). I was indeed victimized by the complete separability
issue :).

Thanks a lot.
with regards,
Nikhil

On Tue, Nov 17, 2015 at 5:26 PM, DB Tsai  wrote:

> How do you compute the probability given the weights? Also, given a
> probability, you need to sample positive and negative based on the
> probability, and how do you do this? I'm pretty sure that the LoR will
> give you correct weights, and please see the
> generateMultinomialLogisticInput  in
>
> https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/mllib/classification/LogisticRegressionSuite.scala
>
> Sincerely,
>
> DB Tsai
> --
> Web: https://www.dbtsai.com
> PGP Key ID: 0xAF08DF8D
>
>
> On Tue, Nov 17, 2015 at 4:11 PM, njoshi  wrote:
> > I am testing the LogisticRegression performance on a synthetically
> generated
> > data. The weights I have as input are
> >
> >w = [2, 3, 4]
> >
> > with no intercept and three features. After training on 1000
> synthetically
> > generated datapoint assuming random normal distribution for each, the
> Spark
> > LogisticRegression model I obtain has weights as
> >
> >  [6.005520656096823,9.35980263762698,12.203400879214152]
> >
> > I can see that each weight is scaled by a factor close to '3' w.r.t. the
> > original values. I am unable to guess the reason behind this. The code is
> > simple enough as
> >
> >
> > /*
> >  * Logistic Regression model
> >  */
> > val lr = new LogisticRegression()
> >   .setMaxIter(50)
> >   .setRegParam(0.001)
> >   .setElasticNetParam(0.95)
> >   .setFitIntercept(false)
> >
> > val lrModel = lr.fit(trainingData)
> >
> >
> > println(s"${lrModel.weights}")
> >
> >
> >
> > I would greatly appreciate if someone could shed some light on what's
> fishy
> > here.
> >
> > with kind regards, Nikhil
> >
> >
> >
> >
> > --
> > View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-LogisticRegression-returns-scaled-coefficients-tp25405.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> > For additional commands, e-mail: user-h...@spark.apache.org
> >
>



-- 

*Nikhil Joshi*
Princ Data Scientist
Aol PLATFORMS
395 Page Mill Rd, Palo Alto, CA 94306-2024
vvmr: 8894737


Re: Streaming Job gives error after changing to version 1.5.2

2015-11-17 Thread swetha kasireddy
I see this error locally.

On Tue, Nov 17, 2015 at 5:44 PM, Tathagata Das  wrote:

> Are you running 1.5.2-compiled jar on a Spark 1.5.2 cluster?
>
> On Tue, Nov 17, 2015 at 5:34 PM, swetha  wrote:
>
>>
>>
>> Hi,
>>
>> I see  java.lang.NoClassDefFoundError after changing the Streaming job
>> version to 1.5.2. Any idea as to why this is happening? Following are my
>> dependencies and the error that I get.
>>
>> <dependency>
>>     <groupId>org.apache.spark</groupId>
>>     <artifactId>spark-core_2.10</artifactId>
>>     <version>${sparkVersion}</version>
>>     <scope>provided</scope>
>> </dependency>
>>
>> <dependency>
>>     <groupId>org.apache.spark</groupId>
>>     <artifactId>spark-streaming_2.10</artifactId>
>>     <version>${sparkVersion}</version>
>>     <scope>provided</scope>
>> </dependency>
>>
>> <dependency>
>>     <groupId>org.apache.spark</groupId>
>>     <artifactId>spark-sql_2.10</artifactId>
>>     <version>${sparkVersion}</version>
>>     <scope>provided</scope>
>> </dependency>
>>
>> <dependency>
>>     <groupId>org.apache.spark</groupId>
>>     <artifactId>spark-hive_2.10</artifactId>
>>     <version>${sparkVersion}</version>
>>     <scope>provided</scope>
>> </dependency>
>>
>> <dependency>
>>     <groupId>org.apache.spark</groupId>
>>     <artifactId>spark-streaming-kafka_2.10</artifactId>
>>     <version>${sparkVersion}</version>
>> </dependency>
>>
>>
>> Exception in thread "main" java.lang.NoClassDefFoundError:
>> org/apache/spark/streaming/StreamingContext
>> at java.lang.Class.getDeclaredMethods0(Native Method)
>> at java.lang.Class.privateGetDeclaredMethods(Class.java:2693)
>> at java.lang.Class.privateGetMethodRecursive(Class.java:3040)
>> at java.lang.Class.getMethod0(Class.java:3010)
>> at java.lang.Class.getMethod(Class.java:1776)
>> at
>> com.intellij.rt.execution.application.AppMain.main(AppMain.java:125)
>> Caused by: java.lang.ClassNotFoundException:
>> org.apache.spark.streaming.StreamingContext
>> at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
>> at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
>> at java.security.AccessController.doPrivileged(Native Method)
>> at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Streaming-Job-gives-error-after-changing-to-version-1-5-2-tp25406.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: Issue while Spark Job fetching data from Cassandra DB

2015-11-17 Thread satish chandra j
HI All,
I am getting a "*.UnauthorizedException: User  has no SELECT
permission on  or any of its parents*" error
while the Spark job is fetching data from Cassandra, but it is able to save data
into Cassandra without any issues.

Note: With the same user ,  I am able to access and query the
table in the CQL UI, and the code used in the Spark job has been tested in the Spark shell
and is working fine.

Regards,
Satish Chandra

On Tue, Nov 17, 2015 at 11:45 PM, satish chandra j  wrote:

> HI All,
> I am getting "*.UnauthorizedException: User  has no SELECT
> permission on  or any of its parents*" error
> while Spark job is fetching data from Cassandra but could able to save data
> into Cassandra with out any issues
>
> Note: With the same user ,  I could able to access and query
> the table in CQL UI and code used in Spark Job has been tested in Spark
> Shell and it is working fine
>
> Please find the below stack trace
>
> WARN  2015-11-17 07:24:23 org.apache.spark.scheduler.DAGScheduler:
> Creating new stage failed due to exception - job: 0
>
> com.datastax.driver.core.exceptions.UnauthorizedException:* User <*
> *UserIDXYZ**> has no SELECT permission on 
> or any of its parents*
>
> at
> com.datastax.driver.core.exceptions.UnauthorizedException.copy(UnauthorizedException.java:36)
> ~[cassandra-driver-core-2.1.7.1.jar:na]
>
> at
> com.datastax.driver.core.DefaultResultSetFuture.extractCauseFromExecutionException(DefaultResultSetFuture.java:269)
> ~[cassandra-driver-core-2.1.7.1.jar:na]
>
> at
> com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:183)
> ~[cassandra-driver-core-2.1.7.1.jar:na]
>
> at
> com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:52)
> ~[cassandra-driver-core-2.1.7.1.jar:na]
>
> at
> com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:44)
> ~[cassandra-driver-core-2.1.7.1.jar:na]
>
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> ~[na:1.8.0_51]
>
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> ~[na:1.8.0_51]
>
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> ~[na:1.8.0_51]
>
> at java.lang.reflect.Method.invoke(Method.java:497) ~[na:1.8.0_51]
>
> at
> com.datastax.spark.connector.cql.SessionProxy.invoke(SessionProxy.scala:33)
> ~[spark-cassandra-connector_2.10-1.4.0.jar:1.4.0]
>
> at com.sun.proxy.$Proxy10.execute(Unknown Source) ~[na:na]
>
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> ~[na:1.8.0_51]
>
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> ~[na:1.8.0_51]
>
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> ~[na:1.8.0_51]
>
>at java.lang.reflect.Method.invoke(Method.java:497) ~[na:1.8.0_51]
>
> at
> com.datastax.spark.connector.cql.SessionProxy.invoke(SessionProxy.scala:33)
> ~[spark-cassandra-connector_2.10-1.4.0.jar:1.4.0]
>
> at com.sun.proxy.$Proxy10.execute(Unknown Source) ~[na:na]
>
> at
> com.datastax.spark.connector.rdd.partitioner.DataSizeEstimates$$anonfun$tokenRanges$1.apply(DataSizeEstimates.scala:40)
> ~[spark-cassandra-connector_2.10-1.4.0.jar:1.4.0]
>
> at
> com.datastax.spark.connector.rdd.partitioner.DataSizeEstimates$$anonfun$tokenRanges$1.apply(DataSizeEstimates.scala:38)
> ~[spark-cassandra-connector_2.10-1.4.0.jar:1.4.0]
>
> at
> com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:110)
> ~[spark-cassandra-connector_2.10-1.4.0.jar:1.4.0]
>
> at
> com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:109)
> ~[spark-cassandra-connector_2.10-1.4.0.jar:1.4.0]
>
> at
> com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:139)
> ~[spark-cassandra-connector_2.10-1.4.0.jar:1.4.0]
>
> at
> com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:109)
> ~[spark-cassandra-connector_2.10-1.4.0.jar:1.4.0]
>
> at
> com.datastax.spark.connector.rdd.partitioner.DataSizeEstimates.tokenRanges$lzycompute(DataSizeEstimates.scala:38)
> ~[spark-cassandra-connector_2.10-1.4.0.jar:1.4.0]
>
> at
> com.datastax.spark.connector.rdd.partitioner.DataSizeEstimates.tokenRanges(DataSizeEstimates.scala:37)
> ~[spark-cassandra-connector_2.10-1.4.0.jar:1.4.0]
>
> at
> com.datastax.spark.connector.rdd.partitioner.DataSizeEstimates.dataSizeInBytes$lzycompute(DataSizeEstimates.scala:81)
> ~[spark-cassandra-connector_2.10-1.4.0.jar:1.4.0]
>
> at
> com.datastax.spark.connector.rdd.partitioner.DataSizeEstimates.dataSizeInBytes(DataSizeEstimates.scala:80)
> ~[spark-cassandra-connector_2.10-1.4.0.jar:1.4.0]
>
> 

Re: zeppelin (or spark-shell) with HBase fails on executor level

2015-11-17 Thread 임정택
Ted,

Could you elaborate, please?

I maintain a separate HBase cluster and Mesos cluster for some reasons, and
I can just make it work via spark-submit or spark-shell / zeppelin with a
newly initialized SparkContext.

Thanks,
Jungtaek Lim (HeartSaVioR)

2015-11-17 22:17 GMT+09:00 Ted Yu :

> I am a bit curious:
> Hbase depends on hdfs.
> Has hdfs support for Mesos been fully implemented ?
>
> Last time I checked, there was still work to be done.
>
> Thanks
>
> On Nov 17, 2015, at 1:06 AM, 임정택  wrote:
>
> Oh, one thing I missed is, I built Spark 1.4.1 Cluster with 6 nodes of
> Mesos 0.22.1 H/A (via ZK) cluster.
>
> 2015-11-17 18:01 GMT+09:00 임정택 :
>
>> Hi all,
>>
>> I'm evaluating zeppelin to run driver which interacts with HBase.
>> I use fat jar to include HBase dependencies, and see failures on executor
>> level.
>> I thought it is zeppelin's issue, but it fails on spark-shell, too.
>>
>> I loaded fat jar via --jars option,
>>
>> > ./bin/spark-shell --jars hbase-included-assembled.jar
>>
>> and run driver code using provided SparkContext instance, and see
>> failures from spark-shell console and executor logs.
>>
>> below is stack traces,
>>
>> org.apache.spark.SparkException: Job aborted due to stage failure: Task 55 
>> in stage 0.0 failed 4 times, most recent failure: Lost task 55.3 in stage 
>> 0.0 (TID 281, ): java.lang.NoClassDefFoundError: Could not 
>> initialize class org.apache.hadoop.hbase.client.HConnectionManager
>> at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:197)
>> at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:159)
>> at 
>> org.apache.hadoop.hbase.mapreduce.TableInputFormat.setConf(TableInputFormat.java:101)
>> at 
>> org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:128)
>> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:104)
>> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:66)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>> at 
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>> at 
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
>> at 
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>> at org.apache.spark.scheduler.Task.run(Task.scala:70)
>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>> at 
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> at 
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> at java.lang.Thread.run(Thread.java:745)
>>
>> Driver stacktrace:
>> at 
>> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1273)
>> at 
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1264)
>> at 
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1263)
>> at 
>> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>> at 
>> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1263)
>> at 
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
>> at 
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
>> at scala.Option.foreach(Option.scala:236)
>> at 
>> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
>> at 
>> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1457)
>> at 
>> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1418)
>> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>>
>>
>> 15/11/16 18:59:57 ERROR Executor: Exception in task 14.0 in stage 0.0 (TID 
>> 14)
>> java.lang.ExceptionInInitializerError
>> at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:197)
>> at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:159)
>> at 
>> org.apache.hadoop.hbase.mapreduce.TableInputFormat.setConf(TableInputFormat.java:101)
>> at 
>> org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:128)
>> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:104)
>> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:66)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>> at 
>> 

Re: zeppelin (or spark-shell) with HBase fails on executor level

2015-11-17 Thread Ted Yu
I see - your HBase cluster is separate from Mesos cluster.
I somehow got (incorrect) impression that HBase cluster runs on Mesos.

On Tue, Nov 17, 2015 at 7:53 PM, 임정택  wrote:

> Ted,
>
> Could you elaborate, please?
>
> I maintain separated HBase cluster and Mesos cluster for some reasons, and
> I just can make it work via spark-submit or spark-shell / zeppelin with
> newly initialized SparkContext.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
> 2015-11-17 22:17 GMT+09:00 Ted Yu :
>
>> I am a bit curious:
>> Hbase depends on hdfs.
>> Has hdfs support for Mesos been fully implemented ?
>>
>> Last time I checked, there was still work to be done.
>>
>> Thanks
>>
>> On Nov 17, 2015, at 1:06 AM, 임정택  wrote:
>>
>> Oh, one thing I missed is, I built Spark 1.4.1 Cluster with 6 nodes of
>> Mesos 0.22.1 H/A (via ZK) cluster.
>>
>> 2015-11-17 18:01 GMT+09:00 임정택 :
>>
>>> Hi all,
>>>
>>> I'm evaluating zeppelin to run driver which interacts with HBase.
>>> I use fat jar to include HBase dependencies, and see failures on
>>> executor level.
>>> I thought it is zeppelin's issue, but it fails on spark-shell, too.
>>>
>>> I loaded fat jar via --jars option,
>>>
>>> > ./bin/spark-shell --jars hbase-included-assembled.jar
>>>
>>> and run driver code using provided SparkContext instance, and see
>>> failures from spark-shell console and executor logs.
>>>
>>> below is stack traces,
>>>
>>> org.apache.spark.SparkException: Job aborted due to stage failure: Task 55 
>>> in stage 0.0 failed 4 times, most recent failure: Lost task 55.3 in stage 
>>> 0.0 (TID 281, ): java.lang.NoClassDefFoundError: Could not 
>>> initialize class org.apache.hadoop.hbase.client.HConnectionManager
>>> at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:197)
>>> at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:159)
>>> at 
>>> org.apache.hadoop.hbase.mapreduce.TableInputFormat.setConf(TableInputFormat.java:101)
>>> at 
>>> org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:128)
>>> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:104)
>>> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:66)
>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>>> at 
>>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>>> at 
>>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
>>> at 
>>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>>> at org.apache.spark.scheduler.Task.run(Task.scala:70)
>>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>>> at 
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>> at 
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>> at java.lang.Thread.run(Thread.java:745)
>>>
>>> Driver stacktrace:
>>> at 
>>> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1273)
>>> at 
>>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1264)
>>> at 
>>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1263)
>>> at 
>>> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>>> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>>> at 
>>> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1263)
>>> at 
>>> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
>>> at 
>>> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
>>> at scala.Option.foreach(Option.scala:236)
>>> at 
>>> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
>>> at 
>>> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1457)
>>> at 
>>> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1418)
>>> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>>>
>>>
>>> 15/11/16 18:59:57 ERROR Executor: Exception in task 14.0 in stage 0.0 (TID 
>>> 14)
>>> java.lang.ExceptionInInitializerError
>>> at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:197)
>>> at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:159)
>>> at 
>>> org.apache.hadoop.hbase.mapreduce.TableInputFormat.setConf(TableInputFormat.java:101)
>>> at 
>>> org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:128)
>>> at 

Incorrect results with reduceByKey

2015-11-17 Thread tovbinm
Howdy,

We've noticed a strange behavior with Avro serialized data and reduceByKey
RDD functionality. Please see below:

 // We're reading a bunch of Avro serialized data
val data: RDD[T] = sparkContext.hadoopFile(path,
classOf[AvroInputFormat[T]], classOf[AvroWrapper[T]], classOf[NullWritable])
// Incorrect data returned
val bad: RDD[(String,List[T])] = data.map(r => (r.id,
List(r))).reduceByKey(_ ++ _)
// After adding the partitioner we get everything as expected
val good: RDD[(String,List[T])] = data.map(r => (r.id,
List(r))).partitionBy(Partitioner.defaultPartitioner(data)).reduceByKey(_ ++
_)


Any ideas? 

Thanks in advance
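One common explanation for this kind of symptom, offered only as a hedged guess rather than a
confirmed diagnosis: hadoopFile-based inputs can reuse the same record object for every record,
so a map-side combine over the original objects may end up aggregating references to one mutated
record, while forcing a shuffle first (as partitionBy does) serializes copies and masks the
problem. The usual mitigation is to materialize a copy of each record before combining;
deepCopyOf below is a placeholder for however your Avro type is cloned.

// Sketch: copy each record before the map-side combine in reduceByKey.
val copied: RDD[(String, List[T])] =
  data.map(r => (r.id, List(deepCopyOf(r))))   // deepCopyOf is a placeholder
      .reduceByKey(_ ++ _)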



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Incorrect-results-with-reduceByKey-tp25410.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: zeppelin (or spark-shell) with HBase fails on executor level

2015-11-17 Thread 임정택
Ted,

Thanks for the reply.

Among the Spark-related libraries, my fat jar only has a dependency on spark-core, and
that is marked as "provided".
It seems like Spark only adds hbase-common 0.98.7-hadoop2 in the spark-examples
module.

And if there're two hbase-default.xml in the classpath, should one of them
be loaded, instead of showing (null)?
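For reference, a sketch of applying the hbase.defaults.for.version.skip suggestion (quoted
below) programmatically instead of via hbase-site.xml; the table name is illustrative, and
whether this resolves the (null) default-version symptom in this setup is an assumption.

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

// Sketch: skip the hbase-default.xml version check when two HBase versions end up
// on the executor classpath.
val hbaseConf = HBaseConfiguration.create()
hbaseConf.setBoolean("hbase.defaults.for.version.skip", true)
hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table")   // illustrative table name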

Best,
Jungtaek Lim (HeartSaVioR)



2015-11-18 13:50 GMT+09:00 Ted Yu :

> Looks like there're two hbase-default.xml in the classpath: one for 0.98.6
> and another for 0.98.7-hadoop2 (used by Spark)
>
> You can specify hbase.defaults.for.version.skip as true in your
> hbase-site.xml
>
> Cheers
>
> On Tue, Nov 17, 2015 at 1:01 AM, 임정택  wrote:
>
>> Hi all,
>>
>> I'm evaluating zeppelin to run driver which interacts with HBase.
>> I use fat jar to include HBase dependencies, and see failures on executor
>> level.
>> I thought it is zeppelin's issue, but it fails on spark-shell, too.
>>
>> I loaded fat jar via --jars option,
>>
>> > ./bin/spark-shell --jars hbase-included-assembled.jar
>>
>> and run driver code using provided SparkContext instance, and see
>> failures from spark-shell console and executor logs.
>>
>> below is stack traces,
>>
>> org.apache.spark.SparkException: Job aborted due to stage failure: Task 55 
>> in stage 0.0 failed 4 times, most recent failure: Lost task 55.3 in stage 
>> 0.0 (TID 281, ): java.lang.NoClassDefFoundError: Could not 
>> initialize class org.apache.hadoop.hbase.client.HConnectionManager
>> at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:197)
>> at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:159)
>> at 
>> org.apache.hadoop.hbase.mapreduce.TableInputFormat.setConf(TableInputFormat.java:101)
>> at 
>> org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:128)
>> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:104)
>> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:66)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>> at 
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>> at 
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
>> at 
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>> at org.apache.spark.scheduler.Task.run(Task.scala:70)
>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>> at 
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> at 
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> at java.lang.Thread.run(Thread.java:745)
>>
>> Driver stacktrace:
>> at 
>> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1273)
>> at 
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1264)
>> at 
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1263)
>> at 
>> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>> at 
>> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1263)
>> at 
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
>> at 
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
>> at scala.Option.foreach(Option.scala:236)
>> at 
>> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
>> at 
>> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1457)
>> at 
>> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1418)
>> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>>
>>
>> 15/11/16 18:59:57 ERROR Executor: Exception in task 14.0 in stage 0.0 (TID 
>> 14)
>> java.lang.ExceptionInInitializerError
>> at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:197)
>> at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:159)
>> at 
>> org.apache.hadoop.hbase.mapreduce.TableInputFormat.setConf(TableInputFormat.java:101)
>> at 
>> org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:128)
>> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:104)
>> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:66)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>> at 
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>> 

Re: Spark build error

2015-11-17 Thread Ted Yu
Is the Scala version in Intellij the same as the one used by sbt ?

Cheers

On Tue, Nov 17, 2015 at 6:45 PM, 金国栋  wrote:

> Hi!
>
> I tried to build spark source code from github, and I successfully built
> it from command line using `*sbt/sbt assembly*`. While I encountered an
> error when compiling the project in Intellij IDEA(V14.1.5).
>
>
> The error log is below:
> *Error:scala: *
> * while compiling:
> /Users/ray/Documents/P01_Project/Spark-Github/spark/sql/core/src/main/scala/org/apache/spark/sql/util/QueryExecutionListener.scala*
> *during phase: jvm*
>  library version: version 2.10.5
> compiler version: version 2.10.5
>   reconstructed args: -nobootcp -javabootclasspath : -deprecation -feature
> -classpath
> 

Re: Spark build error

2015-11-17 Thread Jeff Zhang
This also bothered me for a long time. I suspect the IntelliJ builder
conflicts with the sbt/maven builder.

I resolved this issue by rebuilding Spark in IntelliJ. You may hit a
compilation issue when building it in IntelliJ; for that you need to put
external/flume-sink/target/java on the source build path.



On Wed, Nov 18, 2015 at 12:02 PM, Ted Yu  wrote:

> Is the Scala version in Intellij the same as the one used by sbt ?
>
> Cheers
>
> On Tue, Nov 17, 2015 at 6:45 PM, 金国栋  wrote:
>
>> Hi!
>>
>> I tried to build spark source code from github, and I successfully built
>> it from command line using `*sbt/sbt assembly*`. While I encountered an
>> error when compiling the project in Intellij IDEA(V14.1.5).
>>
>>
>> The error log is below:
>> *Error:scala: *
>> * while compiling:
>> /Users/ray/Documents/P01_Project/Spark-Github/spark/sql/core/src/main/scala/org/apache/spark/sql/util/QueryExecutionListener.scala*
>> *during phase: jvm*
>>  library version: version 2.10.5
>> compiler version: version 2.10.5
>>   reconstructed args: -nobootcp -javabootclasspath : -deprecation
>> -feature -classpath
>> 

Re: Issue while Spark Job fetching data from Cassandra DB

2015-11-17 Thread Ted Yu
Have you considered posting to the Cassandra mailing list?

A brief search led to CASSANDRA-7894

FYI

On Tue, Nov 17, 2015 at 7:24 PM, satish chandra j 
wrote:

> HI All,
> I am getting "*.UnauthorizedException: User  has no SELECT
> permission on  or any of its parents*" error
> while Spark job is fetching data from Cassandra but could able to save data
> into Cassandra with out any issues
>
> Note: With the same user ,  I could able to access and query
> the table in CQL UI and code used in Spark Job has been tested in Spark
> Shell and it is working fine
>
> Regards,
> Satish Chandra
>
> On Tue, Nov 17, 2015 at 11:45 PM, satish chandra j <
> jsatishchan...@gmail.com> wrote:
>
>> HI All,
>> I am getting "*.UnauthorizedException: User  has no SELECT
>> permission on  or any of its parents*"
>> error while Spark job is fetching data from Cassandra but could able to
>> save data into Cassandra with out any issues
>>
>> Note: With the same user ,  I could able to access and query
>> the table in CQL UI and code used in Spark Job has been tested in Spark
>> Shell and it is working fine
>>
>> Please find the below stack trace
>>
>> WARN  2015-11-17 07:24:23 org.apache.spark.scheduler.DAGScheduler:
>> Creating new stage failed due to exception - job: 0
>>
>> com.datastax.driver.core.exceptions.UnauthorizedException:* User <*
>> *UserIDXYZ**> has no SELECT permission on 
>> or any of its parents*
>>
>> at
>> com.datastax.driver.core.exceptions.UnauthorizedException.copy(UnauthorizedException.java:36)
>> ~[cassandra-driver-core-2.1.7.1.jar:na]
>>
>> at
>> com.datastax.driver.core.DefaultResultSetFuture.extractCauseFromExecutionException(DefaultResultSetFuture.java:269)
>> ~[cassandra-driver-core-2.1.7.1.jar:na]
>>
>> at
>> com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:183)
>> ~[cassandra-driver-core-2.1.7.1.jar:na]
>>
>> at
>> com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:52)
>> ~[cassandra-driver-core-2.1.7.1.jar:na]
>>
>> at
>> com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:44)
>> ~[cassandra-driver-core-2.1.7.1.jar:na]
>>
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> ~[na:1.8.0_51]
>>
>> at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>> ~[na:1.8.0_51]
>>
>> at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> ~[na:1.8.0_51]
>>
>> at java.lang.reflect.Method.invoke(Method.java:497) ~[na:1.8.0_51]
>>
>> at
>> com.datastax.spark.connector.cql.SessionProxy.invoke(SessionProxy.scala:33)
>> ~[spark-cassandra-connector_2.10-1.4.0.jar:1.4.0]
>>
>> at com.sun.proxy.$Proxy10.execute(Unknown Source) ~[na:na]
>>
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> ~[na:1.8.0_51]
>>
>> at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>> ~[na:1.8.0_51]
>>
>> at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> ~[na:1.8.0_51]
>>
>>at java.lang.reflect.Method.invoke(Method.java:497) ~[na:1.8.0_51]
>>
>> at
>> com.datastax.spark.connector.cql.SessionProxy.invoke(SessionProxy.scala:33)
>> ~[spark-cassandra-connector_2.10-1.4.0.jar:1.4.0]
>>
>> at com.sun.proxy.$Proxy10.execute(Unknown Source) ~[na:na]
>>
>> at
>> com.datastax.spark.connector.rdd.partitioner.DataSizeEstimates$$anonfun$tokenRanges$1.apply(DataSizeEstimates.scala:40)
>> ~[spark-cassandra-connector_2.10-1.4.0.jar:1.4.0]
>>
>> at
>> com.datastax.spark.connector.rdd.partitioner.DataSizeEstimates$$anonfun$tokenRanges$1.apply(DataSizeEstimates.scala:38)
>> ~[spark-cassandra-connector_2.10-1.4.0.jar:1.4.0]
>>
>> at
>> com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:110)
>> ~[spark-cassandra-connector_2.10-1.4.0.jar:1.4.0]
>>
>> at
>> com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:109)
>> ~[spark-cassandra-connector_2.10-1.4.0.jar:1.4.0]
>>
>> at
>> com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:139)
>> ~[spark-cassandra-connector_2.10-1.4.0.jar:1.4.0]
>>
>> at
>> com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:109)
>> ~[spark-cassandra-connector_2.10-1.4.0.jar:1.4.0]
>>
>> at
>> com.datastax.spark.connector.rdd.partitioner.DataSizeEstimates.tokenRanges$lzycompute(DataSizeEstimates.scala:38)
>> ~[spark-cassandra-connector_2.10-1.4.0.jar:1.4.0]
>>
>> at
>> com.datastax.spark.connector.rdd.partitioner.DataSizeEstimates.tokenRanges(DataSizeEstimates.scala:37)
>> ~[spark-cassandra-connector_2.10-1.4.0.jar:1.4.0]
>>
>> at
>> 

Re: how can evenly distribute my records in all partition

2015-11-17 Thread Sonal Goyal
Think about how you want to distribute your data and how your keys are
currently spread. Do you want to compute something per day, per week, etc.?
Based on that, return a partition number. You could use mod 30 or some such
function to get the partitions; see the sketch below.
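
A minimal sketch of such a partitioner, assuming the keys are epoch-millisecond
timestamps and the partition count is passed in (both of these are assumptions,
not details from this thread):

import org.apache.spark.Partitioner

// Spreads Long timestamp keys over a fixed number of partitions by taking the
// key modulo the partition count. With 30 records and 30 partitions this gives
// roughly one record per partition, but it is not a guarantee; for exactly one
// record per partition you would key by a sequential index (e.g. zipWithIndex)
// instead of the raw timestamp.
class TimestampPartitioner(parts: Int) extends Partitioner {
  override def numPartitions: Int = parts
  override def getPartition(key: Any): Int = {
    val ts = key.asInstanceOf[Long]
    (((ts % parts) + parts) % parts).toInt  // non-negative modulo
  }
  override def equals(other: Any): Boolean = other match {
    case p: TimestampPartitioner => p.numPartitions == numPartitions
    case _ => false
  }
  override def hashCode: Int = numPartitions
}

// usage, assuming rdd: RDD[(Long, Array[Byte])]:
//   val evenlySpread = rdd.partitionBy(new TimestampPartitioner(30))
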
On Nov 18, 2015 5:17 AM, "prateek arora"  wrote:

> Hi
> I am trying to implement a custom partitioner using this link
> http://stackoverflow.com/questions/23127329/how-to-define-custom-partitioner-for-spark-rdds-of-equally-sized-partition-where
> (in the linked example the key values run from 0 to (noOfElement - 1))
>
> but I am not able to understand how to implement a custom partitioner in my
> case:
>
> my parent RDD has 4 partitions, and the RDD key is a TimeStamp and the value
> is a JPEG byte array.
>
>
> Regards
> Prateek
>
>
> On Tue, Nov 17, 2015 at 9:28 AM, Ted Yu  wrote:
>
>> Please take a look at the following for example:
>>
>> ./core/src/main/scala/org/apache/spark/api/python/PythonPartitioner.scala
>> ./core/src/main/scala/org/apache/spark/Partitioner.scala
>>
>> Cheers
>>
>> On Tue, Nov 17, 2015 at 9:24 AM, prateek arora <
>> prateek.arora...@gmail.com> wrote:
>>
>>> Hi
>>> Thanks
>>> I am new to Spark development, so could you provide some help with writing
>>> a custom partitioner to achieve this?
>>> If you have a link or an example of writing a custom partitioner, please
>>> share it with me.
>>>
>>> On Mon, Nov 16, 2015 at 6:13 PM, Sabarish Sasidharan <
>>> sabarish.sasidha...@manthan.com> wrote:
>>>
 You can write your own custom partitioner to achieve this

 Regards
 Sab
 On 17-Nov-2015 1:11 am, "prateek arora" 
 wrote:

> Hi
>
> I have an RDD with 30 records (key/value pairs) and I am running 30
> executors. I want to repartition this RDD into 30 partitions so that every
> partition gets one record and is assigned to one executor.
>
> When I use rdd.repartition(30) it repartitions my RDD into 30 partitions,
> but some partitions get 2 records, some get 1 record, and some do not get
> any record.
>
> Is there any way in Spark to evenly distribute my records across all
> partitions?
>
> Regards
> Prateek
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/how-can-evenly-distribute-my-records-in-all-partition-tp25394.html
> Sent from the Apache Spark User List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
>>>
>>
>


Re: spark-submit stuck and no output in console

2015-11-17 Thread Kayode Odeyemi
Anyone experienced this issue as well?

On Mon, Nov 16, 2015 at 8:06 PM, Kayode Odeyemi  wrote:

>
> Or are you saying that the Java process never even starts?
>
>
> Exactly.
>
> Here's what I got back from jstack as expected:
>
> hadoop-user@yks-hadoop-m01:/usr/local/spark/bin$ jstack 31316
> 31316: Unable to open socket file: target process not responding or
> HotSpot VM not loaded
> The -F option can be used when the target process is not responding
> hadoop-user@yks-hadoop-m01:/usr/local/spark/bin$ jstack 31316 -F
> Attaching to core -F from executable 31316, please wait...
> Error attaching to core file: Can't attach to the core file
>
>
>


-- 
Odeyemi 'Kayode O.
http://ng.linkedin.com/in/kayodeodeyemi. t: @charyorde


synchronizing streams of different kafka topics

2015-11-17 Thread Antony Mayi
Hi,
I have two streams coming from two different Kafka topics. The two topics
contain time-related events but are quite asymmetric in volume. I obviously
need to process them in sync to keep the time-related events together, but
with the same processing rate: if the heavier stream starts backlogging, the
events from the smaller stream arrive ahead of the relevant events that are
still in the backlog of the heavy stream.
Is there any way to process the smaller stream at a slower rate so that its
events arrive together with the corresponding events from the heavy stream?
Thanks, Antony.

Re: spark-submit stuck and no output in console

2015-11-17 Thread Sonal Goyal
How did the example Spark jobs go? SparkPi etc.?

Best Regards,
Sonal
Founder, Nube Technologies 
Reifier at Strata Hadoop World

Reifier at Spark Summit 2015






On Tue, Nov 17, 2015 at 3:24 PM, Kayode Odeyemi  wrote:

> Thanks for the reply Sonal.
>
> I'm on JDK 7 (/usr/lib/jvm/java-7-oracle)
>
> My env is a YARN cluster made of 7 nodes (6 datanodes/
> node manager, 1 namenode/resource manager).
>
> The namenode is where I executed the spark-submit job, while on one of
> the datanodes I executed 'hadoop fs -put /binstore /user/hadoop-user/'
> to dump 1TB of data into all the datanodes. That process is still running
> without hassle and is only using 1.3 GB of its 1.7g heap space.
>
> Initially, I submitted 2 jobs to the YARN cluster which ran for 2
> days and then suddenly stopped. Nothing in the logs shows the root cause.
>
>
> On Tue, Nov 17, 2015 at 11:42 AM, Sonal Goyal 
> wrote:
>
>> Could it be jdk related ? Which version are you on?
>>
>> Best Regards,
>> Sonal
>> Founder, Nube Technologies 
>> Reifier at Strata Hadoop World
>> 
>> Reifier at Spark Summit 2015
>> 
>>
>> 
>>
>>
>>
>> On Tue, Nov 17, 2015 at 2:48 PM, Kayode Odeyemi 
>> wrote:
>>
>>> Anyone experienced this issue as well?
>>>
>>> On Mon, Nov 16, 2015 at 8:06 PM, Kayode Odeyemi 
>>> wrote:
>>>

 Or are you saying that the Java process never even starts?


 Exactly.

 Here's what I got back from jstack as expected:

 hadoop-user@yks-hadoop-m01:/usr/local/spark/bin$ jstack 31316
 31316: Unable to open socket file: target process not responding or
 HotSpot VM not loaded
 The -F option can be used when the target process is not responding
 hadoop-user@yks-hadoop-m01:/usr/local/spark/bin$ jstack 31316 -F
 Attaching to core -F from executable 31316, please wait...
 Error attaching to core file: Can't attach to the core file



>>>
>>>
>>> --
>>> Odeyemi 'Kayode O.
>>> http://ng.linkedin.com/in/kayodeodeyemi. t: @charyorde
>>>
>>
>>
>
>
> --
> Odeyemi 'Kayode O.
> http://ng.linkedin.com/in/kayodeodeyemi. t: @charyorde
>


Off-heap memory usage of Spark Executors keeps increasing

2015-11-17 Thread b.schopman
Hi,

The off-heap memory usage of the 3 Spark executor processes keeps increasing 
constantly until the boundaries of the physical RAM are hit. This happened two 
weeks ago, at which point the system came to a grinding halt because it was 
unable to spawn new processes. At such a moment, restarting Spark is the obvious 
solution. In the collectd memory usage graph in the link below (1) we see two 
moments when we restarted Spark: last week when we upgraded Spark from 1.4.1 
to 1.5.1, and two weeks ago when the physical memory was exhausted.
(1) http://i.stack.imgur.com/P4DE3.png

As can be seen at the bottom of this mail (2), the Spark executor process uses 
approx. 62GB of memory, while the heap size max is set to 20GB. This means the 
off-heap memory usage is approx. 42GB.

Some info:
 - We use Spark Streaming lib.
 - Our code is written in Java.
 - We run Oracle Java v1.7.0_76
 - Data is read from Kafka (Kafka runs on different boxes).
 - Data is written to Cassandra (Cassandra runs on different boxes).
 - 1 Spark master and 3 Spark executors/workers, running on 4 separate boxes.
 - We recently upgraded Spark from 1.3.1 to 1.4.1 and 1.5.1 and the memory 
usage pattern is identical on all those versions.
 - We recently enabled `spark.worker.cleanup.enabled` and set 
`spark.worker.cleanup.interval` to 7200 (seconds), but this doesn't change the 
situation.

What can be the cause of this ever-increasing off-heap memory use?

PS: I've posted this question on StackOverflow yesterday: 
http://stackoverflow.com/questions/33668035/spark-executors-off-heap-memory-usage-keeps-increasing

(2)
$ ps aux | grep 40724
USER         PID %CPU %MEM      VSZ      RSS TTY  STAT START    TIME COMMAND
apache-+ 40724  140 47.1 75678780 62181644 ?   Sl   Nov06 11782:27 
/usr/lib/jvm/java-7-oracle/jre/bin/java -cp 
/opt/spark-1.5.1-bin-hadoop2.4/conf/:/opt/spark-1.5.1-bin-hadoop2.4/lib/spark-assembly-1.5.1-hadoop2.4.0.jar:/opt/spark-1.5.1-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar:/opt/spark-1.5.1-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar:/opt/spark-1.5.1-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar
 -Xms20480M -Xmx20480M -Dspark.driver.port=7201 -Dspark.blockManager.port=7206 
-Dspark.executor.port=7202 -Dspark.broadcast.port=7204 
-Dspark.fileserver.port=7203 -Dspark.replClassServer.port=7205 
-XX:MaxPermSize=256m org.apache.spark.executor.CoarseGrainedExecutorBackend 
--driver-url 
akka.tcp://sparkdri...@xxx.xxx.xxx.xxx:7201/user/CoarseGrainedScheduler 
--executor-id 2 --hostname xxx.xxx.xxx.xxx --cores 10 --app-id 
app-20151106125547- --worker-url 
akka.tcp://sparkwor...@xxx.xxx.xxx.xxx:7200/user/Worker
$ sudo -u apache-spark jps
40724 CoarseGrainedExecutorBackend
40517 Worker
30664 Jps
$ sudo -u apache-spark jstat -gc 40724
 S0C      S1C      S0U      S1U  EC        EU        OC         OU         PC       PU       YGC    YGCT      FGC  FGCT    GCT
 158720.0 157184.0 110339.8 0.0  6674944.0 1708036.1 13981184.0 2733206.2  59904.0  59551.9  41944  1737.864  39   13.464  1751.328
$ sudo -u apache-spark jps -v
40724 CoarseGrainedExecutorBackend -Xms20480M -Xmx20480M 
-Dspark.driver.port=7201 -Dspark.blockManager.port=7206 
-Dspark.executor.port=7202 -Dspark.broadcast.port=7204 
-Dspark.fileserver.port=7203 -Dspark.replClassServer.port=7205 
-XX:MaxPermSize=256m
40517 Worker -Xms2048m -Xmx2048m -XX:MaxPermSize=256m
10693 Jps -Dapplication.home=/usr/lib/jvm/java-7-oracle -Xms8m

Kind regards,

Balthasar Schopman
Software Developer
LeaseWeb Technologies B.V.

T: +31 20 316 0232
M:
E: b.schop...@tech.leaseweb.com
W: http://www.leaseweb.com

Luttenbergweg 8, 1101 EC Amsterdam, Netherlands






--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Off-heap-memory-usage-of-Spark-Executors-keeps-increasing-tp25398.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: thought experiment: use spark ML to real time prediction

2015-11-17 Thread Nick Pentreath
I think the issue with pulling in all of spark-core is often with
dependencies (and versions) conflicting with the web framework (or Akka in
many cases). Plus it really is quite heavy if you just want a fairly
lightweight model-serving app. For example we've built a fairly simple but
scalable ALS factor model server on Scalatra, Akka and Breeze. So all you
really need is the web framework and Breeze (or an alternative linear
algebra lib).

I definitely hear the pain-point that PMML might not be able to handle some
types of transformations or models that exist in Spark. However, here's an
example from scikit-learn -> PMML that may be instructive (
https://github.com/scikit-learn/scikit-learn/issues/1596 and
https://github.com/jpmml/jpmml-sklearn), where a fairly impressive list of
estimators and transformers are supported (including e.g. scaling and
encoding, and PCA).

I definitely think the current model I/O and "export" or "deploy to
production" situation needs to be improved substantially. However, you are
left with the following options:

(a) build out a lightweight "spark-ml-common" project that brings in the
dependencies needed for production scoring / transformation in independent
apps. However, here you only support Scala/Java - what about R and Python?
Also, what about the distributed models? Perhaps "local" wrappers can be
created, though this may not work for very large factor or LDA models. See
also H20 example http://docs.h2o.ai/h2oclassic/userguide/scorePOJO.html

(b) build out Spark's PMML support, and add missing stuff to PMML where
possible. The benefit here is an existing standard with various tools for
scoring (via REST server, Java app, Pig, Hive, various language support).

(c) build out a more comprehensive I/O, serialization and scoring
framework. Here you face the issue of supporting various predictors and
transformers generically, across platforms and versioning. i.e. you're
re-creating a new standard like PMML

Option (a) is do-able, but I'm a bit concerned that it may be too "Spark
specific", or even too "Scala / Java" specific. But it is still potentially
very useful to Spark users to build this out and have a somewhat standard
production serving framework and/or library (there are obviously existing
options like PredictionIO etc).

Option (b) is really building out the existing PMML support within Spark,
so a lot of the initial work has already been done. I know some folks had
(or have) licensing issues with some components of JPMML (e.g. the
evaluator and REST server). But perhaps the solution here is to build an
Apache2-licensed evaluator framework.

Option (c) is obviously interesting - "let's build a better PMML (that uses
JSON or whatever instead of XML!)". But it also seems like a huge amount of
reinventing the wheel, and like any new standard would take time to garner
wide support (if at all).

It would be really useful to start to understand what the main missing
pieces are in PMML - perhaps the lowest-hanging fruit is simply to
contribute improvements or additions to PMML.



On Fri, Nov 13, 2015 at 11:46 AM, Sabarish Sasidharan <
sabarish.sasidha...@manthan.com> wrote:

> That may not be an issue if the app using the models runs by itself (not
> bundled into an existing app), which may actually be the right way to
> design it considering separation of concerns.
>
> Regards
> Sab
>
> On Fri, Nov 13, 2015 at 9:59 AM, DB Tsai  wrote:
>
>> This will bring the whole dependencies of spark will may break the web
>> app.
>>
>>
>> Sincerely,
>>
>> DB Tsai
>> --
>> Web: https://www.dbtsai.com
>> PGP Key ID: 0xAF08DF8D
>>
>> On Thu, Nov 12, 2015 at 8:15 PM, Nirmal Fernando  wrote:
>>
>>>
>>>
>>> On Fri, Nov 13, 2015 at 2:04 AM, darren  wrote:
>>>
 I agree 100%. Making the model requires large data and many cpus.

 Using it does not.

 This is a very useful side effect of ML models.

 If MLlib can't use models outside Spark, that's a real shame.

>>>
>>> Well, you can, as mentioned earlier. You don't need the Spark runtime for
>>> predictions; save the serialized model and deserialize it to use it (you
>>> need the Spark jars on the classpath, though).
>>>


 Sent from my Verizon Wireless 4G LTE smartphone


  Original message 
 From: "Kothuvatiparambil, Viju" <
 viju.kothuvatiparam...@bankofamerica.com>
 Date: 11/12/2015 3:09 PM (GMT-05:00)
 To: DB Tsai , Sean Owen 
 Cc: Felix Cheung , Nirmal Fernando <
 nir...@wso2.com>, Andy Davidson ,
 Adrian Tanase , "user @spark" ,
 Xiangrui Meng , hol...@pigscanfly.ca
 Subject: RE: thought experiment: use spark ML to real time prediction

 I am glad to see DB’s comments, 

Re: Distributing Python code packaged as tar balls

2015-11-17 Thread Praveen Chundi

Thank you for the reply. I am using zip files for now.
The documentation for 1.5.2 mentions the use of zip files or eggs; a 'note' 
that tars are not supported might be helpful to some.

https://spark.apache.org/docs/latest/submitting-applications.html
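
For reference, the zip route I am using looks roughly like this (the package
and file names are placeholders):

  zip -r deps.zip mypackage/
  spark-submit --py-files deps.zip main.py
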

Best Regards,
Praveen

On 14.11.2015 00:40, Davies Liu wrote:

Python does not support libraries as tar balls, so PySpark may not support
that either.

On Wed, Nov 4, 2015 at 5:40 AM, Praveen Chundi  wrote:

Hi,

PySpark/spark-submit offers a --py-files option to distribute Python code
for execution. Currently (version 1.5) only zip files seem to be supported; I
have tried distributing tar balls unsuccessfully.

Is it worth adding support for tar balls?

Best regards,
Praveen Chundi

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org




-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: zeppelin (or spark-shell) with HBase fails on executor level

2015-11-17 Thread 임정택
Oh, one thing I missed: I built the Spark 1.4.1 cluster on a 6-node Mesos
0.22.1 H/A (via ZK) cluster.

2015-11-17 18:01 GMT+09:00 임정택 :

> Hi all,
>
> I'm evaluating Zeppelin to run a driver which interacts with HBase.
> I use a fat jar to include the HBase dependencies, and I see failures at the
> executor level.
> I thought it was a Zeppelin issue, but it fails in spark-shell, too.
>
> I loaded the fat jar via the --jars option,
>
> > ./bin/spark-shell --jars hbase-included-assembled.jar
>
> and ran the driver code using the provided SparkContext instance, and I see
> failures in the spark-shell console and executor logs.
>
> below is stack traces,
>
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 55 in 
> stage 0.0 failed 4 times, most recent failure: Lost task 55.3 in stage 0.0 
> (TID 281, ): java.lang.NoClassDefFoundError: Could not 
> initialize class org.apache.hadoop.hbase.client.HConnectionManager
> at org.apache.hadoop.hbase.client.HTable.(HTable.java:197)
> at org.apache.hadoop.hbase.client.HTable.(HTable.java:159)
> at 
> org.apache.hadoop.hbase.mapreduce.TableInputFormat.setConf(TableInputFormat.java:101)
> at 
> org.apache.spark.rdd.NewHadoopRDD$$anon$1.(NewHadoopRDD.scala:128)
> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:104)
> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:66)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:70)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
>
> Driver stacktrace:
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1273)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1264)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1263)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1263)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
> at scala.Option.foreach(Option.scala:236)
> at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1457)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1418)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>
>
> 15/11/16 18:59:57 ERROR Executor: Exception in task 14.0 in stage 0.0 (TID 14)
> java.lang.ExceptionInInitializerError
> at org.apache.hadoop.hbase.client.HTable.(HTable.java:197)
> at org.apache.hadoop.hbase.client.HTable.(HTable.java:159)
> at 
> org.apache.hadoop.hbase.mapreduce.TableInputFormat.setConf(TableInputFormat.java:101)
> at 
> org.apache.spark.rdd.NewHadoopRDD$$anon$1.(NewHadoopRDD.scala:128)
> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:104)
> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:66)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:70)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at 

Re: YARN Labels

2015-11-17 Thread Steve Loughran



One of our clusters runs on AWS with a portion of the nodes being spot nodes. 
We would like to force the application master not to run on spot nodes. For 
what ever reason, application master is not able to recover in cases the node 
where it was running suddenly disappears, which is the case with spot nodes.

Standard strategy here is to label the nodes "persistent" and "spot", submit 
the job into a queue which always launches in "spot".


Any guidance on this topic is appreciated.

Alex Rovner
Director, Data Engineering
o: 646.759.0052




Re: Spark Job is getting killed after certain hours

2015-11-17 Thread Steve Loughran

On 17 Nov 2015, at 02:00, Nikhil Gs 
> wrote:

Hello Team,

Below is the error which we are facing in our cluster 14 hours after starting 
the spark-submit job. We are not able to understand the issue and why it hits 
the error below after a certain time.

If any of you have faced the same scenario or have any idea, then please guide 
us. To identify the issue, if you need any other info then please let me know. 
Thanks a lot in advance.

Log Error:

15/11/16 04:54:48 ERROR ipc.AbstractRpcClient: SASL authentication failed. The 
most likely cause is missing or invalid credentials. Consider 'kinit'.
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: 
No valid credentials provided (Mechanism level: Failed to find any Kerberos 
tgt)]


I keep my list of causes of error messages online: 
https://steveloughran.gitbooks.io/kerberos_and_hadoop/content/sections/errors.html

Spark only supports long-lived work on a Kerberos cluster from 1.5+, with a 
keytab being supplied to the job. Without this, the YARN client grabs some 
tickets at launch time and hangs on until they expire, which for you is 14 hours.

(For anyone using ticket-at-launch auth, know that Spark 1.5.0-1.5.2 doesn't 
talk to Hive on a kerberized cluster; some reflection-related issues weren't 
picked up during testing. 1.5.3 will fix this.)
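
A sketch of what supplying the keytab looks like on 1.5+ (the principal, keytab 
path and class/jar names below are placeholders, not taken from this thread):

  spark-submit --master yarn-cluster \
    --principal spark-job@EXAMPLE.COM \
    --keytab /etc/security/keytabs/spark-job.keytab \
    --class com.example.StreamingJob \
    streaming-job.jar
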


zeppelin (or spark-shell) with HBase fails on executor level

2015-11-17 Thread 임정택
Hi all,

I'm evaluating Zeppelin to run a driver which interacts with HBase.
I use a fat jar to include the HBase dependencies, and I see failures at the
executor level.
I thought it was a Zeppelin issue, but it fails in spark-shell, too.

I loaded the fat jar via the --jars option,

> ./bin/spark-shell --jars hbase-included-assembled.jar

and ran the driver code using the provided SparkContext instance, and I see
failures in the spark-shell console and executor logs.

below is stack traces,

org.apache.spark.SparkException: Job aborted due to stage failure:
Task 55 in stage 0.0 failed 4 times, most recent failure: Lost task
55.3 in stage 0.0 (TID 281, ):
java.lang.NoClassDefFoundError: Could not initialize class
org.apache.hadoop.hbase.client.HConnectionManager
at org.apache.hadoop.hbase.client.HTable.(HTable.java:197)
at org.apache.hadoop.hbase.client.HTable.(HTable.java:159)
at 
org.apache.hadoop.hbase.mapreduce.TableInputFormat.setConf(TableInputFormat.java:101)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.(NewHadoopRDD.scala:128)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:104)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:66)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1273)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1264)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1263)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1263)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1457)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1418)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)


15/11/16 18:59:57 ERROR Executor: Exception in task 14.0 in stage 0.0 (TID 14)
java.lang.ExceptionInInitializerError
at org.apache.hadoop.hbase.client.HTable.(HTable.java:197)
at org.apache.hadoop.hbase.client.HTable.(HTable.java:159)
at 
org.apache.hadoop.hbase.mapreduce.TableInputFormat.setConf(TableInputFormat.java:101)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.(NewHadoopRDD.scala:128)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:104)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:66)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: hbase-default.xml file seems to
be for and old version of HBase (null), this version is
0.98.6-cdh5.2.0
at 
org.apache.hadoop.hbase.HBaseConfiguration.checkDefaultsVersion(HBaseConfiguration.java:73)
at 

Re: zeppelin (or spark-shell) with HBase fails on executor level

2015-11-17 Thread 임정택
I just made it work from both sides (Zeppelin, spark-shell) by initializing
another SparkContext and running the job there.
But since that feels like a workaround, I'd love to hear the proper way (or
cleaner workarounds) to resolve this.
Please let me know if you have any suggestions.
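
For reference, one direction I am experimenting with (not verified end to end)
is to skip the hbase-default.xml version check that the executors trip over,
since the shaded resource in the fat jar reports its version as null. A rough
sketch, with the table name as a placeholder:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

val hbaseConf = HBaseConfiguration.create()
// The merged hbase-default.xml loses its version marker, which is what
// HBaseConfiguration.checkDefaultsVersion rejects, so skip that check.
hbaseConf.setBoolean("hbase.defaults.for.version.skip", true)
hbaseConf.set(TableInputFormat.INPUT_TABLE, "some_table")

val hbaseRdd = sc.newAPIHadoopRDD(hbaseConf,
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])
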

Best,
Jungtaek Lim (HeartSaVioR)

2015-11-17 18:06 GMT+09:00 임정택 :

> Oh, one thing I missed: I built the Spark 1.4.1 cluster on a 6-node Mesos
> 0.22.1 H/A (via ZK) cluster.
>
> 2015-11-17 18:01 GMT+09:00 임정택 :
>
>> Hi all,
>>
>> I'm evaluating Zeppelin to run a driver which interacts with HBase.
>> I use a fat jar to include the HBase dependencies, and I see failures at the
>> executor level.
>> I thought it was a Zeppelin issue, but it fails in spark-shell, too.
>>
>> I loaded the fat jar via the --jars option,
>>
>> > ./bin/spark-shell --jars hbase-included-assembled.jar
>>
>> and ran the driver code using the provided SparkContext instance, and I see
>> failures in the spark-shell console and executor logs.
>>
>> below is stack traces,
>>
>> org.apache.spark.SparkException: Job aborted due to stage failure: Task 55 
>> in stage 0.0 failed 4 times, most recent failure: Lost task 55.3 in stage 
>> 0.0 (TID 281, ): java.lang.NoClassDefFoundError: Could not 
>> initialize class org.apache.hadoop.hbase.client.HConnectionManager
>> at org.apache.hadoop.hbase.client.HTable.(HTable.java:197)
>> at org.apache.hadoop.hbase.client.HTable.(HTable.java:159)
>> at 
>> org.apache.hadoop.hbase.mapreduce.TableInputFormat.setConf(TableInputFormat.java:101)
>> at 
>> org.apache.spark.rdd.NewHadoopRDD$$anon$1.(NewHadoopRDD.scala:128)
>> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:104)
>> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:66)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>> at 
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>> at 
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
>> at 
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>> at org.apache.spark.scheduler.Task.run(Task.scala:70)
>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>> at 
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> at 
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> at java.lang.Thread.run(Thread.java:745)
>>
>> Driver stacktrace:
>> at 
>> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1273)
>> at 
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1264)
>> at 
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1263)
>> at 
>> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>> at 
>> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1263)
>> at 
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
>> at 
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
>> at scala.Option.foreach(Option.scala:236)
>> at 
>> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
>> at 
>> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1457)
>> at 
>> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1418)
>> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>>
>>
>> 15/11/16 18:59:57 ERROR Executor: Exception in task 14.0 in stage 0.0 (TID 
>> 14)
>> java.lang.ExceptionInInitializerError
>> at org.apache.hadoop.hbase.client.HTable.(HTable.java:197)
>> at org.apache.hadoop.hbase.client.HTable.(HTable.java:159)
>> at 
>> org.apache.hadoop.hbase.mapreduce.TableInputFormat.setConf(TableInputFormat.java:101)
>> at 
>> org.apache.spark.rdd.NewHadoopRDD$$anon$1.(NewHadoopRDD.scala:128)
>> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:104)
>> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:66)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>> at 
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>> at 
>> 

Re: spark-submit stuck and no output in console

2015-11-17 Thread Kayode Odeyemi
Thanks for the reply Sonal.

I'm on JDK 7 (/usr/lib/jvm/java-7-oracle)

My env is a YARN cluster made of 7 nodes (6 datanodes/
node manager, 1 namenode/resource manager).

The namenode is where I executed the spark-submit job, while on one of
the datanodes I executed 'hadoop fs -put /binstore /user/hadoop-user/' to
dump 1TB of data into all the datanodes. That process is still running
without hassle and is only using 1.3 GB of its 1.7g heap space.

Initially, I submitted 2 jobs to the YARN cluster which ran for 2
days and then suddenly stopped. Nothing in the logs shows the root cause.


On Tue, Nov 17, 2015 at 11:42 AM, Sonal Goyal  wrote:

> Could it be jdk related ? Which version are you on?
>
> Best Regards,
> Sonal
> Founder, Nube Technologies 
> Reifier at Strata Hadoop World
> 
> Reifier at Spark Summit 2015
> 
>
> 
>
>
>
> On Tue, Nov 17, 2015 at 2:48 PM, Kayode Odeyemi  wrote:
>
>> Anyone experienced this issue as well?
>>
>> On Mon, Nov 16, 2015 at 8:06 PM, Kayode Odeyemi 
>> wrote:
>>
>>>
>>> Or are you saying that the Java process never even starts?
>>>
>>>
>>> Exactly.
>>>
>>> Here's what I got back from jstack as expected:
>>>
>>> hadoop-user@yks-hadoop-m01:/usr/local/spark/bin$ jstack 31316
>>> 31316: Unable to open socket file: target process not responding or
>>> HotSpot VM not loaded
>>> The -F option can be used when the target process is not responding
>>> hadoop-user@yks-hadoop-m01:/usr/local/spark/bin$ jstack 31316 -F
>>> Attaching to core -F from executable 31316, please wait...
>>> Error attaching to core file: Can't attach to the core file
>>>
>>>
>>>
>>
>>
>> --
>> Odeyemi 'Kayode O.
>> http://ng.linkedin.com/in/kayodeodeyemi. t: @charyorde
>>
>
>


-- 
Odeyemi 'Kayode O.
http://ng.linkedin.com/in/kayodeodeyemi. t: @charyorde


Re: spark-submit stuck and no output in console

2015-11-17 Thread Sonal Goyal
Could it be jdk related ? Which version are you on?

Best Regards,
Sonal
Founder, Nube Technologies 
Reifier at Strata Hadoop World

Reifier at Spark Summit 2015






On Tue, Nov 17, 2015 at 2:48 PM, Kayode Odeyemi  wrote:

> Anyone experienced this issue as well?
>
> On Mon, Nov 16, 2015 at 8:06 PM, Kayode Odeyemi  wrote:
>
>>
>> Or are you saying that the Java process never even starts?
>>
>>
>> Exactly.
>>
>> Here's what I got back from jstack as expected:
>>
>> hadoop-user@yks-hadoop-m01:/usr/local/spark/bin$ jstack 31316
>> 31316: Unable to open socket file: target process not responding or
>> HotSpot VM not loaded
>> The -F option can be used when the target process is not responding
>> hadoop-user@yks-hadoop-m01:/usr/local/spark/bin$ jstack 31316 -F
>> Attaching to core -F from executable 31316, please wait...
>> Error attaching to core file: Can't attach to the core file
>>
>>
>>
>
>
> --
> Odeyemi 'Kayode O.
> http://ng.linkedin.com/in/kayodeodeyemi. t: @charyorde
>


Re: spark-submit stuck and no output in console

2015-11-17 Thread Steve Loughran

On 17 Nov 2015, at 09:54, Kayode Odeyemi 
> wrote:

Initially, I submitted 2 jobs to the YARN cluster which ran for 2 days and 
then suddenly stopped. Nothing in the logs shows the root cause.

48 hours is one of those kerberos warning times (as is 24h, 72h and 7 days) 


How to return a pair RDD from an RDD that has foreachPartition applied?

2015-11-17 Thread swetha
Hi,

How do I return an RDD of key/value pairs from an RDD that has
foreachPartition applied? My code looks something like the following. It
appears that foreachPartition can only have a return type of Unit. How do I
apply foreachPartition, do a save, and at the same time return a pair RDD?

 def saveDataPointsBatchNew(records: RDD[(String, (Long,
     java.util.LinkedHashMap[java.lang.Long, java.lang.Float],
     java.util.LinkedHashMap[java.lang.Long, java.lang.Float],
     java.util.HashSet[java.lang.String], Boolean))]) = {
   records.foreachPartition({ partitionOfRecords =>
     val dataLoader = new DataLoaderImpl();
     var metricList = new java.util.ArrayList[String]();
     var storageTimeStamp = 0l

     if (partitionOfRecords != null) {
       partitionOfRecords.foreach(record => {
         if (record._2._1 == 0l) {
           entrySet = record._2._3.entrySet()
           itr = entrySet.iterator();
           while (itr.hasNext()) {
             val entry = itr.next();
             storageTimeStamp = entry.getKey.toLong
             val dayCounts = entry.getValue
             metricsDayCounts += record._1 -> (storageTimeStamp, dayCounts.toFloat)
           }
         }
       })
     }

     //Code to insert the last successful batch/streaming timestamp ends
     dataLoader.saveDataPoints(metricList);
     metricList = null
   })
 }
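
One pattern that may give both effects (a sketch only, with the record type
simplified and reusing the DataLoaderImpl from the snippet above): replace
foreachPartition with mapPartitions, which returns an Iterator, so the save
happens as a side effect while the key/value pairs flow out as a new pair RDD.
Note that mapPartitions is lazy, so the save only runs once an action is
called on the result.

import org.apache.spark.rdd.RDD

def saveAndReturn(records: RDD[(String, Long)]): RDD[(String, Float)] = {
  records.mapPartitions({ partitionOfRecords =>
    val dataLoader = new DataLoaderImpl()            // same loader as above
    val buffered = partitionOfRecords.toList         // materialize once, use twice
    val metricList = new java.util.ArrayList[String]()
    buffered.foreach { case (metric, _) => metricList.add(metric) }
    dataLoader.saveDataPoints(metricList)            // side-effecting save
    buffered.iterator.map { case (metric, value) => (metric, value.toFloat) }
  }, preservesPartitioning = true)
}

// e.g. val out = saveAndReturn(records); out.cache(); out.count()
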



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-return-a-pair-RDD-from-an-RDD-that-has-foreachPartition-applied-tp25411.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Mesos cluster dispatcher doesn't respect most args from the submit req

2015-11-17 Thread Timothy Chen
Hi Jo,

Thanks for the links. I would have expected the properties to be in the
scheduler properties, but I need to double check.

I'll be looking into these problems this week.

Tim

On Tue, Nov 17, 2015 at 10:28 AM, Jo Voordeckers
 wrote:
> On Tue, Nov 17, 2015 at 5:16 AM, Iulian Dragoș 
> wrote:
>>
>> I think it actually tries to send all properties as part of
>> `SPARK_EXECUTOR_OPTS`, which may not be everything that's needed:
>>
>>
>> https://github.com/jayv/spark/blob/mesos_cluster_params/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosClusterScheduler.scala#L375-L377
>>
>
> Aha, that's interesting; I overlooked that line. I'll debug some more today,
> because I know for sure that those options didn't make it onto the
> command line when I was running it in my debugger.
>
>>
>> Can you please open a Jira ticket and describe also the symptoms? This
>> might be related, or the same issue: SPARK-11280 and also SPARK-11327
>
>
> SPARK-11327 is exactly my problem, but I don't run docker.
>
>  - Jo
>
>> On Tue, Nov 17, 2015 at 2:46 AM, Jo Voordeckers 
>> wrote:
>>>
>>>
>>> Hi all,
>>>
>>> I'm running the mesos cluster dispatcher, however when I submit jobs with
>>> things like jvm args, classpath order and UI port aren't added to the
>>> commandline executed by the mesos scheduler. In fact it only cares about the
>>> class, jar and num cores/mem.
>>>
>>>
>>> https://github.com/jayv/spark/blob/mesos_cluster_params/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosClusterScheduler.scala#L412-L424
>>>
>>> I've made an attempt at adding a few of the args that I believe are
>>> useful to the MesosClusterScheduler class, which seems to solve my problem.
>>>
>>> Please have a look:
>>>
>>> https://github.com/apache/spark/pull/9752
>>>
>>> Thanks
>>>
>>> - Jo Voordeckers
>>>
>>
>>
>>
>> --
>>
>> --
>> Iulian Dragos
>>
>> --
>> Reactive Apps on the JVM
>> www.typesafe.com
>>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Is there a way to delete task history besides using a ttl?

2015-11-17 Thread Jonathan Coveney
Reading the code, is there any reason why
setting spark.cleaner.ttl.MAP_OUTPUT_TRACKER directly won't get picked up?
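
Concretely, something like this is what I have in mind (whether it actually
takes effect end to end is exactly the question):

import org.apache.spark.{SparkConf, SparkContext}

// Set only the MAP_OUTPUT_TRACKER cleaner's ttl (in seconds) and leave the
// global spark.cleaner.ttl unset, hoping to expire old task/shuffle metadata
// without also expiring cached RDDs and broadcast variables.
val conf = new SparkConf()
  .setAppName("selective-metadata-cleanup")
  .set("spark.cleaner.ttl.MAP_OUTPUT_TRACKER", "3600")
val sc = new SparkContext(conf)
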

2015-11-17 14:45 GMT-05:00 Jonathan Coveney :

> So I have the following...
>
> broadcast some stuff
> cache an RDD
> do a bunch of stuff, eventually calling actions which reduce it to an
> acceptable size
>
> I'm getting an OOM on the driver (well, GC is getting out of control),
> largely because I have a lot of partitions and it looks like the job
> history is getting too large. ttl is an option, but the downside is that it
> will also delete the RDD... this isn't really what I want. What I want is to
> keep my in-memory data structures (the RDD, broadcast variables, etc.) but
> get rid of the old metadata that I don't need anymore (i.e. tasks that have
> executed).
>
> Is there a way to achieve this?
>


Re: thought experiment: use spark ML to real time prediction

2015-11-17 Thread DB Tsai
I was thinking about working on a better version of PMML ("JMML" in JSON), but
as you said, this requires a dedicated team to define the standard, which
would be a huge amount of work. However, options (b) and (c) still don't
address the distributed-models issue. In fact, most models in production have
to be small enough to return results to users within reasonable latency, so I
doubt the usefulness of distributed models in real production use-cases. For R
and Python, we can build a wrapper on top of the lightweight "spark-ml-common"
project.


Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D

On Tue, Nov 17, 2015 at 2:29 AM, Nick Pentreath 
wrote:

> I think the issue with pulling in all of spark-core is often with
> dependencies (and versions) conflicting with the web framework (or Akka in
> many cases). Plus it really is quite heavy if you just want a fairly
> lightweight model-serving app. For example we've built a fairly simple but
> scalable ALS factor model server on Scalatra, Akka and Breeze. So all you
> really need is the web framework and Breeze (or an alternative linear
> algebra lib).
>
> I definitely hear the pain-point that PMML might not be able to handle
> some types of transformations or models that exist in Spark. However,
> here's an example from scikit-learn -> PMML that may be instructive (
> https://github.com/scikit-learn/scikit-learn/issues/1596 and
> https://github.com/jpmml/jpmml-sklearn), where a fairly impressive list
> of estimators and transformers are supported (including e.g. scaling and
> encoding, and PCA).
>
> I definitely think the current model I/O and "export" or "deploy to
> production" situation needs to be improved substantially. However, you are
> left with the following options:
>
> (a) build out a lightweight "spark-ml-common" project that brings in the
> dependencies needed for production scoring / transformation in independent
> apps. However, here you only support Scala/Java - what about R and Python?
> Also, what about the distributed models? Perhaps "local" wrappers can be
> created, though this may not work for very large factor or LDA models. See
> also H20 example http://docs.h2o.ai/h2oclassic/userguide/scorePOJO.html
>
> (b) build out Spark's PMML support, and add missing stuff to PMML where
> possible. The benefit here is an existing standard with various tools for
> scoring (via REST server, Java app, Pig, Hive, various language support).
>
> (c) build out a more comprehensive I/O, serialization and scoring
> framework. Here you face the issue of supporting various predictors and
> transformers generically, across platforms and versioning. i.e. you're
> re-creating a new standard like PMML
>
> Option (a) is do-able, but I'm a bit concerned that it may be too "Spark
> specific", or even too "Scala / Java" specific. But it is still potentially
> very useful to Spark users to build this out and have a somewhat standard
> production serving framework and/or library (there are obviously existing
> options like PredictionIO etc).
>
> Option (b) is really building out the existing PMML support within Spark,
> so a lot of the initial work has already been done. I know some folks had
> (or have) licensing issues with some components of JPMML (e.g. the
> evaluator and REST server). But perhaps the solution here is to build an
> Apache2-licensed evaluator framework.
>
> Option (c) is obviously interesting - "let's build a better PMML (that
> uses JSON or whatever instead of XML!)". But it also seems like a huge
> amount of reinventing the wheel, and like any new standard would take time
> to garner wide support (if at all).
>
> It would be really useful to start to understand what the main missing
> pieces are in PMML - perhaps the lowest-hanging fruit is simply to
> contribute improvements or additions to PMML.
>
>
>
> On Fri, Nov 13, 2015 at 11:46 AM, Sabarish Sasidharan <
> sabarish.sasidha...@manthan.com> wrote:
>
>> That may not be an issue if the app using the models runs by itself (not
>> bundled into an existing app), which may actually be the right way to
>> design it considering separation of concerns.
>>
>> Regards
>> Sab
>>
>> On Fri, Nov 13, 2015 at 9:59 AM, DB Tsai  wrote:
>>
>>> This will bring the whole dependencies of spark will may break the web
>>> app.
>>>
>>>
>>> Sincerely,
>>>
>>> DB Tsai
>>> --
>>> Web: https://www.dbtsai.com
>>> PGP Key ID: 0xAF08DF8D
>>>
>>> On Thu, Nov 12, 2015 at 8:15 PM, Nirmal Fernando 
>>> wrote:
>>>


 On Fri, Nov 13, 2015 at 2:04 AM, darren  wrote:

> I agree 100%. Making the model requires large data and many cpus.
>
> Using it does not.
>
> This is a very useful side effect of ML models.
>
> If MLlib can't use models outside Spark, that's a real shame.
>

Re: WARN LoadSnappy: Snappy native library not loaded

2015-11-17 Thread Andy Davidson
I forgot to mention. I am using spark-1.5.1-bin-hadoop2.6

From:  Andrew Davidson 
Date:  Tuesday, November 17, 2015 at 2:26 PM
To:  "user @spark" 
Subject:  Re: WARN LoadSnappy: Snappy native library not loaded

> FYI
> 
> After 17 min. only 26112/228155 have succeeded
> 
> This seems very slow
> 
> Kind regards
> 
> Andy
> 
> 
> 
> From:  Andrew Davidson 
> Date:  Tuesday, November 17, 2015 at 2:22 PM
> To:  "user @spark" 
> Subject:  WARN LoadSnappy: Snappy native library not loaded
> 
> 
>> I started a Spark POC. I created an EC2 cluster on AWS using spark-ec2. I
>> have 3 slaves. In general I am running into trouble even with small
>> workloads. I am using IPython notebooks running on my Spark cluster.
>> Everything is painfully slow. I am using the standalone cluster manager.
>> I noticed that I am getting the following warning on my driver console.
>> Any idea what the problem might be?
>> 
>> 
>> 
>> 15/11/17 22:01:59 WARN MetricsSystem: Using default name DAGScheduler for
>> source because spark.app.id is not set.
>> 15/11/17 22:03:05 WARN NativeCodeLoader: Unable to load native-hadoop
>> library for your platform... using builtin-java classes where applicable
>> 15/11/17 22:03:05 WARN LoadSnappy: Snappy native library not loaded
>> 
>> 
>> 
>> Here is an overview of my POS app. I have a file on hdfs containing about
>> 5000 twitter status strings.
>> 
>> tweetStrings = sc.textFile(dataURL)
>> 
>> jTweets = (tweetStrings.map(lambda x: json.loads(x)).take(10))
>> 
>> 
>> Generated the following error "error occurred while calling
>> o78.partitions.: java.lang.OutOfMemoryError: GC overhead limit exceeded"
>>
>> Any idea what we need to do to improve new spark user's out of the box
>> experience?
>> 
>> Kind regards
>> 
>> Andy
>> 
>> export PYSPARK_PYTHON=python3.4
>> export PYSPARK_DRIVER_PYTHON=python3.4
>> export IPYTHON_OPTS="notebook --no-browser --port=7000 --log-level=WARN"
>> 
>> MASTER_URL=spark://ec2-55-218-207-122.us-west-1.compute.amazonaws.com:7077
>> 
>> 
>> numCores=2
>> $SPARK_ROOT/bin/pyspark --master $MASTER_URL --total-executor-cores
>> $numCores $*




Any way to get raw score from MultilayerPerceptronClassificationModel ?

2015-11-17 Thread Robert Dodier
Hi,

I'd like to get the raw prediction score from a
MultilayerPerceptronClassificationModel. It appears that the 'predict'
method only returns the argmax of the largest score in the output
layer (line 200 in MultilayerPerceptronClassificationModel.scala in
Spark 1.5.2).

Is there any way to get the raw score? It is computed as
mlpModel.predict(features) in the source code. It appears that I can't
access mlpModel since it is private in
MultilayerPerceptronClassificationModel so I couldn't just grab
mlpModel and call predict on it. Is there another way?

Thanks for any light you can shed on this question.

Robert Dodier

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Working with RDD from Java

2015-11-17 Thread Bryan Cutler
Hi Ivan,

Since Spark 1.4.1 there is a Java-friendly function in LDAModel to get the
topic distributions called javaTopicDistributions() that returns a
JavaPairRDD.  If you aren't able to upgrade, you can check out the
conversion used here
https://github.com/apache/spark/blob/v1.4.1/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala#L350
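
The conversion there is essentially a cast plus JavaPairRDD.fromRDD; a small
Scala shim along these lines (a sketch, not tested against 1.3.1) should let
your Java code consume the result directly:

import org.apache.spark.api.java.JavaPairRDD
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

object LdaJavaBridge {
  // Wraps the Scala RDD[(Long, Vector)] from topicDistributions() into a
  // JavaPairRDD[java.lang.Long, Vector] that is usable from Java.
  def toJavaTopicDistributions(
      dist: RDD[(Long, Vector)]): JavaPairRDD[java.lang.Long, Vector] =
    JavaPairRDD.fromRDD(dist.asInstanceOf[RDD[(java.lang.Long, Vector)]])
}
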

-bryan

On Tue, Nov 17, 2015 at 3:06 AM, frula00 
wrote:

> Hi,
> I'm working in Java, with Spark 1.3.1 - I am trying to extract data from
> the
> RDD returned by
> org.apache.spark.mllib.clustering.DistributedLDAModel.topicDistributions()
> (return type is RDD>). How do I work with it from
> within Java, I can't seem to cast it to JavaPairRDD nor JavaRDD and if I
> try
> to collect it it simply returns an Object?
>
> Thank you for your help in advance!
>
> Ivan
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Working-with-RDD-from-Java-tp25399.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Invocation of StreamingContext.stop() hangs in 1.5

2015-11-17 Thread jiten
Hi,

 We're using Spark 1.5 Streaming. We have a use case where we need to stop
an existing StreamingContext and start a new one, primarily to handle a newly
added partition of a Kafka topic by creating a new Kafka DStream in the
context of the new StreamingContext.

We've implemented "StreamingListener" trait and invoking
"StreamingContext.stop(false, false)" in "onBatchCompleted" event. In order
not to stop the underlying SparkContext, we've specified
""spark.streaming.stopSparkContextByDefault" to false. The above invocation
never returns.

   Here is the partial stack trace of the JVM 


"StreamingListenerBus" daemon prio=10 tid=0x7fcc1000e800 nid=0x1027 in
Object.wait() [0x7fcc170ef000]
   java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at java.lang.Thread.join(Thread.java:1281)
- locked <0x00070688b0c0> (a
org.apache.spark.util.AsynchronousListenerBus$$anon$1)
at java.lang.Thread.join(Thread.java:1355)
at
org.apache.spark.util.AsynchronousListenerBus.stop(AsynchronousListenerBus.scala:167)
at
org.apache.spark.streaming.scheduler.JobScheduler.stop(JobScheduler.scala:114)
- locked <0x00070688a878> (a
org.apache.spark.streaming.scheduler.JobScheduler)
at
org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:704)
- locked <0x00070624e890> (a
org.apache.spark.streaming.StreamingContext)
at
org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:683)
- locked <0x00070624e890> (a
org.apache.spark.streaming.StreamingContext)
at
com.verizon.bda.manager.ApplicationManager$.restartWorflow(ApplicationManager.scala:367)
at
com.verizon.bda.manager.ApplicationListener.onBatchCompleted(ApplicationListener.scala:28)
at
org.apache.spark.streaming.scheduler.StreamingListenerBus.onPostEvent(StreamingListenerBus.scala:45)
at
org.apache.spark.streaming.scheduler.StreamingListenerBus.onPostEvent(StreamingListenerBus.scala:26)
at
org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:56)
at
org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37)
at
org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:79)
at
org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1136)
at
org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:63)

"JobScheduler" daemon prio=10 tid=0x7fcc1000b000 nid=0x1026 waiting on
condition [0x7fcccd25e000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0x0007125fd760> (a
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
at
java.util.concurrent.LinkedBlockingDeque.takeFirst(LinkedBlockingDeque.java:489)
at
java.util.concurrent.LinkedBlockingDeque.take(LinkedBlockingDeque.java:678)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:46)


Thanks,
Jiten



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Invocation-of-StreamingContext-stop-hangs-in-1-5-tp25402.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: WARN LoadSnappy: Snappy native library not loaded

2015-11-17 Thread Andy Davidson
FYI

After 17 min. only 26112/228155 have succeeded

This seems very slow

Kind regards

Andy



From:  Andrew Davidson 
Date:  Tuesday, November 17, 2015 at 2:22 PM
To:  "user @spark" 
Subject:  WARN LoadSnappy: Snappy native library not loaded


>I started a Spark POC. I created an EC2 cluster on AWS using spark-ec2. I
>have 3 slaves. In general I am running into trouble even with small
>workloads. I am using IPython notebooks running on my Spark cluster.
>Everything is painfully slow. I am using the standalone cluster manager.
>I noticed that I am getting the following warning on my driver console.
>Any idea what the problem might be?
>
>
>
>15/11/17 22:01:59 WARN MetricsSystem: Using default name DAGScheduler for
>source because spark.app.id is not set.
>15/11/17 22:03:05 WARN NativeCodeLoader: Unable to load native-hadoop
>library for your platform... using builtin-java classes where applicable
>15/11/17 22:03:05 WARN LoadSnappy: Snappy native library not loaded
>
>
>
>Here is an overview of my POS app. I have a file on hdfs containing about
>5000 twitter status strings.
>
>tweetStrings = sc.textFile(dataURL)
>
>jTweets = (tweetStrings.map(lambda x: json.loads(x)).take(10))
>
>
>Generated the following error "error occurred while calling
>o78.partitions.: java.lang.OutOfMemoryError: GC overhead limit exceeded"
>
>Any idea what we need to do to improve new spark user's out of the box
>experience?
>
>Kind regards
>
>Andy
>
>export PYSPARK_PYTHON=python3.4
>export PYSPARK_DRIVER_PYTHON=python3.4
>export IPYTHON_OPTS="notebook --no-browser --port=7000 --log-level=WARN"
>
>MASTER_URL=spark://ec2-55-218-207-122.us-west-1.compute.amazonaws.com:7077
>
>
>numCores=2
>$SPARK_ROOT/bin/pyspark --master $MASTER_URL --total-executor-cores
>$numCores $*



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Invocation of StreamingContext.stop() hangs in 1.5

2015-11-17 Thread Ted Yu
I don't think you should call ssc.stop() in StreamingListenerBus thread.

Please stop the context asynchronously.
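
Something along these lines should keep the listener bus thread free (a
sketch; the StreamingContext reference and the restart condition are whatever
you already have in your listener):

import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

class RestartListener(ssc: StreamingContext, shouldRestart: () => Boolean)
  extends StreamingListener {

  override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
    if (shouldRestart()) {
      // Hand the stop off to a separate thread so the StreamingListenerBus
      // thread that delivered this event is not the one waiting on the stop.
      new Thread("stop-streaming-context") {
        override def run(): Unit = {
          ssc.stop(stopSparkContext = false, stopGracefully = false)
          // ...build and start the replacement StreamingContext here...
        }
      }.start()
    }
  }
}
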

BTW I have a pending PR:
https://github.com/apache/spark/pull/9741

On Tue, Nov 17, 2015 at 1:50 PM, jiten  wrote:

> Hi,
>
>  We're using Spark 1.5 Streaming. We have a use case where we need to
> stop
> an existing StreamingContext and start a new one, primarily to handle a
> newly
> added partition of a Kafka topic by creating a new Kafka DStream in the
> context of the new StreamingContext.
>
> We've implemented the "StreamingListener" trait and invoke
> "StreamingContext.stop(false, false)" in the "onBatchCompleted" event. In order
> not to stop the underlying SparkContext, we've set
> "spark.streaming.stopSparkContextByDefault" to false. The above invocation
> never returns.
>
>Here is the partial stack trace of the JVM
>
>
> "StreamingListenerBus" daemon prio=10 tid=0x7fcc1000e800 nid=0x1027 in
> Object.wait() [0x7fcc170ef000]
>java.lang.Thread.State: WAITING (on object monitor)
> at java.lang.Object.wait(Native Method)
> at java.lang.Thread.join(Thread.java:1281)
> - locked <0x00070688b0c0> (a
> org.apache.spark.util.AsynchronousListenerBus$$anon$1)
> at java.lang.Thread.join(Thread.java:1355)
> at
>
> org.apache.spark.util.AsynchronousListenerBus.stop(AsynchronousListenerBus.scala:167)
> at
>
> org.apache.spark.streaming.scheduler.JobScheduler.stop(JobScheduler.scala:114)
> - locked <0x00070688a878> (a
> org.apache.spark.streaming.scheduler.JobScheduler)
> at
>
> org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:704)
> - locked <0x00070624e890> (a
> org.apache.spark.streaming.StreamingContext)
> at
>
> org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:683)
> - locked <0x00070624e890> (a
> org.apache.spark.streaming.StreamingContext)
> at
>
> com.verizon.bda.manager.ApplicationManager$.restartWorflow(ApplicationManager.scala:367)
> at
>
> com.verizon.bda.manager.ApplicationListener.onBatchCompleted(ApplicationListener.scala:28)
> at
>
> org.apache.spark.streaming.scheduler.StreamingListenerBus.onPostEvent(StreamingListenerBus.scala:45)
> at
>
> org.apache.spark.streaming.scheduler.StreamingListenerBus.onPostEvent(StreamingListenerBus.scala:26)
> at
> org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:56)
> at
>
> org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37)
> at
>
> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:79)
> at
> org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1136)
> at
>
> org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:63)
>
> "JobScheduler" daemon prio=10 tid=0x7fcc1000b000 nid=0x1026 waiting on
> condition [0x7fcccd25e000]
>java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0x0007125fd760> (a
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> at
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
> at
>
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
> at
>
> java.util.concurrent.LinkedBlockingDeque.takeFirst(LinkedBlockingDeque.java:489)
> at
> java.util.concurrent.LinkedBlockingDeque.take(LinkedBlockingDeque.java:678)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:46)
>
>
> Thanks,
> Jiten
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Invocation-of-StreamingContext-stop-hangs-in-1-5-tp25402.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


WARN LoadSnappy: Snappy native library not loaded

2015-11-17 Thread Andy Davidson
I started a spark POC. I created an EC2 cluster on AWS using spark-ec2. I have
3 slaves. In general I am running into trouble even with small workloads. I
am using IPython notebooks running on my spark cluster. Everything is
painfully slow. I am using the standalone cluster manager. I noticed that I
am getting the following warning on my driver console. Any idea what the
problem might be?



15/11/17 22:01:59 WARN MetricsSystem: Using default name DAGScheduler for
source because spark.app.id is not set.

15/11/17 22:03:05 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable

15/11/17 22:03:05 WARN LoadSnappy: Snappy native library not loaded



Here is an overview of my POC app. I have a file on hdfs containing about
5000 twitter status strings.

tweetStrings = sc.textFile(dataURL)
jTweets = (tweetStrings.map(lambda x: json.loads(x)).take(10))

Generated the following error "error occurred while calling o78.partitions.:
java.lang.OutOfMemoryError: GC overhead limit exceeded"

Any idea what we need to do to improve new spark user's out of the box
experience?

Kind regards

Andy

export PYSPARK_PYTHON=python3.4

export PYSPARK_DRIVER_PYTHON=python3.4

export IPYTHON_OPTS="notebook --no-browser --port=7000 --log-level=WARN"

MASTER_URL=spark://ec2-55-218-207-122.us-west-1.compute.amazonaws.com:7077


numCores=2

$SPARK_ROOT/bin/pyspark --master $MASTER_URL --total-executor-cores
$numCores $*
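
A knob that may be worth experimenting with here (purely a guess, the sizes
below are illustrative): the launch above uses the default driver and executor
memory, which can be raised on the pyspark command line, e.g.

$SPARK_ROOT/bin/pyspark --master $MASTER_URL \
  --total-executor-cores $numCores \
  --driver-memory 2g \
  --executor-memory 2g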






RE: spark with breeze error of NoClassDefFoundError

2015-11-17 Thread Jack Yang
So weird. Is there anything wrong with the way I wrote the pom file (I marked
the breeze dependencies as provided)?

Is there a missing jar I forgot to add via “--jar”?

See the trace below:



Exception in thread "main" java.lang.NoClassDefFoundError: 
breeze/storage/DefaultArrayValue
at smartapp.smart.sparkwithscala.textMingApp.main(textMingApp.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:664)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: breeze.storage.DefaultArrayValue
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
... 10 more
15/11/18 17:15:15 INFO util.Utils: Shutdown hook called


From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: Wednesday, 18 November 2015 4:01 PM
To: Jack Yang
Cc: user@spark.apache.org
Subject: Re: spark with breeze error of NoClassDefFoundError

Looking in local maven repo, breeze_2.10-0.7.jar contains DefaultArrayValue :

jar tvf 
/Users/tyu/.m2/repository//org/scalanlp/breeze_2.10/0.7/breeze_2.10-0.7.jar | 
grep !$
jar tvf 
/Users/tyu/.m2/repository//org/scalanlp/breeze_2.10/0.7/breeze_2.10-0.7.jar | 
grep DefaultArrayValue
   369 Wed Mar 19 11:18:32 PDT 2014 
breeze/storage/DefaultArrayValue$mcZ$sp$class.class
   309 Wed Mar 19 11:18:32 PDT 2014 
breeze/storage/DefaultArrayValue$mcJ$sp.class
  2233 Wed Mar 19 11:18:32 PDT 2014 
breeze/storage/DefaultArrayValue$DoubleDefaultArrayValue$.class

Can you show the complete stack trace ?

FYI

On Tue, Nov 17, 2015 at 8:33 PM, Jack Yang 
> wrote:
Hi all,
I am using Spark 1.4.0 and building my code with Maven.
In one of my Scala files, I used:

import breeze.linalg._

// cosine similarity of the two sparse vectors
val v1 = new breeze.linalg.SparseVector(commonVector.indices,
  commonVector.values, commonVector.size)
val v2 = new breeze.linalg.SparseVector(commonVector2.indices,
  commonVector2.values, commonVector2.size)
println(v1.dot(v2) / (norm(v1) * norm(v2)))



in my pom.xml file, I used:

<dependency>
  <groupId>org.scalanlp</groupId>
  <artifactId>breeze-math_2.10</artifactId>
  <version>0.4</version>
  <scope>provided</scope>
</dependency>

<dependency>
  <groupId>org.scalanlp</groupId>
  <artifactId>breeze_2.10</artifactId>
  <version>0.11.2</version>
  <scope>provided</scope>
</dependency>


When submitting, I included the breeze jars (breeze_2.10-0.11.2.jar,
breeze-math_2.10-0.4.jar, breeze-natives_2.10-0.11.2.jar,
breeze-process_2.10-0.3.jar) via the “--jar” argument, although I doubt that is
necessary.
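
For reference, the submit command looks roughly like this (the application jar
name is a placeholder, and note that --jars takes a single comma-separated
list):

spark-submit \
  --class smartapp.smart.sparkwithscala.textMingApp \
  --jars breeze_2.10-0.11.2.jar,breeze-math_2.10-0.4.jar,breeze-natives_2.10-0.11.2.jar,breeze-process_2.10-0.3.jar \
  textming-app.jar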

however, the error is

Exception in thread "main" java.lang.NoClassDefFoundError: 
breeze/storage/DefaultArrayValue

Any thoughts?



Best regards,
Jack




Re: Additional Master daemon classpath

2015-11-17 Thread memorypr...@gmail.com
Have you tried using
spark.driver.extraClassPath
and
spark.executor.extraClassPath

?

AFAICT these config options replace SPARK_CLASSPATH. Further info in the
docs. I've had good luck with these options, and for ease of use I just set
them in the spark defaults config.

https://spark.apache.org/docs/latest/configuration.html
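
For illustration, the spark-defaults.conf entries would look something like
this (the path below is a guess, point it at wherever the EmrFS jars actually
live on your nodes):

spark.driver.extraClassPath    /usr/share/aws/emr/emrfs/lib/*
spark.executor.extraClassPath  /usr/share/aws/emr/emrfs/lib/*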

On Tue, 17 Nov 2015 at 21:06 Michal Klos  wrote:

> Hi,
>
> We are running a Spark Standalone cluster on EMR (note: not using YARN)
> and are trying to use S3 w/ EmrFS as our event logging directory.
>
> We are having difficulties with a ClassNotFoundException on EmrFileSystem
> when we navigate to the event log screen. This is to be expected as the
> EmrFs jars are not on the classpath.
>
> But -- I have not been able to figure out a way to add additional
> classpath jars to the start-up of the Master daemon. SPARK_CLASSPATH has
> been deprecated, and looking around at spark-class, etc.. everything seems
> to be pretty locked down.
>
> Do I have to shove everything into the assembly jar?
>
> Am I missing a simple way to add classpath to the daemons?
>
> thanks,
> Michal
>


Re: Streaming Job gives error after changing to version 1.5.2

2015-11-17 Thread Tathagata Das
Can you verify that the cluster is running the correct version of Spark, 1.5.2?
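
A quick way to check is to print the version from a spark-shell attached to
the cluster (just a sketch):

// Should print 1.5.2 if the cluster-side jars match the application build
println(sc.version)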

On Tue, Nov 17, 2015 at 7:23 PM, swetha kasireddy  wrote:

> Sorry, the compile scope makes it work locally. But the cluster
> still seems to have issues with the provided scope. Basically it
> does not seem to process any records; no data is shown in any of the tabs
> of the Streaming UI except the Streaming tab. Executors, Storage, Stages
> etc. show empty RDDs.
>
> On Tue, Nov 17, 2015 at 7:19 PM, swetha kasireddy <
> swethakasire...@gmail.com> wrote:
>
>> Hi TD,
>>
>> Basically, I see two issues. With the provided scope the job does
>> not start locally. It does start in the cluster, but it seems no data is
>> getting processed.
>>
>> Thanks,
>> Swetha
>>
>> On Tue, Nov 17, 2015 at 7:04 PM, Tim Barthram 
>> wrote:
>>
>>> If you are running a local context, could it be that you should use:
>>>
>>>
>>>
>>> <scope>provided</scope>
>>>
>>>
>>>
>>> ?
>>>
>>>
>>>
>>> Thanks,
>>>
>>> Tim
>>>
>>>
>>>
>>> *From:* swetha kasireddy [mailto:swethakasire...@gmail.com]
>>> *Sent:* Wednesday, 18 November 2015 2:01 PM
>>> *To:* Tathagata Das
>>> *Cc:* user
>>> *Subject:* Re: Streaming Job gives error after changing to version 1.5.2
>>>
>>>
>>>
>>> This error I see locally.
>>>
>>>
>>>
>>> On Tue, Nov 17, 2015 at 5:44 PM, Tathagata Das 
>>> wrote:
>>>
>>> Are you running a 1.5.2-compiled jar on a Spark 1.5.2 cluster?
>>>
>>>
>>>
>>> On Tue, Nov 17, 2015 at 5:34 PM, swetha 
>>> wrote:
>>>
>>>
>>>
>>> Hi,
>>>
>>> I see  java.lang.NoClassDefFoundError after changing the Streaming job
>>> version to 1.5.2. Any idea as to why this is happening? Following are my
>>> dependencies and the error that I get.
>>>
>>> <dependency>
>>>   <groupId>org.apache.spark</groupId>
>>>   <artifactId>spark-core_2.10</artifactId>
>>>   <version>${sparkVersion}</version>
>>>   <scope>provided</scope>
>>> </dependency>
>>>
>>> <dependency>
>>>   <groupId>org.apache.spark</groupId>
>>>   <artifactId>spark-streaming_2.10</artifactId>
>>>   <version>${sparkVersion}</version>
>>>   <scope>provided</scope>
>>> </dependency>
>>>
>>> <dependency>
>>>   <groupId>org.apache.spark</groupId>
>>>   <artifactId>spark-sql_2.10</artifactId>
>>>   <version>${sparkVersion}</version>
>>>   <scope>provided</scope>
>>> </dependency>
>>>
>>> <dependency>
>>>   <groupId>org.apache.spark</groupId>
>>>   <artifactId>spark-hive_2.10</artifactId>
>>>   <version>${sparkVersion}</version>
>>>   <scope>provided</scope>
>>> </dependency>
>>>
>>> <dependency>
>>>   <groupId>org.apache.spark</groupId>
>>>   <artifactId>spark-streaming-kafka_2.10</artifactId>
>>>   <version>${sparkVersion}</version>
>>> </dependency>
>>>
>>>
>>> Exception in thread "main" java.lang.NoClassDefFoundError:
>>> org/apache/spark/streaming/StreamingContext
>>> at java.lang.Class.getDeclaredMethods0(Native Method)
>>> at java.lang.Class.privateGetDeclaredMethods(Class.java:2693)
>>> at java.lang.Class.privateGetMethodRecursive(Class.java:3040)
>>> at java.lang.Class.getMethod0(Class.java:3010)
>>> at java.lang.Class.getMethod(Class.java:1776)
>>> at
>>> com.intellij.rt.execution.application.AppMain.main(AppMain.java:125)
>>> Caused by: java.lang.ClassNotFoundException:
>>> org.apache.spark.streaming.StreamingContext
>>> at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
>>> at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
>>> at java.security.AccessController.doPrivileged(Native Method)
>>> at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>>> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Streaming-Job-gives-error-after-changing-to-version-1-5-2-tp25406.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>>
>>>
>>>
>>> _
>>>
>>> The information transmitted in this message and its attachments (if any)
>>> is intended
>>> only for the person or entity to which it is addressed.
>>> The message may contain confidential and/or privileged material. Any
>>> review,
>>> retransmission, dissemination or other use of, or taking of any action
>>> in reliance
>>> upon this information, by persons or entities other than the intended
>>> recipient is
>>> prohibited.
>>>
>>> If you have received this in error, please contact the sender and delete
>>> this e-mail
>>> and associated material from any computer.
>>>
>>> The intended recipient of this e-mail may only use, reproduce, disclose
>>> or distribute
>>> the information contained in this e-mail and any attached files, with
>>> the permission
>>> of the