Re: Spark on Kubernetes (minikube) 2.3 fails with class not found exception

2018-04-10 Thread Dmitry
amples/*jars* > /spark-examples_2.11-2.3.0.jar. > > On Tue, Apr 10, 2018 at 1:34 AM, Dmitry <frostb...@gmail.com> wrote: > >> Hello, I spent a lot of time trying to find what I did wrong, but found nothing. >> I have a minikube Windows-based cluster (Hyper-V as hypervisor) and try

Spark on Kubernetes (minikube) 2.3 fails with class not found exception

2018-04-10 Thread Dmitry
Hello, I spent a lot of time trying to find what I did wrong, but found nothing. I have a minikube Windows-based cluster (Hyper-V as hypervisor) and am trying to run examples against Spark 2.3. I tried several Docker image builds: * several builds that I built myself * andrusha/spark-k8s:2.3.0-hadoop2.7 from

Distribution of spark 3.0.1 with Hive1.2

2020-11-10 Thread Dmitry
Hi all, I am trying to make a Spark 3.0.1 distribution using ./dev/make-distribution.sh --name spark3-hive12 --pip --tgz -Phive-1.2 -Phadoop-2.7 -Pyarn. The problem is that Maven can't find the right profile for Hive, and the build ends without the Hive jars: ++ /Users/reireirei/spark/spark/build/mvn

Re: JdbcRDD, ClassCastException with scala.Function0

2015-02-18 Thread Dmitry Goldenberg
Thanks, Cody. Yes, I originally started off by looking at that but I get a compile error if I try and use that approach: constructor JdbcRDD in class JdbcRDD<T> cannot be applied to given types. Not to mention that JavaJdbcRDDSuite somehow manages to not pass in the class tag (the last argument).

Re: JdbcRDD, ClassCastException with scala.Function0

2015-02-18 Thread Dmitry Goldenberg
/apache/spark/JavaJdbcRDDSuite.java#L90 is calling a static method JdbcRDD.create, not new JdbcRDD. Is that what you tried doing? On Wed, Feb 18, 2015 at 12:00 PM, Dmitry Goldenberg dgoldenberg...@gmail.com wrote: Thanks, Cody. Yes, I originally started off by looking at that but I get

Re: Class loading issue, spark.files.userClassPathFirst doesn't seem to be working

2015-02-18 Thread Dmitry Goldenberg
I'm not sure what "on the driver" means, but I've tried setting spark.files.userClassPathFirst to true, in $SPARK_HOME/conf/spark-defaults.conf and also in the SparkConf programmatically; it appears to be ignored. The solution was to follow Emre's recommendation and downgrade the selected Solrj
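
For reference, a minimal sketch of setting the class-path-first flags programmatically. The property names changed across releases (spark.files.userClassPathFirst is the older, executor-only name; later releases use separate driver and executor variants), so treat the exact keys as release-dependent and the app name as a placeholder:

    import org.apache.spark.SparkConf

    // Sketch only: prefer jars supplied with the job over Spark's own classpath.
    val conf = new SparkConf()
      .setAppName("solrj-indexer")
      .set("spark.files.userClassPathFirst", "true")     // older (pre-1.3), executor-only name
      .set("spark.executor.userClassPathFirst", "true")  // later executor-side name
      .set("spark.driver.userClassPathFirst", "true")    // later driver-side name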

Re: JdbcRDD, ClassCastException with scala.Function0

2015-02-18 Thread Dmitry Goldenberg
...@koeninger.org wrote: Is sc there a SparkContext or a JavaSparkContext? The compilation error seems to indicate the former, but JdbcRDD.create expects the latter On Wed, Feb 18, 2015 at 12:30 PM, Dmitry Goldenberg dgoldenberg...@gmail.com wrote: I have tried that as well, I get a compile
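
For orientation, a sketch of the Scala-side constructor that this thread contrasts with the Java-facing JdbcRDD.create; the JDBC URL, query, and bounds are placeholders, and the SQL must contain two '?' markers for the partition bounds:

    import java.sql.{DriverManager, ResultSet}
    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.JdbcRDD

    // Sketch: the Scala constructor takes a () => Connection (a Function0), which is
    // what trips up direct use from Java; Java code should instead go through
    // JdbcRDD.create with a JavaSparkContext.
    def loadRows(sc: SparkContext): JdbcRDD[String] =
      new JdbcRDD(
        sc,
        () => DriverManager.getConnection("jdbc:mysql://db-host/mydb", "user", "pass"),
        "SELECT name FROM items WHERE id >= ? AND id <= ?",
        1L, 1000000L, 4,
        (rs: ResultSet) => rs.getString(1))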

Implementing FIRST_VALUE, LEAD, LAG in Spark

2015-02-13 Thread Dmitry Tolpeko
other options? Please point me to any drawbacks. Thanks, Dmitry

Re: Implementing FIRST_VALUE, LEAD, LAG in Spark

2015-02-17 Thread Dmitry Tolpeko
feedback. Dmitry On Fri, Feb 13, 2015 at 11:54 AM, Dmitry Tolpeko dmtolp...@gmail.com wrote: Hello, To convert existing Map Reduce jobs to Spark, I need to implement window functions such as FIRST_VALUE, LEAD, LAG and so on. For example, FIRST_VALUE function: Source (1st column is key): A, A1

Re: Class loading issue, spark.files.userClassPathFirst doesn't seem to be working

2015-02-18 Thread Dmitry Goldenberg
Are you proposing I downgrade Solrj's httpclient dependency to be on par with that of Spark/Hadoop? Or upgrade Spark/Hadoop's httpclient to the latest? Solrj has to stay with its selected version. I could try and rebuild Spark with the latest httpclient but I've no idea what effects that may

Re: Class loading issue, spark.files.userClassPathFirst doesn't seem to be working

2015-02-18 Thread Dmitry Goldenberg
, Emre Sevinc emre.sev...@gmail.com wrote: Hello Dmitry, I had almost the same problem and solved it by using version 4.0.0 of SolrJ: <dependency> <groupId>org.apache.solr</groupId> <artifactId>solr-solrj</artifactId> <version>4.0.0</version> </dependency> In my case, I was lucky

Re: Class loading issue, spark.files.userClassPathFirst doesn't seem to be working

2015-02-18 Thread Dmitry Goldenberg
Thanks, Emre! Will definitely try this. On Wed, Feb 18, 2015 at 11:00 AM, Emre Sevinc emre.sev...@gmail.com wrote: On Wed, Feb 18, 2015 at 4:54 PM, Dmitry Goldenberg dgoldenberg...@gmail.com wrote: Thank you, Emre. It seems solrj still depends on HttpClient 4.1.3; would that not collide

Re: Class loading issue, spark.files.userClassPathFirst doesn't seem to be working

2015-02-18 Thread Dmitry Goldenberg
I think I'm going to have to rebuild Spark with commons.httpclient.version set to 4.3.1 which looks to be the version chosen by Solrj, rather than the 4.2.6 that Spark's pom mentions. Might work. On Wed, Feb 18, 2015 at 1:37 AM, Arush Kharbanda ar...@sigmoidanalytics.com wrote: Hi Did you

Re: JdbcRDD, ClassCastException with scala.Function0

2015-02-18 Thread Dmitry Goldenberg
the org.apache.spark.api.java.function.Function interface and pass an instance of that to JdbcRDD.create ? On Wed, Feb 18, 2015 at 3:48 PM, Dmitry Goldenberg dgoldenberg...@gmail.com wrote: Cody, you were right, I had a copy and paste snag where I ended up with a vanilla SparkContext rather than a Java one. I also had

Re: JdbcRDD, ClassCastException with scala.Function0

2015-02-19 Thread Dmitry Goldenberg
, ConnectionFactory is an interface defined inside JdbcRDD, not scala Function0 On Wed, Feb 18, 2015 at 4:50 PM, Dmitry Goldenberg dgoldenberg...@gmail.com wrote: That's exactly what I was doing. However, I ran into runtime issues with doing that. For instance, I had a public class DbConnection

Re: JdbcRDD, ClassCastException with scala.Function0

2015-02-19 Thread Dmitry Goldenberg
. Of course, a numeric primary key is going to be the most efficient way to do that. On Thu, Feb 19, 2015 at 8:57 AM, Dmitry Goldenberg dgoldenberg...@gmail.com wrote: Yup, I did see that. Good point though, Cody. The mismatch was happening for me when I was trying to get the 'new JdbcRDD

Re: Apache Spark User List: people's responses not showing in the browser view

2015-03-19 Thread Dmitry Goldenberg
=Apache+Spark+User+List+people+s+responses+not+showing+in+the+browser+view On Mar 18, 2015, at 4:47 PM, Dmitry Goldenberg dgoldenberg...@gmail.com wrote: Thanks, Ted. Well, so far even there I'm only seeing my post and not, for example, your response. On Wed, Mar 18, 2015 at 7:28 PM, Ted Yu

Re: Apache Spark User List: people's responses not showing in the browser view

2015-03-19 Thread Dmitry Goldenberg
, 2015 at 10:49 AM Dmitry Goldenberg dgoldenberg...@gmail.com wrote: It seems that those archives are not necessarily easy to find stuff in. Is there a search engine on top of them? so as to find e.g. your own posts easily? On Thu, Mar 19, 2015 at 10:34 AM, Nicholas Chammas nicholas.cham

Re: How to stream all data out of a Kafka topic once, then terminate job?

2015-04-29 Thread Dmitry Goldenberg
Yes, and Kafka topics are basically queues. So perhaps what's needed is just KafkaRDD with starting offset being 0 and finish offset being a very large number... Sent from my iPhone On Apr 29, 2015, at 1:52 AM, ayan guha guha.a...@gmail.com wrote: I guess what you mean is not streaming.

Re: How to stream all data out of a Kafka topic once, then terminate job?

2015-04-29 Thread Dmitry Goldenberg
Part of the issue is that when you read messages in a topic, the messages are peeked, not polled, so there is no notion of "when the queue is empty", as I understand it. So it would seem I'd want to do KafkaUtils.createRDD, which takes an array of OffsetRange's. Each OffsetRange is characterized by topic,
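
A minimal sketch of the createRDD approach described above, assuming the Spark 1.x Kafka 0.8 connector; the broker list, topic, partitions, and offsets are placeholders:

    import kafka.serializer.StringDecoder
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

    // Sketch: read a fixed slice of a topic as a batch RDD instead of a stream.
    val sc = new SparkContext(new SparkConf().setAppName("kafka-batch-read"))
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")

    // One OffsetRange per topic partition: (topic, partition, fromOffset, untilOffset).
    val offsetRanges = Array(
      OffsetRange("my-topic", 0, 0L, 100000L),
      OffsetRange("my-topic", 1, 0L, 100000L))

    val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
      sc, kafkaParams, offsetRanges)
    println(s"Read ${rdd.count()} messages")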

Re: How to stream all data out of a Kafka topic once, then terminate job?

2015-04-29 Thread Dmitry Goldenberg
. You can then kill the job after the first batch. It's possible you may be able to kill the job from a StreamingListener.onBatchCompleted, but I've never tried and don't know what the consequences may be. On Wed, Apr 29, 2015 at 8:52 AM, Dmitry Goldenberg dgoldenberg...@gmail.com wrote
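
One way the suggestion above could look; stopping the context from inside the listener callback can block the listener bus, so this sketch hands stop() off to a separate thread (untested, as the reply itself cautions):

    import org.apache.spark.streaming.StreamingContext
    import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

    // Sketch: stop the streaming context once the first batch has completed.
    class StopAfterFirstBatch(ssc: StreamingContext) extends StreamingListener {
      @volatile private var done = false
      override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
        if (!done) {
          done = true
          new Thread(new Runnable {
            def run(): Unit = ssc.stop(stopSparkContext = true, stopGracefully = true)
          }).start()
        }
      }
    }
    // Usage: ssc.addStreamingListener(new StopAfterFirstBatch(ssc)); ssc.start(); ssc.awaitTermination()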

Re: Running Spark in local mode seems to ignore local[N]

2015-05-11 Thread Dmitry Goldenberg
Sean, How does this model actually work? Let's say we want to run one job as N threads executing one particular task, e.g. streaming data out of Kafka into a search engine. How do we configure our Spark job execution? Right now, I'm seeing this job running as a single thread. And it's quite a

Re: Running Spark in local mode seems to ignore local[N]

2015-05-11 Thread Dmitry Goldenberg
RDD is determined by the block interval and the batch interval. If you have a batch interval of 10s and block interval of 1s you'll get 10 partitions of data in the RDD. On Mon, May 11, 2015 at 10:29 PM, Dmitry Goldenberg dgoldenberg...@gmail.com wrote: Understood. We'll use the multi-threaded
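
A sketch of the two intervals being contrasted above, for receiver-based streams, where partitions per batch is roughly the batch interval divided by the block interval; the values are illustrative:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Sketch: a 10s batch interval with a 1s block interval yields about 10 blocks,
    // hence ~10 partitions (tasks), per batch for receiver-based input.
    val conf = new SparkConf()
      .setAppName("block-interval-example")
      .set("spark.streaming.blockInterval", "1s")  // default 200ms; older 1.x releases expect plain millis, e.g. "1000"
    val ssc = new StreamingContext(conf, Seconds(10))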

Re: Running Spark in local mode seems to ignore local[N]

2015-05-11 Thread Dmitry Goldenberg
then. If you run code on your driver, it's not distributed. If you run Spark over an RDD with 1 partition, only one task works on it. On Mon, May 11, 2015 at 10:16 PM, Dmitry Goldenberg dgoldenberg...@gmail.com wrote: Sean, How does this model actually work? Let's say we want to run one job as N

Re: Running Spark in local mode seems to ignore local[N]

2015-05-11 Thread Dmitry Goldenberg
have a batch interval of 10s and block interval of 1s you'll get 10 partitions of data in the RDD. On Mon, May 11, 2015 at 10:29 PM, Dmitry Goldenberg dgoldenberg...@gmail.com wrote: Understood. We'll use the multi-threaded code we already have.. How are these execution slots filled up? I

Re: Spark and RabbitMQ

2015-05-12 Thread Dmitry Goldenberg
Thanks, Akhil. It looks like in the second example, for Rabbit they're doing this: https://www.rabbitmq.com/mqtt.html. On Tue, May 12, 2015 at 7:37 AM, Akhil Das ak...@sigmoidanalytics.com wrote: I found two examples Java version

Re: Spark Streaming and reducing latency

2015-05-18 Thread Dmitry Goldenberg
Thanks, Akhil. So what do folks typically do to increase/contract the capacity? Do you plug in some cluster auto-scaling solution to make this elastic? Does Spark have any hooks for instrumenting auto-scaling? In other words, how do you avoid overwhelming the receivers in a scenario when your

Re: How to share large resources like dictionaries while processing data with Spark ?

2015-06-04 Thread Dmitry Goldenberg
Thanks so much, Yiannis, Olivier, Huang! On Thu, Jun 4, 2015 at 6:44 PM, Yiannis Gkoufas johngou...@gmail.com wrote: Hi there, I would recommend checking out https://github.com/spark-jobserver/spark-jobserver which I think gives the functionality you are looking for. I haven't tested it

Re: How to share large resources like dictionaries while processing data with Spark ?

2015-06-05 Thread Dmitry Goldenberg
Dmitry's initial question – you can load large data sets as a batch (static) RDD from any Spark Streaming app and then join DStream RDDs against them to emulate "lookups"; you can also try the "Lookup RDD" – there is a GitHub project *From:* Dmitry Goldenberg [mailto:dgoldenberg...@gmail.com] *Sent

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-03 Thread Dmitry Goldenberg
” resources / nodes which allows you to double, triple etc in the background/parallel the resources of the currently running cluster I was thinking more about the scenario where you have e.g. 100 boxes and want to / can add e.g. 20 more *From:* Dmitry Goldenberg [mailto:dgoldenberg

Re: Objects serialized before foreachRDD/foreachPartition ?

2015-06-03 Thread Dmitry Goldenberg
Eftimov evo.efti...@isecc.com wrote: Dmitry was concerned about the “serialization cost” NOT the “memory footprint – hence option a) is still viable since a Broadcast is performed only ONCE for the lifetime of Driver instance *From:* Ted Yu [mailto:yuzhih...@gmail.com] *Sent:* Wednesday, June 3

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-03 Thread Dmitry Goldenberg
of the currently running cluster I was thinking more about the scenario where you have e.g. 100 boxes and want to / can add e.g. 20 more *From:* Dmitry Goldenberg [mailto:dgoldenberg...@gmail.com] *Sent:* Wednesday, June 3, 2015 4:46 PM *To:* Evo Eftimov *Cc:* Cody Koeninger; Andrew

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-03 Thread Dmitry Goldenberg
, spark streaming (spark) will NOT resort to disk – and of course resorting to disk from time to time (ie when there is no free RAM ) and taking a performance hit from that, BUT only until there is no free RAM *From:* Dmitry Goldenberg [mailto:dgoldenberg...@gmail.com] *Sent:* Thursday, May 28

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-03 Thread Dmitry Goldenberg
the resources of the currently running cluster I was thinking more about the scenario where you have e.g. 100 boxes and want to / can add e.g. 20 more *From:* Dmitry Goldenberg [mailto:dgoldenberg...@gmail.com] *Sent:* Wednesday, June 3, 2015 4:46 PM *To:* Evo Eftimov *Cc:* Cody Koeninger

Re: StreamingListener, anyone?

2015-06-04 Thread Dmitry Goldenberg
in awaitTermination. So what would be a way to trigger the termination in the driver? context.awaitTermination() allows the current thread to wait for the termination of a context by stop() or by an exception - presumably, we need to call stop() somewhere or perhaps throw. Cheers, - Dmitry On Thu, Jun

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-09 Thread Dmitry Goldenberg
unless you want it to. If you want it, just call cache. On Thu, Jun 4, 2015 at 8:20 AM, Dmitry Goldenberg dgoldenberg...@gmail.com wrote: set the storage policy for the DStream RDDs to MEMORY AND DISK - it appears the storage level can be specified in the createStream methods
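
For the storage-level point above, a sketch of passing an explicit level to the receiver-based createStream (0.8 connector); the ZooKeeper quorum, group id, and topic map are placeholders:

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.StreamingContext
    import org.apache.spark.streaming.kafka.KafkaUtils

    // Sketch: receiver blocks spill to disk rather than being dropped when memory is tight.
    def buildStream(ssc: StreamingContext) =
      KafkaUtils.createStream(
        ssc,
        "zk1:2181",                       // ZooKeeper quorum
        "my-consumer-group",              // consumer group id
        Map("my-topic" -> 2),             // topic -> receiver thread count
        StorageLevel.MEMORY_AND_DISK_SER)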

Re: Registering custom metrics

2015-06-22 Thread Dmitry Goldenberg
Great, thank you, Silvio. In your experience, is there any way to instrument a callback into Coda Hale or the Spark consumers from the metrics sink? If the sink performs some steps once it has received the metrics, I'd like to be able to make the consumers aware of that via some sort of a

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-11 Thread Dmitry Goldenberg
of the currently running cluster I was thinking more about the scenario where you have e.g. 100 boxes and want to / can add e.g. 20 more *From:* Dmitry Goldenberg [mailto:dgoldenberg...@gmail.com] *Sent:* Wednesday, June 3, 2015 4:46 PM *To:* Evo Eftimov *Cc:* Cody Koeninger; Andrew Or; Gerard

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-11 Thread Dmitry Goldenberg
? On Thu, Jun 11, 2015 at 7:30 AM, Cody Koeninger c...@koeninger.org wrote: Depends on what you're reusing multiple times (if anything). Read http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence On Wed, Jun 10, 2015 at 12:18 AM, Dmitry Goldenberg dgoldenberg

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-04 Thread Dmitry Goldenberg
Date:2015/05/28 13:22 (GMT+00:00) To: Dmitry Goldenberg Cc: Gerard Maas ,spark users Subject: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics? You can always spin new boxes in the background and bring them into the cluster fold when fully

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-05-28 Thread Dmitry Goldenberg
to time (ie when there is no free RAM ) and taking a performance hit from that, BUT only until there is no free RAM *From:* Dmitry Goldenberg [mailto:dgoldenberg...@gmail.com] *Sent:* Thursday, May 28, 2015 2:34 PM *To:* Evo Eftimov *Cc:* Gerard Maas; spark users *Subject:* Re: FW: Re

Re: How to monitor Spark Streaming from Kafka?

2015-06-01 Thread Dmitry Goldenberg
Thank you, Tathagata, Cody, Otis. - Dmitry On Mon, Jun 1, 2015 at 6:57 PM, Otis Gospodnetic otis.gospodne...@gmail.com wrote: I think you can use SPM - http://sematext.com/spm - it will give you all Spark and all Kafka metrics, including offsets broken down by topic, etc. out of the box

Re: Spark Streaming from Kafka - no receivers and spark.streaming.receiver.maxRate?

2015-05-27 Thread Dmitry Goldenberg
Got it, thank you, Tathagata and Ted. Could you comment on my other question http://apache-spark-user-list.1001560.n3.nabble.com/Autoscaling-Spark-cluster-based-on-topic-sizes-rate-of-growth-in-Kafka-or-Spark-s-metrics-tt23062.html as well? Basically, I'm trying to get a handle on a good

Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-05-28 Thread Dmitry Goldenberg
then are, a) will Spark sense the addition of a new node / is it sufficient that the cluster manager is aware, then work just starts flowing there? and b) what would be a way to gracefully remove a worker node when the load subsides, so that no currently running Spark job is killed? - Dmitry On Thu, May 28

Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-05-28 Thread Dmitry Goldenberg
Thanks, Evo. Per the last part of your comment, it sounds like we will need to implement a job manager which will be in control of starting the jobs, monitoring the status of the Kafka topic(s), shutting jobs down and marking them as ones to relaunch, scaling the cluster up/down by

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-05-28 Thread Dmitry Goldenberg
Eftimov Date:2015/05/28 13:22 (GMT+00:00) To: Dmitry Goldenberg Cc: Gerard Maas ,spark users Subject: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics? You can always spin new boxes in the background and bring them into the cluster fold when

Re: Migrate Relational to Distributed

2015-05-23 Thread Dmitry Tolpeko
it can help. Thanks, Dmitry On Sat, May 23, 2015 at 1:22 AM, Brant Seibert brantseib...@hotmail.com wrote: Hi, The healthcare industry can do wonderful things with Apache Spark. But, there is already a very large base of data and applications firmly rooted in the relational paradigm

Re: Checkpointing doesn't appear to be working for direct streaming from Kafka

2015-07-31 Thread Dmitry Goldenberg
wouldn't always be relevant anyway. On Fri, Jul 31, 2015 at 6:16 PM, Dmitry Goldenberg dgoldenberg...@gmail.com wrote: I've instrumented checkpointing per the programming guide and I can tell that Spark Streaming is creating the checkpoint directories but I'm not seeing any content being

Checkpointing doesn't appear to be working for direct streaming from Kafka

2015-07-31 Thread Dmitry Goldenberg
I've instrumented checkpointing per the programming guide and I can tell that Spark Streaming is creating the checkpoint directories but I'm not seeing any content being created in those directories nor am I seeing the effects I'd expect from checkpointing. I'd expect any data that comes into
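
For context, the minimal shape of the checkpointing setup from the programming guide that this message refers to; the checkpoint directory is a placeholder, and the whole processing graph must be defined inside the factory function for recovery to work:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Sketch: metadata checkpointing for driver recovery.
    val checkpointDir = "hdfs:///tmp/my-app-checkpoints"   // placeholder path

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("checkpointed-stream")
      val ssc = new StreamingContext(conf, Seconds(10))
      ssc.checkpoint(checkpointDir)
      // ... define the Kafka direct stream and all transformations here ...
      ssc
    }

    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()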

Re: Checkpointing doesn't appear to be working for direct streaming from Kafka

2015-07-31 Thread Dmitry Goldenberg
https://spark.apache.org/docs/latest/streaming-programming-guide.html suggests it was intended to be a multiple of the batch interval. The slide duration wouldn't always be relevant anyway. On Fri, Jul 31, 2015 at 6:16 PM, Dmitry Goldenberg dgoldenberg...@gmail.com wrote: I've instrumented

How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Dmitry Goldenberg
We're getting the below error. Tried increasing spark.executor.memory e.g. from 1g to 2g but the below error still happens. Any recommendations? Something to do with specifying -Xmx in the submit job scripts? Thanks. Exception in thread main java.lang.OutOfMemoryError: GC overhead limit

Re: How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Dmitry Goldenberg
a checkpoint whether we can estimate the size of the checkpoint and compare with Runtime.getRuntime().freeMemory(). If the size of checkpoint is much bigger than free memory, log warning, etc Cheers On Mon, Aug 10, 2015 at 9:34 AM, Dmitry Goldenberg dgoldenberg...@gmail.com wrote: Thanks, Cody

Re: How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Dmitry Goldenberg
, Aug 10, 2015 at 1:07 PM, Cody Koeninger c...@koeninger.org wrote: You need to keep a certain number of rdds around for checkpointing, based on e.g. the window size. Those would all need to be loaded at once. On Mon, Aug 10, 2015 at 11:49 AM, Dmitry Goldenberg dgoldenberg...@gmail.com wrote

Re: How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Dmitry Goldenberg
. Hefty file system usage, hefty memory consumption... What can we do to offset some of these costs? On Mon, Aug 10, 2015 at 4:27 PM, Cody Koeninger c...@koeninger.org wrote: The rdd is indeed defined by mostly just the offsets / topic partitions. On Mon, Aug 10, 2015 at 3:24 PM, Dmitry

Re: Checkpointing doesn't appear to be working for direct streaming from Kafka

2015-08-14 Thread Dmitry Goldenberg
. It sort of seems wrong though since https://spark.apache.org/docs/latest/streaming-programming-guide.html suggests it was intended to be a multiple of the batch interval. The slide duration wouldn't always be relevant anyway. On Fri, Jul 31, 2015 at 6:16 PM, Dmitry Goldenberg dgoldenberg

Re: Checkpointing doesn't appear to be working for direct streaming from Kafka

2015-08-14 Thread Dmitry Goldenberg
could terminate as the last batch is being processed... On Fri, Aug 14, 2015 at 6:17 PM, Cody Koeninger c...@koeninger.org wrote: You'll resume and re-process the rdd that didnt finish On Fri, Aug 14, 2015 at 1:31 PM, Dmitry Goldenberg dgoldenberg...@gmail.com wrote: Our additional question

Re: What are the .snapshot files in /home/spark/Snapshots?

2015-11-10 Thread Dmitry Goldenberg
N/m, these are just profiling snapshots :) Sorry for the wide distribution. On Tue, Nov 10, 2015 at 9:46 AM, Dmitry Goldenberg <dgoldenberg...@gmail.com > wrote: > We're seeing a bunch of .snapshot files being created under > /home/spark/Snapshots, such as the following

What are the .snapshot files in /home/spark/Snapshots?

2015-11-10 Thread Dmitry Goldenberg
We're seeing a bunch of .snapshot files being created under /home/spark/Snapshots, such as the following for example: CoarseGrainedExecutorBackend-2015-08-27-shutdown.snapshot CoarseGrainedExecutorBackend-2015-08-31-shutdown-1.snapshot SparkSubmit-2015-08-31-shutdown-1.snapshot

Re: Best practice for using singletons on workers (seems unanswered) ?

2015-07-08 Thread Dmitry Goldenberg
Richard, That's exactly the strategy I've been trying, which is a wrapper singleton class. But I was seeing the inner object being created multiple times. I wonder if the problem has to do with the way I'm processing the RDD's. I'm using JavaDStream to stream data (from Kafka). Then I'm

Re: Best practice for using singletons on workers (seems unanswered) ?

2015-07-08 Thread Dmitry Goldenberg
My singletons do in fact stick around. They're one per worker, looks like. So with 4 workers running on the box, we're creating one singleton per worker process/jvm, which seems OK. Still curious about foreachPartition vs. foreachRDD though... On Tue, Jul 7, 2015 at 11:27 AM, Richard Marscher
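
A common shape of the wrapper-singleton pattern discussed in this thread: a Scala object with a lazy val is initialized once per executor JVM, which matches the one-instance-per-worker-process behavior observed above. SomeHeavyClient is a hypothetical stand-in for a non-serializable client:

    // Sketch: per-JVM singleton wrapping an expensive, non-serializable resource.
    class SomeHeavyClient(url: String) { def send(doc: String): Unit = () }

    object ClientHolder {
      // Initialized lazily, once per executor process; never serialized from the driver.
      lazy val client: SomeHeavyClient = new SomeHeavyClient("search-host:8983")
    }

    // Usage inside the streaming job (the client is resolved on the executors):
    // dstream.foreachRDD { rdd =>
    //   rdd.foreachPartition { docs => docs.foreach(ClientHolder.client.send) }
    // }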

Re: foreachRDD vs. forearchPartition ?

2015-07-08 Thread Dmitry Goldenberg
These are quite different operations. One operates on RDDs in DStream and one operates on partitions of an RDD. They are not alternatives. Sean, different operations as they are, they can certainly be used on the same data set. In that sense, they are alternatives. Code can be written using one

Re: foreachRDD vs. forearchPartition ?

2015-07-08 Thread Dmitry Goldenberg
it 'works' to call foreachRDD on an RDD? @Dmitry are you asking about foreach vs foreachPartition? that's quite different. foreachPartition does not give more parallelism but lets you operate on a whole batch of data at once, which is nice if you need to allocate some expensive resource to do
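
A minimal sketch of how the two calls compose rather than substitute for each other: foreachRDD hands you each micro-batch RDD, and foreachPartition inside it lets you allocate one expensive resource per partition. createConnection is a hypothetical helper:

    import java.sql.Connection
    import org.apache.spark.streaming.dstream.DStream

    // Sketch: foreachRDD operates on each micro-batch RDD of the DStream;
    // foreachPartition operates on the partitions of that RDD. Used together,
    // one connection is opened per partition rather than per record.
    def writeOut(stream: DStream[String], createConnection: () => Connection): Unit = {
      stream.foreachRDD { rdd =>
        rdd.foreachPartition { records =>
          val conn = createConnection()        // one connection per partition
          try {
            records.foreach { r =>
              // ... write r via conn, e.g. through a prepared statement ...
            }
          } finally conn.close()
        }
      }
    }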

Re: foreachRDD vs. forearchPartition ?

2015-07-08 Thread Dmitry Goldenberg
Thanks, Cody. The good boy comment wasn't from me :) I was the one asking for help. On Wed, Jul 8, 2015 at 10:52 AM, Cody Koeninger c...@koeninger.org wrote: Sean already answered your question. foreachRDD and foreachPartition are completely different, there's nothing fuzzy or insufficient

Re: What is a best practice for passing environment variables to Spark workers?

2015-07-10 Thread Dmitry Goldenberg
Thanks, Akhil. We're trying the conf.setExecutorEnv() approach since we've already got environment variables set. For system properties we'd go the conf.set(spark.) route. We were concerned that doing the below type of thing did not work, which this blog post seems to confirm (
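
A sketch of the two propagation routes mentioned above; the variable and property names are placeholders:

    import org.apache.spark.SparkConf

    // Sketch: environment variables reach the executors via setExecutorEnv
    // (equivalently the spark.executorEnv.* keys); plain settings travel as
    // spark.* properties and are read back from SparkConf on the executors.
    val conf = new SparkConf()
      .setAppName("env-propagation")
      .setExecutorEnv("MY_ENV_VAR", "some-value")          // env var on executors
      .set("spark.executorEnv.MY_ENV_VAR", "some-value")   // same effect, as a property
      .set("spark.myapp.dictionary.path", "/data/dict")    // app-level setting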

Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-03 Thread Dmitry Goldenberg
"stuck" at 10 seconds. Any ideas? We're running Spark 1.3.0 built for Hadoop 2.4. Thanks. - Dmitry

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-04 Thread Dmitry Goldenberg
;Starting data partition processing. AppName={}, topic={}.)...", appName, topic); // ... iterate ... log.info("Finished data partition processing (appName={}, topic={}). Documents processed: {}.", appName, topic, docCount); } Any ideas? Thanks. - Dmitry On Thu, Sep 3, 20

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-08 Thread Dmitry Goldenberg
M, Tathagata Das <t...@databricks.com> wrote: > Why are you checkpointing the direct kafka stream? It serves no purpose. > > TD > > On Tue, Sep 8, 2015 at 9:35 AM, Dmitry Goldenberg < > dgoldenberg...@gmail.com> wrote: > >> I just disabled checkpointing in

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-08 Thread Dmitry Goldenberg
> > > On Tue, Sep 8, 2015 at 8:28 PM, Dmitry Goldenberg < > dgoldenberg...@gmail.com> wrote: > >> That is good to know. However, that doesn't change the problem I'm >> seeing. Which is that, even with that piece of code commented out >> (stream.checkpoint()), th

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-08 Thread Dmitry Goldenberg
; in each batch > > (which checkpoint is enabled using > streamingContext.checkpoint(checkpointDir)) and can recover from failure by > reading the exact same data back from Kafka. > > > TD > > On Tue, Sep 8, 2015 at 4:38 PM, Dmitry Goldenberg < > dgoldenberg...@gma

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-04 Thread Dmitry Goldenberg
t; On Fri, Sep 4, 2015 at 3:38 AM, Dmitry Goldenberg < > dgoldenberg...@gmail.com> wrote: > >> Tathagata, >> >> Checkpointing is turned on but we were not recovering. I'm looking at the >> logs now, feeding fresh content hours after the restart. Here's a snippet: >&

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-04 Thread Dmitry Goldenberg
jssc.start(); jssc.awaitTermination(); jssc.close(); On Fri, Sep 4, 2015 at 8:48 PM, Tathagata Das <t...@databricks.com> wrote: > Are you sure you are not accidentally recovering from checkpoint? How are > you using StreamingContext.getOrCreate() in your code? > > TD > > On F

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-04 Thread Dmitry Goldenberg
t;() { @Override public Void call(JavaRDD rdd) throws Exception { ProcessPartitionFunction func = new ProcessPartitionFunction(params); rdd.foreachPartition(func); return null; } }); return jssc; } On Fri, Sep 4, 2015 at 8:57 PM, Dmitry Goldenberg &

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-08 Thread Dmitry Goldenberg
eninger <c...@koeninger.org> wrote: > Well, I'm not sure why you're checkpointing messages. > > I'd also put in some logging to see what values are actually being read > out of your params object for the various settings. > > > On Tue, Sep 8, 2015 at 10:24 AM, Dmitry Goldenber

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-08 Thread Dmitry Goldenberg
logging to see what values are actually being read > out of your params object for the various settings. > > > On Tue, Sep 8, 2015 at 10:24 AM, Dmitry Goldenberg < > dgoldenberg...@gmail.com> wrote: > >> I've stopped the jobs, the workers, and the master. Deleted the con

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-08 Thread Dmitry Goldenberg
deleting or moving the contents of the checkpoint directory > and restarting the job? > > On Fri, Sep 4, 2015 at 8:02 PM, Dmitry Goldenberg < > dgoldenberg...@gmail.com> wrote: > >> Sorry, more relevant code below: >> >> SparkConf sparkConf = createSparkConf(a

Re: Using KafkaDirectStream, stopGracefully and exceptions

2015-09-10 Thread Dmitry Goldenberg
>> checkpoints can't be used between controlled restarts Is that true? If so, why? From my testing, checkpoints appear to be working fine, we get the data we've missed between the time the consumer went down and the time we brought it back up. >> If I cannot make checkpoints between code

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-10 Thread Dmitry Goldenberg
erval at the time of > recovery? Trying to understand your usecase. > > > On Wed, Sep 9, 2015 at 12:03 PM, Dmitry Goldenberg < > dgoldenberg...@gmail.com> wrote: > >> >> when you use getOrCreate, and there exists a valid checkpoint, it will >> always return the

A way to timeout and terminate a laggard 'Stage' ?

2015-09-14 Thread Dmitry Goldenberg
Is there a way in Spark to automatically terminate laggard "stages", ones that appear to be hanging? In other words, is there a timeout for processing of a given RDD? In the Spark GUI, I see the "kill" function for a given Stage under "Details for Job <...>". Is there something in Spark that

Re: A way to timeout and terminate a laggard 'Stage' ?

2015-09-15 Thread Dmitry Goldenberg
correct numbers before killing any >> job) >> >> Thanks >> Best Regards >> >> On Mon, Sep 14, 2015 at 10:40 PM, Dmitry Goldenberg < >> dgoldenberg...@gmail.com> wrote: >> >>> Is there a way in Spark to automatically terminate

A way to kill laggard jobs?

2015-09-11 Thread Dmitry Goldenberg
Is there a way to kill a laggard Spark job manually, and more importantly, is there a way to do it programmatically based on a configurable timeout value? Thanks.
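
Spark in those releases has, as far as I know, no built-in stage timeout, but a watchdog over a job group is a workable programmatic approximation; a sketch under that assumption, with runWithTimeout as a hypothetical helper:

    import java.util.concurrent.{Executors, TimeUnit}
    import org.apache.spark.SparkContext

    // Sketch: tag the work with a job group and cancel the group from a watchdog
    // thread if it runs past a configurable timeout; interruptOnCancel asks Spark
    // to interrupt the running tasks.
    def runWithTimeout(sc: SparkContext, timeoutSec: Long)(body: => Unit): Unit = {
      val groupId = "timed-job-" + System.currentTimeMillis()
      sc.setJobGroup(groupId, "job with watchdog timeout", interruptOnCancel = true)
      val watchdog = Executors.newSingleThreadScheduledExecutor()
      watchdog.schedule(new Runnable {
        def run(): Unit = sc.cancelJobGroup(groupId)
      }, timeoutSec, TimeUnit.SECONDS)
      try body finally { watchdog.shutdownNow(); sc.clearJobGroup() }
    }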

Re: Kafka error "partitions don't have a leader" / LeaderNotAvailableException

2015-09-29 Thread Dmitry Goldenberg
ith fewer partitions/replicas, see if it works. > > -adrian > > From: Dmitry Goldenberg > Date: Tuesday, September 29, 2015 at 3:37 PM > To: Adrian Tanase > Cc: "user@spark.apache.org" > Subject: Re: Kafka error "partitions don't have a leader" / > LeaderNo

Re: Kafka error "partitions don't have a leader" / LeaderNotAvailableException

2015-09-29 Thread Dmitry Goldenberg
> -adrian > > From: Dmitry Goldenberg > Date: Tuesday, September 29, 2015 at 3:26 PM > To: "user@spark.apache.org" > Subject: Kafka error "partitions don't have a leader" / > LeaderNotAvailableException > > I apologize for posting this Kafka related issue

Kafka error "partitions don't have a leader" / LeaderNotAvailableException

2015-09-29 Thread Dmitry Goldenberg
I apologize for posting this Kafka related issue into the Spark list. Have gotten no responses on the Kafka list and was hoping someone on this list could shed some light on the below. --- We're running into

Re: Kafka error "partitions don't have a leader" / LeaderNotAvailableException

2015-09-29 Thread Dmitry Goldenberg
, you > may want to look through the kafka jira, e.g. > > https://issues.apache.org/jira/browse/KAFKA-899 > > > On Tue, Sep 29, 2015 at 8:05 AM, Dmitry Goldenberg < > dgoldenberg...@gmail.com> wrote: > >> "more partitions and replicas than available brokers" --

Re: Kafka error "partitions don't have a leader" / LeaderNotAvailableException

2015-09-29 Thread Dmitry Goldenberg
you > may want to look through the kafka jira, e.g. > > https://issues.apache.org/jira/browse/KAFKA-899 > > > On Tue, Sep 29, 2015 at 8:05 AM, Dmitry Goldenberg < > dgoldenberg...@gmail.com> wrote: > >> "more partitions and replicas than available brokers"

Re: ThrowableSerializationWrapper: Task exception could not be deserialized / ClassNotFoundException: org.apache.solr.common.SolrException

2015-09-29 Thread Dmitry Goldenberg
release of Spark > command line for running Spark job > > Cheers > > On Tue, Sep 29, 2015 at 1:37 PM, Dmitry Goldenberg < > dgoldenberg...@gmail.com> wrote: > >> We're seeing this occasionally. Granted, this was caused by a wrinkle in >> the Solr schema but

ThrowableSerializationWrapper: Task exception could not be deserialized / ClassNotFoundException: org.apache.solr.common.SolrException

2015-09-29 Thread Dmitry Goldenberg
We're seeing this occasionally. Granted, this was caused by a wrinkle in the Solr schema but this bubbled up all the way in Spark and caused job failures. I just checked and SolrException class is actually in the consumer job jar we use. Is there any reason why Spark cannot find the

Re: ThrowableSerializationWrapper: Task exception could not be deserialized / ClassNotFoundException: org.apache.solr.common.SolrException

2015-09-29 Thread Dmitry Goldenberg
I'm actually not sure how either one of these would possibly cause Spark to find SolrException. Whether the driver or executor class path is first, should it not matter, if the class is in the consumer job jar? On Tue, Sep 29, 2015 at 9:12 PM, Dmitry Goldenberg <dgoldenberg...@gmail.

Re: ThrowableSerializationWrapper: Task exception could not be deserialized / ClassNotFoundException: org.apache.solr.common.SolrException

2015-09-29 Thread Dmitry Goldenberg
wrote: > Have you tried the following ? > --conf spark.driver.userClassPathFirst=true --conf spark.executor. > userClassPathFirst=true > > On Tue, Sep 29, 2015 at 4:38 PM, Dmitry Goldenberg < > dgoldenberg...@gmail.com> wrote: > >> Release of Spar

Re: How to tell Spark not to use /tmp for snappy-unknown-***-libsnappyjava.so

2015-09-30 Thread Dmitry Goldenberg
Thanks, Ted, will try it out. On Wed, Sep 30, 2015 at 9:07 AM, Ted Yu <yuzhih...@gmail.com> wrote: > See the tail of this: > https://bugzilla.redhat.com/show_bug.cgi?id=1005811 > > FYI > > > On Sep 30, 2015, at 5:54 AM, Dmitry Goldenberg <dgoldenberg...@gmail.com>

How to tell Spark not to use /tmp for snappy-unknown-***-libsnappyjava.so

2015-09-30 Thread Dmitry Goldenberg
Is there a way to ensure Spark doesn't write to /tmp directory? We've got spark.local.dir specified in the spark-defaults.conf file to point at another directory. But we're seeing many of these snappy-unknown-***-libsnappyjava.so files being written to /tmp still. Is there a config setting or
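
The native snappy library is extracted according to the JVM temp dir (or snappy-java's own property), not spark.local.dir, so the usual fix, per the Red Hat ticket cited in the reply, is to point those elsewhere. A sketch with placeholder paths; note the driver-side option normally has to be set before the driver JVM starts (spark-defaults.conf or spark-submit):

    import org.apache.spark.SparkConf

    // Sketch: spark.local.dir does not control where libsnappyjava.so is extracted;
    // that follows -Djava.io.tmpdir (or -Dorg.xerial.snappy.tempdir). Paths are placeholders.
    val conf = new SparkConf()
      .set("spark.local.dir", "/data/spark-local")
      .set("spark.executor.extraJavaOptions",
           "-Djava.io.tmpdir=/data/tmp -Dorg.xerial.snappy.tempdir=/data/tmp")
      .set("spark.driver.extraJavaOptions",
           "-Djava.io.tmpdir=/data/tmp -Dorg.xerial.snappy.tempdir=/data/tmp")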

Re: ThrowableSerializationWrapper: Task exception could not be deserialized / ClassNotFoundException: org.apache.solr.common.SolrException

2015-09-30 Thread Dmitry Goldenberg
f protobuf jar is loaded ahead of hbase-protocol.jar, things start to get > interesting ... > > On Tue, Sep 29, 2015 at 6:12 PM, Dmitry Goldenberg < > dgoldenberg...@gmail.com> wrote: > >> Ted, I think I have tried these settings with the hbase protocol jar, to

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-09 Thread Dmitry Goldenberg
inting to override its checkpoint duration millis, is there? Is the default there max(batchdurationmillis, 10seconds)? Is there a way to override this? Thanks. On Wed, Sep 9, 2015 at 2:44 PM, Tathagata Das <t...@databricks.com> wrote: > > > See inline. > > On Tue, Sep

Re: Broadcast var is null

2015-10-05 Thread Dmitry Pristin
straight from the dev guide >> val broadcastVar = ssc.sparkContext.broadcast(Array(1, 2, 3)) >> >> //Try to print out the value of the broadcast var here >> val transformed = events.transform(rdd => { >> rdd.map(x => { >> if(broadcastVar == null) { >>

[no subject]

2015-11-26 Thread Dmitry Tolpeko

Re: Best practices for sharing/maintaining large resource files for Spark jobs

2016-01-12 Thread Dmitry Goldenberg
I'd guess that if the resources are broadcast Spark would put them into Tachyon... > On Jan 12, 2016, at 7:04 AM, Dmitry Goldenberg <dgoldenberg...@gmail.com> > wrote: > > Would it make sense to load them into Tachyon and read and broadcast them > from there since Tach

Re: Best practices for sharing/maintaining large resource files for Spark jobs

2016-01-12 Thread Dmitry Goldenberg
Jorn, you said Ignite or ... ? What was the second choice you were thinking of? It seems that got omitted. > On Jan 12, 2016, at 2:44 AM, Jörn Franke <jornfra...@gmail.com> wrote: > > You can look at ignite as a HDFS cache or for storing rdds. > >> On 11 Jan 2016, at

Re: Best practices for sharing/maintaining large resource files for Spark jobs

2016-01-12 Thread Dmitry Goldenberg
nthan.com> wrote: > > One option could be to store them as blobs in a cache like Redis and then > read + broadcast them from the driver. Or you could store them in HDFS and > read + broadcast from the driver. > > Regards > Sab > >> On Tue, Jan 12, 2016 at 1

Re: Best practices for sharing/maintaining large resource files for Spark jobs

2016-01-12 Thread Dmitry Goldenberg
ng <gene.p...@gmail.com> wrote: > Hi Dmitry, > > Yes, Tachyon can help with your use case. You can read and write to > Tachyon via the filesystem api ( > http://tachyon-project.org/documentation/File-System-API.html). There is > a native Java API as well as a Hadoop-compatible API

Re: [discuss] dropping Python 2.6 support

2016-01-10 Thread Dmitry Kniazev
in the same environment. For example, we use virtualenv to run Spark with Python 2.7 and do not touch system Python 2.6. Thank you, Dmitry 09.01.2016, 06:36, "Sasha Kacanski" <skacan...@gmail.com>: > +1 > Companies that use stock python in redhat 2.6 will need to upgrade or in
