amples/*jars*
> /spark-examples_2.11-2.3.0.jar.
>
> On Tue, Apr 10, 2018 at 1:34 AM, Dmitry <frostb...@gmail.com> wrote:
>
Hello, I spent a lot of time trying to find what I did wrong, but found nothing.
I have a minikube Windows-based cluster (Hyper-V as the hypervisor) and am trying
to run examples against Spark 2.3. Tried several Docker image builds:
* several builds that I built myself
* andrusha/spark-k8s:2.3.0-hadoop2.7 from
Hi all, I am trying to make a 3.0.1 distribution with Spark 3 using
./dev/make-distribution.sh --name spark3-hive12 --pip --tgz -Phive-1.2
-Phadoop-2.7 -Pyarn
The problem is that Maven can't find the right profile for Hive, and the build ends
without the Hive jars
++ /Users/reireirei/spark/spark/build/mvn
Thanks, Cody. Yes, I originally started off by looking at that but I get a
compile error if I try to use that approach: constructor JdbcRDD in class
JdbcRDD<T> cannot be applied to given types. Not to mention that
JavaJdbcRDDSuite somehow manages to not pass in the class tag (the last
argument).
/apache/spark/JavaJdbcRDDSuite.java#L90
is calling a static method JdbcRDD.create, not new JdbcRDD. Is that what
you tried doing?
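Roughly, the Java usage looks like this (just a sketch, assuming a hypothetical PEOPLE table in an in-memory H2 database; the SQL needs exactly two '?' placeholders that get bound to the per-partition ID bounds):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.rdd.JdbcRDD;

public class JdbcRddExample {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext(
        new SparkConf().setAppName("JdbcRddExample").setMaster("local[2]"));

    // ConnectionFactory is the interface nested inside JdbcRDD (not a Scala Function0).
    JdbcRDD.ConnectionFactory connFactory = new JdbcRDD.ConnectionFactory() {
      @Override
      public Connection getConnection() throws Exception {
        return DriverManager.getConnection("jdbc:h2:mem:testdb"); // placeholder JDBC URL
      }
    };

    // The two '?' placeholders are bound to the lower/upper bounds so that each of the
    // numPartitions partitions reads its own slice of the ID range.
    JavaRDD<String> names = JdbcRDD.create(
        sc,
        connFactory,
        "SELECT NAME FROM PEOPLE WHERE ID >= ? AND ID <= ?",
        1L, 1000L, 3,
        new Function<ResultSet, String>() {
          @Override
          public String call(ResultSet rs) throws Exception {
            return rs.getString(1);
          }
        });

    System.out.println("Rows: " + names.count());
    sc.stop();
  }
}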
On Wed, Feb 18, 2015 at 12:00 PM, Dmitry Goldenberg
dgoldenberg...@gmail.com wrote:
Thanks, Cody. Yes, I originally started off by looking at that but I get
I'm not sure what "on the driver" means, but I've tried
setting spark.files.userClassPathFirst to true,
in $SPARK_HOME/conf/spark-defaults.conf and also in the SparkConf
programmatically; it appears to be ignored. The solution was to follow
Emre's recommendation and downgrade the selected Solrj
...@koeninger.org wrote:
Is sc there a SparkContext or a JavaSparkContext? The compilation error
seems to indicate the former, but JdbcRDD.create expects the latter
On Wed, Feb 18, 2015 at 12:30 PM, Dmitry Goldenberg
dgoldenberg...@gmail.com wrote:
I have tried that as well, I get a compile
other options? Please point me to any drawbacks.
Thanks,
Dmitry
feedback.
Dmitry
On Fri, Feb 13, 2015 at 11:54 AM, Dmitry Tolpeko dmtolp...@gmail.com
wrote:
Hello,
To convert existing MapReduce jobs to Spark, I need to implement window
functions such as FIRST_VALUE, LEAD, LAG and so on. For example, the
FIRST_VALUE function:
Source (1st column is key):
A, A1
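One way I can imagine emulating FIRST_VALUE with plain RDD operations (a sketch only; the (key, (ordering, value)) layout below is assumed, since my example above got cut off):

import java.util.Arrays;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class FirstValueExample {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext(
        new SparkConf().setAppName("FirstValueExample").setMaster("local[2]"));

    // (key, (ordering column, value)): the ordering column stands in for whatever
    // defines "first" within each key in the original MapReduce job.
    List<Tuple2<String, Tuple2<Integer, String>>> data = Arrays.asList(
        new Tuple2<>("A", new Tuple2<>(1, "A1")),
        new Tuple2<>("A", new Tuple2<>(2, "A2")),
        new Tuple2<>("B", new Tuple2<>(1, "B1")));
    JavaPairRDD<String, Tuple2<Integer, String>> rows = sc.parallelizePairs(data);

    // FIRST_VALUE per key: keep the row with the smallest ordering column, then drop it.
    JavaPairRDD<String, String> firstValues = rows
        .reduceByKey((a, b) -> a._1() <= b._1() ? a : b)
        .mapValues(t -> t._2());

    firstValues.collect().forEach(t -> System.out.println(t._1() + " -> " + t._2()));
    sc.stop();
  }
}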
Are you proposing I downgrade Solrj's httpclient dependency to be on par with
that of Spark/Hadoop? Or upgrade Spark/Hadoop's httpclient to the latest?
Solrj has to stay with its selected version. I could try and rebuild Spark with
the latest httpclient but I've no idea what effects that may
, Emre Sevinc emre.sev...@gmail.com wrote:
Hello Dmitry,
I had almost the same problem and solved it by using version 4.0.0 of
SolrJ:
<dependency>
  <groupId>org.apache.solr</groupId>
  <artifactId>solr-solrj</artifactId>
  <version>4.0.0</version>
</dependency>
In my case, I was lucky
Thanks, Emre! Will definitely try this.
On Wed, Feb 18, 2015 at 11:00 AM, Emre Sevinc emre.sev...@gmail.com wrote:
On Wed, Feb 18, 2015 at 4:54 PM, Dmitry Goldenberg
dgoldenberg...@gmail.com wrote:
Thank you, Emre. It seems solrj still depends on HttpClient 4.1.3; would
that not collide
I think I'm going to have to rebuild Spark with commons.httpclient.version
set to 4.3.1 which looks to be the version chosen by Solrj, rather than the
4.2.6 that Spark's pom mentions. Might work.
On Wed, Feb 18, 2015 at 1:37 AM, Arush Kharbanda ar...@sigmoidanalytics.com
wrote:
Hi
Did you implement the
org.apache.spark.api.java.function.Function
interface and pass an instance of that to JdbcRDD.create?
On Wed, Feb 18, 2015 at 3:48 PM, Dmitry Goldenberg
dgoldenberg...@gmail.com wrote:
Cody, you were right, I had a copy and paste snag where I ended up with a
vanilla SparkContext rather than a Java one. I also had
, ConnectionFactory is an interface defined inside JdbcRDD, not a
Scala Function0
On Wed, Feb 18, 2015 at 4:50 PM, Dmitry Goldenberg
dgoldenberg...@gmail.com wrote:
That's exactly what I was doing. However, I ran into runtime issues with
doing that. For instance, I had a
public class DbConnection
. Of
course, a numeric primary key is going to be the most efficient way to do
that.
On Thu, Feb 19, 2015 at 8:57 AM, Dmitry Goldenberg
dgoldenberg...@gmail.com wrote:
Yup, I did see that. Good point though, Cody. The mismatch was happening
for me when I was trying to get the 'new JdbcRDD
On Mar 18, 2015, at 4:47 PM, Dmitry Goldenberg
dgoldenberg...@gmail.com wrote:
Thanks, Ted. Well, so far even there I'm only seeing my post and not,
for example, your response.
On Wed, Mar 18, 2015 at 7:28 PM, Ted Yu
, 2015 at 10:49 AM Dmitry Goldenberg
dgoldenberg...@gmail.com wrote:
It seems that those archives are not necessarily easy to find stuff in.
Is there a search engine on top of them? so as to find e.g. your own posts
easily?
On Thu, Mar 19, 2015 at 10:34 AM, Nicholas Chammas
nicholas.cham
Yes, and Kafka topics are basically queues. So perhaps what's needed is just
KafkaRDD with starting offset being 0 and finish offset being a very large
number...
Sent from my iPhone
On Apr 29, 2015, at 1:52 AM, ayan guha guha.a...@gmail.com wrote:
I guess what you mean is not streaming.
Part of the issue is that when you read messages from a topic, the messages are
peeked, not polled, so there is no notion of "when the queue is empty", as I
understand it.
So it would seem I'd want to do KafkaUtils.createRDD, which takes an array
of OffsetRange's. Each OffsetRange is characterized by topic,
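For what it's worth, a rough sketch of that (assuming the Spark 1.3-era spark-streaming-kafka API; the broker address and the topic "events" are placeholders, and in practice the until-offset would come from querying Kafka rather than a hard-coded large number):

import java.util.HashMap;
import java.util.Map;
import kafka.serializer.StringDecoder;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.streaming.kafka.KafkaUtils;
import org.apache.spark.streaming.kafka.OffsetRange;

public class KafkaBatchRead {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext(
        new SparkConf().setAppName("KafkaBatchRead").setMaster("local[2]"));

    Map<String, String> kafkaParams = new HashMap<>();
    kafkaParams.put("metadata.broker.list", "localhost:9092");

    // One OffsetRange per topic-partition: (topic, partition, fromOffset, untilOffset).
    OffsetRange[] ranges = {
        OffsetRange.create("events", 0, 0L, 1000000L)
    };

    JavaPairRDD<String, String> rdd = KafkaUtils.createRDD(
        sc,
        String.class, String.class,
        StringDecoder.class, StringDecoder.class,
        kafkaParams,
        ranges);

    System.out.println("Messages read: " + rdd.count());
    sc.stop();
  }
}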
. You can then kill the job after the first batch. It's
possible you may be able to kill the job from a
StreamingListener.onBatchCompleted, but I've never tried and don't know
what the consequences may be.
On Wed, Apr 29, 2015 at 8:52 AM, Dmitry Goldenberg
dgoldenberg...@gmail.com wrote
Sean,
How does this model actually work? Let's say we want to run one job as N
threads executing one particular task, e.g. streaming data out of Kafka
into a search engine. How do we configure our Spark job execution?
Right now, I'm seeing this job running as a single thread. And it's quite a
RDD is determined by the block
interval and the batch interval. If you have a batch interval of 10s
and block interval of 1s you'll get 10 partitions of data in the RDD.
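Concretely, something like this (a sketch; the values mirror the example above, and spark.streaming.blockInterval is expressed in milliseconds in Spark 1.x):

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class BlockIntervalExample {
  public static void main(String[] args) {
    // With a 10s batch interval and a 1s block interval, a receiver-based stream
    // produces roughly 10 blocks per batch, i.e. ~10 partitions in each batch RDD.
    SparkConf conf = new SparkConf()
        .setAppName("BlockIntervalExample")
        .setMaster("local[4]")
        .set("spark.streaming.blockInterval", "1000"); // 1 second, in milliseconds

    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));
    // ... receiver-based input streams and output operations would be defined here ...
  }
}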
On Mon, May 11, 2015 at 10:29 PM, Dmitry Goldenberg
dgoldenberg...@gmail.com wrote:
Understood. We'll use the multi-threaded
then. If you run code on your driver,
it's not distributed. If you run Spark over an RDD with 1 partition,
only one task works on it.
On Mon, May 11, 2015 at 10:16 PM, Dmitry Goldenberg
dgoldenberg...@gmail.com wrote:
Sean,
Understood. We'll use the multi-threaded code we already have..
How are these execution slots filled up? I
Thanks, Akhil. It looks like in the second example, for Rabbit they're
doing this: https://www.rabbitmq.com/mqtt.html.
On Tue, May 12, 2015 at 7:37 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:
I found two examples Java version
Thanks, Akhil. So what do folks typically do to increase/contract the capacity?
Do you plug in some cluster auto-scaling solution to make this elastic?
Does Spark have any hooks for instrumenting auto-scaling?
In other words, how do you avoid overwhelming the receivers in a scenario when
your
Thanks so much, Yiannis, Olivier, Huang!
On Thu, Jun 4, 2015 at 6:44 PM, Yiannis Gkoufas johngou...@gmail.com
wrote:
Hi there,
I would recommend checking out
https://github.com/spark-jobserver/spark-jobserver which I think gives
the functionality you are looking for.
I haven't tested it
Dmitry's initial question – you can load large data sets as batch
(static) RDDs from any Spark Streaming app and then join DStream RDDs
against them to emulate "lookups"; you can also try the "Lookup RDD" –
there is a git hub project
*From:* Dmitry Goldenberg [mailto:dgoldenberg...@gmail.com]
*Sent
” resources /
nodes which allows you to double, triple etc in the background/parallel the
resources of the currently running cluster
I was thinking more about the scenario where you have e.g. 100 boxes and
want to / can add e.g. 20 more
*From:* Dmitry Goldenberg [mailto:dgoldenberg
Eftimov evo.efti...@isecc.com wrote:
Dmitry was concerned about the “serialization cost” NOT the “memory
footprint – hence option a) is still viable since a Broadcast is performed
only ONCE for the lifetime of Driver instance
*From:* Ted Yu [mailto:yuzhih...@gmail.com]
*Sent:* Wednesday, June 3
, spark streaming (spark) will NOT resort to disk –
and of course resorting to disk from time to time (ie when there is no free
RAM ) and taking a performance hit from that, BUT only until there is no
free RAM
*From:* Dmitry Goldenberg [mailto:dgoldenberg...@gmail.com]
*Sent:* Thursday, May 28
in
awaitTermination. So what would be a way to trigger the termination in the
driver?
context.awaitTermination() allows the current thread to wait for the
termination of a context by stop() or by an exception - presumably, we
need to call stop() somewhere or perhaps throw.
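One shape this could take (a sketch; shouldShutDown() is a hypothetical check against whatever external signal the driver watches):

// Assumes an already-configured JavaStreamingContext jssc.
jssc.start();
boolean terminated = false;
while (!terminated) {
  // Returns true once the context has terminated (via stop() or an exception),
  // false if the timeout elapsed first.
  terminated = jssc.awaitTerminationOrTimeout(10000);
  if (!terminated && shouldShutDown()) {
    // Graceful stop: finish processing received data, then stop the SparkContext too.
    jssc.stop(true, true);
  }
}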
Cheers,
- Dmitry
On Thu, Jun
unless you want it to.
If you want it, just call cache.
On Thu, Jun 4, 2015 at 8:20 AM, Dmitry Goldenberg
dgoldenberg...@gmail.com wrote:
set the storage policy for the DStream RDDs to MEMORY_AND_DISK - it
appears the storage level can be specified in the createStream methods
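e.g. along these lines (a sketch against the receiver-based Kafka API; the ZooKeeper address, group id and topic are placeholders, and an existing JavaStreamingContext jssc is assumed):

import java.util.Collections;
import java.util.Map;
import org.apache.spark.storage.StorageLevel;
import org.apache.spark.streaming.api.java.JavaPairReceiverInputDStream;
import org.apache.spark.streaming.kafka.KafkaUtils;

// Topic name -> number of receiver threads.
Map<String, Integer> topics = Collections.singletonMap("events", 1);

// The last argument is the storage level for the received blocks.
JavaPairReceiverInputDStream<String, String> stream = KafkaUtils.createStream(
    jssc,
    "localhost:2181",        // ZooKeeper quorum (placeholder)
    "my-consumer-group",     // consumer group id (placeholder)
    topics,
    StorageLevel.MEMORY_AND_DISK_SER());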
Great, thank you, Silvio. In your experience, is there any way to instrument
a callback into Coda Hale or the Spark consumers from the metrics sink? If
the sink performs some steps once it has received the metrics, I'd like to
be able to make the consumers aware of that via some sort of a
?
On Thu, Jun 11, 2015 at 7:30 AM, Cody Koeninger c...@koeninger.org
wrote:
Depends on what you're reusing multiple times (if anything).
Read
http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence
On Wed, Jun 10, 2015 at 12:18 AM, Dmitry Goldenberg
dgoldenberg
Date:2015/05/28 13:22 (GMT+00:00)
To: Dmitry Goldenberg
Cc: Gerard Maas ,spark users
Subject: Re: Autoscaling Spark cluster based on topic sizes/rate of growth
in Kafka or Spark's metrics?
You can always spin new boxes in the background and bring them into the
cluster fold when fully
*From:* Dmitry Goldenberg [mailto:dgoldenberg...@gmail.com]
*Sent:* Thursday, May 28, 2015 2:34 PM
*To:* Evo Eftimov
*Cc:* Gerard Maas; spark users
*Subject:* Re: FW: Re
Thank you, Tathagata, Cody, Otis.
- Dmitry
On Mon, Jun 1, 2015 at 6:57 PM, Otis Gospodnetic otis.gospodne...@gmail.com
wrote:
I think you can use SPM - http://sematext.com/spm - it will give you all
Spark and all Kafka metrics, including offsets broken down by topic, etc.
out of the box
Got it, thank you, Tathagata and Ted.
Could you comment on my other question
http://apache-spark-user-list.1001560.n3.nabble.com/Autoscaling-Spark-cluster-based-on-topic-sizes-rate-of-growth-in-Kafka-or-Spark-s-metrics-tt23062.html
as well? Basically, I'm trying to get a handle on a good
then are, a) will Spark sense the addition of a new node / is it sufficient
that the cluster manager is aware, then work just starts flowing there?
and b) what would be a way to gracefully remove a worker node when the
load subsides, so that no currently running Spark job is killed?
- Dmitry
On Thu, May 28
Thanks, Evo. Per the last part of your comment, it sounds like we will
need to implement a job manager which will be in control of starting the
jobs, monitoring the status of the Kafka topic(s), shutting jobs down and
marking them as ones to relaunch, scaling the cluster up/down by
it can
help.
Thanks,
Dmitry
On Sat, May 23, 2015 at 1:22 AM, Brant Seibert brantseib...@hotmail.com
wrote:
Hi, The healthcare industry can do wonderful things with Apache Spark. But
there is already a very large base of data and applications firmly rooted in
the relational paradigm
wouldn't always be relevant anyway.
On Fri, Jul 31, 2015 at 6:16 PM, Dmitry Goldenberg
dgoldenberg...@gmail.com wrote:
I've instrumented checkpointing per the programming guide and I can tell
that Spark Streaming is creating the checkpoint directories but I'm not
seeing any content being created in those directories nor am I seeing the
effects I'd expect from checkpointing. I'd expect any data that comes into
https://spark.apache.org/docs/latest/streaming-programming-guide.html
suggests it was intended to be a multiple of the batch interval. The
slide duration wouldn't always be relevant anyway.
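i.e. roughly this (a sketch; an existing JavaStreamingContext jssc and DStream stream are assumed, and the 10s/40s values are arbitrary):

// Metadata checkpointing: where checkpoint data is written.
jssc.checkpoint("/path/to/checkpoint/dir");

// Data checkpointing on a stream: the interval is typically a multiple of the
// batch interval, e.g. 4 x a 10-second batch = 40 seconds.
stream.checkpoint(Durations.seconds(40));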
We're getting the below error. Tried increasing spark.executor.memory e.g.
from 1g to 2g but the below error still happens.
Any recommendations? Something to do with specifying -Xmx in the submit job
scripts?
Thanks.
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
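In case it helps anyone searching later: executor memory can be set on the SparkConf or via spark-submit's --executor-memory, but the driver's own heap has to be set before its JVM starts (e.g. --driver-memory on spark-submit), so setting it programmatically is too late. A sketch (the values are arbitrary):

import org.apache.spark.SparkConf;

SparkConf conf = new SparkConf()
    .setAppName("MemoryTuningExample")
    // Executor heap, equivalent to spark-submit --executor-memory 2g
    .set("spark.executor.memory", "2g")
    // Extra executor JVM flags if GC itself needs tuning
    .set("spark.executor.extraJavaOptions", "-XX:+UseG1GC");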
a checkpoint whether we can estimate the
size of the checkpoint and compare with Runtime.getRuntime().freeMemory().
If the size of checkpoint is much bigger than free memory, log warning, etc
Cheers
On Mon, Aug 10, 2015 at 9:34 AM, Dmitry Goldenberg
dgoldenberg...@gmail.com wrote:
Thanks, Cody
, Aug 10, 2015 at 1:07 PM, Cody Koeninger c...@koeninger.org wrote:
You need to keep a certain number of rdds around for checkpointing, based
on e.g. the window size. Those would all need to be loaded at once.
On Mon, Aug 10, 2015 at 11:49 AM, Dmitry Goldenberg
dgoldenberg...@gmail.com wrote
. Hefty file system usage, hefty
memory consumption... What can we do to offset some of these costs?
On Mon, Aug 10, 2015 at 4:27 PM, Cody Koeninger c...@koeninger.org wrote:
The rdd is indeed defined by mostly just the offsets / topic partitions.
On Mon, Aug 10, 2015 at 3:24 PM, Dmitry
.
It sort of seems wrong though since
https://spark.apache.org/docs/latest/streaming-programming-guide.html
suggests it was intended to be a multiple of the batch interval. The
slide duration wouldn't always be relevant anyway.
On Fri, Jul 31, 2015 at 6:16 PM, Dmitry Goldenberg
dgoldenberg
could
terminate as the last batch is being processed...
On Fri, Aug 14, 2015 at 6:17 PM, Cody Koeninger c...@koeninger.org wrote:
You'll resume and re-process the rdd that didn't finish
On Fri, Aug 14, 2015 at 1:31 PM, Dmitry Goldenberg
dgoldenberg...@gmail.com wrote:
Our additional question
N/m, these are just profiling snapshots :) Sorry for the wide distribution.
On Tue, Nov 10, 2015 at 9:46 AM, Dmitry Goldenberg <dgoldenberg...@gmail.com
> wrote:
We're seeing a bunch of .snapshot files being created under
/home/spark/Snapshots, such as the following for example:
CoarseGrainedExecutorBackend-2015-08-27-shutdown.snapshot
CoarseGrainedExecutorBackend-2015-08-31-shutdown-1.snapshot
SparkSubmit-2015-08-31-shutdown-1.snapshot
Richard,
That's exactly the strategy I've been trying, which is a wrapper singleton
class. But I was seeing the inner object being created multiple times.
I wonder if the problem has to do with the way I'm processing the RDD's.
I'm using JavaDStream to stream data (from Kafka). Then I'm
My singletons do in fact stick around. They're one per worker, looks like.
So with 4 workers running on the box, we're creating one singleton per
worker process/jvm, which seems OK.
Still curious about foreachPartition vs. foreachRDD though...
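For context, the pattern in question looks roughly like this (a sketch; the element type String is assumed, and SolrClientHolder stands in for whatever hypothetical per-JVM singleton the worker holds):

import java.util.Iterator;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.VoidFunction;

// foreachRDD runs once per batch RDD of the DStream; foreachPartition then lets
// each task reuse one expensive resource for a whole partition of that RDD.
messages.foreachRDD(new Function<JavaRDD<String>, Void>() {
  @Override
  public Void call(JavaRDD<String> rdd) throws Exception {
    rdd.foreachPartition(new VoidFunction<Iterator<String>>() {
      @Override
      public void call(Iterator<String> partition) throws Exception {
        // Hypothetical singleton, created once per executor JVM and reused across batches.
        SolrClientHolder client = SolrClientHolder.getInstance();
        while (partition.hasNext()) {
          client.index(partition.next());
        }
      }
    });
    return null;
  }
});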
On Tue, Jul 7, 2015 at 11:27 AM, Richard Marscher
These are quite different operations. One operates on RDDs in DStream and
one operates on partitions of an RDD. They are not alternatives.
Sean, different operations as they are, they can certainly be used on the
same data set. In that sense, they are alternatives. Code can be written
using one
it
'works' to call foreachRDD on an RDD?
@Dmitry are you asking about foreach vs foreachPartition? that's quite
different. foreachPartition does not give more parallelism but lets
you operate on a whole batch of data at once, which is nice if you
need to allocate some expensive resource to do
Thanks, Cody. The good boy comment wasn't from me :) I was the one
asking for help.
On Wed, Jul 8, 2015 at 10:52 AM, Cody Koeninger c...@koeninger.org wrote:
Sean already answered your question. foreachRDD and foreachPartition are
completely different, there's nothing fuzzy or insufficient
Thanks, Akhil.
We're trying the conf.setExecutorEnv() approach since we've already got
environment variables set. For system properties we'd go the
conf.set("spark.*") route.
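In other words, something like this (a sketch; the variable and property names are made up):

import org.apache.spark.SparkConf;

SparkConf conf = new SparkConf()
    .setAppName("ConfigExample")
    // Environment variable made visible to the executor processes
    .setExecutorEnv("MY_SERVICE_URL", "http://localhost:8983/solr")
    // Spark properties go through conf.set("spark.*", ...)
    .set("spark.executor.memory", "2g")
    // Arbitrary JVM system properties for executors can ride on the extra Java options
    .set("spark.executor.extraJavaOptions", "-Dmy.custom.prop=value");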
We were concerned that doing the below type of thing did not work, which
this blog post seems to confirm (
"stuck" at 10 seconds.
Any ideas? We're running Spark 1.3.0 built for Hadoop 2.4.
Thanks.
- Dmitry
log.info("Starting data partition processing (appName={}, topic={})...",
    appName, topic);
// ... iterate over the partition ...
log.info("Finished data partition processing (appName={}, topic={}). " +
    "Documents processed: {}.", appName, topic, docCount);
}
Any ideas? Thanks.
- Dmitry
On Thu, Sep 3, 20
M, Tathagata Das <t...@databricks.com> wrote:
> Why are you checkpointing the direct kafka stream? It serves no purpose.
>
> TD
>
> On Tue, Sep 8, 2015 at 9:35 AM, Dmitry Goldenberg <
> dgoldenberg...@gmail.com> wrote:
>
>> I just disabled checkpointing in
>
> On Tue, Sep 8, 2015 at 8:28 PM, Dmitry Goldenberg <
> dgoldenberg...@gmail.com> wrote:
>
>> That is good to know. However, that doesn't change the problem I'm
>> seeing, which is that, even with that piece of code commented out
>> (stream.checkpoint()), th
in each batch
>
> (for which checkpointing is enabled using
> streamingContext.checkpoint(checkpointDir)) and can recover from failure by
> reading the exact same data back from Kafka.
>
>
> TD
>
> On Tue, Sep 8, 2015 at 4:38 PM, Dmitry Goldenberg <
> dgoldenberg...@gma
> On Fri, Sep 4, 2015 at 3:38 AM, Dmitry Goldenberg <
> dgoldenberg...@gmail.com> wrote:
>
>> Tathagata,
>>
>> Checkpointing is turned on but we were not recovering. I'm looking at the
>> logs now, feeding fresh content hours after the restart. Here's a snippet:
>
jssc.start();
jssc.awaitTermination();
jssc.close();
On Fri, Sep 4, 2015 at 8:48 PM, Tathagata Das <t...@databricks.com> wrote:
> Are you sure you are not accidentally recovering from checkpoint? How are
> you using StreamingContext.getOrCreate() in your code?
>
> TD
>
> On F
new Function<JavaRDD, Void>() {   // reconstructed; the original first line was garbled
    @Override
    public Void call(JavaRDD rdd) throws Exception {
        ProcessPartitionFunction func = new ProcessPartitionFunction(params);
        rdd.foreachPartition(func);
        return null;
    }
});
return jssc;
}
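For completeness, the surrounding getOrCreate usage has this shape (a sketch against the Spark 1.x Java API; createContext() is the hypothetical method whose tail is shown above, and the checkpoint directory is a placeholder):

import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.api.java.JavaStreamingContextFactory;

final String checkpointDir = "/path/to/checkpoint/dir";

JavaStreamingContextFactory factory = new JavaStreamingContextFactory() {
  @Override
  public JavaStreamingContext create() {
    // Builds a brand-new context: SparkConf, streams, output operations and
    // jssc.checkpoint(checkpointDir) all belong in here.
    return createContext(checkpointDir);
  }
};

// If a valid checkpoint exists under checkpointDir, the context and its DStream
// lineage are recovered from it and the factory is NOT invoked; otherwise the
// factory builds a fresh context.
JavaStreamingContext jssc = JavaStreamingContext.getOrCreate(checkpointDir, factory);
jssc.start();
jssc.awaitTermination();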
On Fri, Sep 4, 2015 at 8:57 PM, Dmitry Goldenberg &
eninger <c...@koeninger.org> wrote:
> Well, I'm not sure why you're checkpointing messages.
>
> I'd also put in some logging to see what values are actually being read
> out of your params object for the various settings.
>
>
> On Tue, Sep 8, 2015 at 10:24 AM, Dmitry Goldenber
>
>
> On Tue, Sep 8, 2015 at 10:24 AM, Dmitry Goldenberg <
> dgoldenberg...@gmail.com> wrote:
>
>> I've stopped the jobs, the workers, and the master. Deleted the con
deleting or moving the contents of the checkpoint directory
> and restarting the job?
>
> On Fri, Sep 4, 2015 at 8:02 PM, Dmitry Goldenberg <
> dgoldenberg...@gmail.com> wrote:
>
>> Sorry, more relevant code below:
>>
>> SparkConf sparkConf = createSparkConf(a
>> checkpoints can't be used between controlled restarts
Is that true? If so, why? From my testing, checkpoints appear to be working
fine; we get the data we've missed between the time the consumer went down
and the time we brought it back up.
>> If I cannot make checkpoints between code
erval at the time of
> recovery? Trying to understand your use case.
>
>
> On Wed, Sep 9, 2015 at 12:03 PM, Dmitry Goldenberg <
> dgoldenberg...@gmail.com> wrote:
>
>> >> when you use getOrCreate, and there exists a valid checkpoint, it will
>> always return the
Is there a way in Spark to automatically terminate laggard "stages", ones
that appear to be hanging? In other words, is there a timeout for
processing of a given RDD?
In the Spark GUI, I see the "kill" function for a given Stage under
"Details for Job <...>".
Is there something in Spark that
correct numbers before killing any
>> job)
>>
>> Thanks
>> Best Regards
>>
>> On Mon, Sep 14, 2015 at 10:40 PM, Dmitry Goldenberg <
>> dgoldenberg...@gmail.com> wrote:
>>
>>> Is there a way in Spark to automatically terminate
Is there a way to kill a laggard Spark job manually, and more importantly,
is there a way to do it programmatically based on a configurable timeout
value?
Thanks.
with fewer partitions/replicas, see if it works.
>
> -adrian
>
> From: Dmitry Goldenberg
> Date: Tuesday, September 29, 2015 at 3:37 PM
> To: Adrian Tanase
> Cc: "user@spark.apache.org"
> Subject: Re: Kafka error "partitions don't have a leader" /
> LeaderNo
> -adrian
>
> From: Dmitry Goldenberg
> Date: Tuesday, September 29, 2015 at 3:26 PM
> To: "user@spark.apache.org"
> Subject: Kafka error "partitions don't have a leader" /
> LeaderNotAvailableException
>
I apologize for posting this Kafka-related issue to the Spark list. I have
gotten no responses on the Kafka list and was hoping someone on this list
could shed some light on the below.
---
We're running into
, you
> may want to look through the kafka jira, e.g.
>
> https://issues.apache.org/jira/browse/KAFKA-899
>
>
> On Tue, Sep 29, 2015 at 8:05 AM, Dmitry Goldenberg <
> dgoldenberg...@gmail.com> wrote:
>
>> "more partitions and replicas than available brokers" --
release of Spark
> command line for running Spark job
>
> Cheers
>
> On Tue, Sep 29, 2015 at 1:37 PM, Dmitry Goldenberg <
> dgoldenberg...@gmail.com> wrote:
>
We're seeing this occasionally. Granted, this was caused by a wrinkle in
the Solr schema but this bubbled up all the way in Spark and caused job
failures.
I just checked and SolrException class is actually in the consumer job jar
we use. Is there any reason why Spark cannot find the
I'm actually not sure how either one of these would possibly cause Spark to
find SolrException. Whether the driver or executor class path is first,
should it not matter, if the class is in the consumer job jar?
On Tue, Sep 29, 2015 at 9:12 PM, Dmitry Goldenberg <dgoldenberg...@gmail.
wrote:
> Have you tried the following ?
> --conf spark.driver.userClassPathFirst=true --conf spark.executor.userClassPathFirst=true
>
> On Tue, Sep 29, 2015 at 4:38 PM, Dmitry Goldenberg <
> dgoldenberg...@gmail.com> wrote:
>
>> Release of Spar
Thanks, Ted, will try it out.
On Wed, Sep 30, 2015 at 9:07 AM, Ted Yu <yuzhih...@gmail.com> wrote:
> See the tail of this:
> https://bugzilla.redhat.com/show_bug.cgi?id=1005811
>
> FYI
>
> > On Sep 30, 2015, at 5:54 AM, Dmitry Goldenberg <dgoldenberg...@gmail.com>
Is there a way to ensure Spark doesn't write to the /tmp directory?
We've got spark.local.dir specified in the spark-defaults.conf file to
point at another directory. But we're seeing many of
these snappy-unknown-***-libsnappyjava.so files being written to /tmp still.
Is there a config setting or
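In case anyone else hits this, a sketch of the settings involved (directory paths are placeholders; to the best of my knowledge snappy-java honors the org.xerial.snappy.tempdir system property and otherwise falls back to java.io.tmpdir; the driver-side equivalents would need to go into spark-defaults.conf or on the spark-submit command line, since the driver JVM is already running by the time a SparkConf is built):

import org.apache.spark.SparkConf;

SparkConf conf = new SparkConf()
    .setAppName("TmpDirExample")
    // Scratch space for shuffle and spill files
    .set("spark.local.dir", "/data/spark-local")
    // The snappy native library is extracted to a temp dir controlled by JVM
    // properties, not by spark.local.dir, so redirect it on the executors too.
    .set("spark.executor.extraJavaOptions",
         "-Djava.io.tmpdir=/data/spark-tmp -Dorg.xerial.snappy.tempdir=/data/spark-tmp");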
If the protobuf jar is loaded ahead of hbase-protocol.jar, things start to get
> interesting ...
>
> On Tue, Sep 29, 2015 at 6:12 PM, Dmitry Goldenberg <
> dgoldenberg...@gmail.com> wrote:
>
>> Ted, I think I have tried these settings with the hbase protocol jar, to
inting to override
its checkpoint duration millis, is there? Is the default there
max(batchDurationMillis, 10 seconds)? Is there a way to override this?
Thanks.
On Wed, Sep 9, 2015 at 2:44 PM, Tathagata Das <t...@databricks.com> wrote:
>
>
> See inline.
>
> On Tue, Sep
straight from the dev guide
>> val broadcastVar = ssc.sparkContext.broadcast(Array(1, 2, 3))
>>
>> //Try to print out the value of the broadcast var here
>> val transformed = events.transform(rdd => {
>> rdd.map(x => {
>> if(broadcastVar == null) {
>>
I'd guess that if the resources are broadcast Spark would put them into
Tachyon...
> On Jan 12, 2016, at 7:04 AM, Dmitry Goldenberg <dgoldenberg...@gmail.com>
> wrote:
>
> Would it make sense to load them into Tachyon and read and broadcast them
> from there since Tach
Jorn, you said Ignite or ... ? What was the second choice you were thinking of?
It seems that got omitted.
> On Jan 12, 2016, at 2:44 AM, Jörn Franke <jornfra...@gmail.com> wrote:
>
> You can look at Ignite as an HDFS cache or for storing RDDs.
>
>> On 11 Jan 2016, at
nthan.com> wrote:
>
> One option could be to store them as blobs in a cache like Redis and then
> read + broadcast them from the driver. Or you could store them in HDFS and
> read + broadcast from the driver.
>
> Regards
> Sab
>
>> On Tue, Jan 12, 2016 at 1
ng <gene.p...@gmail.com> wrote:
> Hi Dmitry,
>
> Yes, Tachyon can help with your use case. You can read and write to
> Tachyon via the filesystem api (
> http://tachyon-project.org/documentation/File-System-API.html). There is
> a native Java API as well as a Hadoop-compatible API
in the same environment. For example, we use
virtualenv to run Spark with Python 2.7 and do not touch system Python 2.6.
Thank you,
Dmitry
09.01.2016, 06:36, "Sasha Kacanski" <skacan...@gmail.com>:
> +1
> Companies that use stock python in redhat 2.6 will need to upgrade or in