Regards,
Dibyendu
On Wed, May 13, 2015 at 7:35 PM, Cody Koeninger c...@koeninger.org
wrote:
As far as I can tell, Dibyendu's cons boil down to:
1. Spark checkpoints can't be recovered if you upgrade code
2. Some Spark transformations involve a shuffle, which can repartition
data
It's
I believe most ports are configurable at this point, look at
http://spark.apache.org/docs/latest/configuration.html
search for .port
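For example, a couple of those can be pinned via SparkConf; a rough sketch (exact keys are worth double-checking against that page, and the port values here are arbitrary):
import org.apache.spark.SparkConf
// Pin otherwise randomly chosen ports to fixed values.
val conf = new SparkConf()
  .set("spark.driver.port", "40000")
  .set("spark.blockManager.port", "40010")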
On Wed, May 13, 2015 at 9:38 AM, James King jakwebin...@gmail.com wrote:
I understand that this port value is randomly selected.
Is there a way to enforce
:
Hi Cody,
If you are so sure, can you share a benchmark (one you ran for days, maybe?) that you have done with the Kafka APIs provided by Spark?
Thanks
Best Regards
On Tue, May 12, 2015 at 7:22 PM, Cody Koeninger c...@koeninger.org
wrote:
I don't think it's accurate for Akhil to claim
they arrived after the driver reconnected
to Kafka
Is this what happens by default in your suggestion?
On Tue, May 12, 2015 at 3:52 PM, Cody Koeninger c...@koeninger.org
wrote:
I don't think it's accurate for Akhil to claim that the linked library is
much more flexible/reliable than what's available in Spark at this point.
James, what you're describing is the default behavior for the
createDirectStream api available as part of spark since 1.3. The kafka
parameter
Make sure to read
https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md
The directStream / KafkaRDD has a 1 : 1 relationship between kafka
topic/partition and spark partition. So a given spark partition only has
messages from 1 kafka topic. You can tell what topic that is
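A minimal sketch of getting at that information (assuming Spark 1.3's HasOffsetRanges; stream is the DStream returned by createDirectStream and processRecord is a placeholder):
import org.apache.spark.TaskContext
import org.apache.spark.streaming.kafka.HasOffsetRanges

stream.foreachRDD { rdd =>
  // offsetRanges(i) corresponds to the i-th spark partition of this rdd
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd.foreachPartition { iter =>
    val osr = offsetRanges(TaskContext.get.partitionId)
    // every message in this spark partition comes from this one kafka topic/partition
    println(s"topic=${osr.topic} partition=${osr.partition} offsets=[${osr.fromOffset},${osr.untilOffset})")
    iter.foreach(processRecord)   // processRecord is a placeholder
  }
}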
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.JdbcRDD
The arguments are sql string, lower bound, upper bound, number of
partitions.
Your call SELECT * FROM MEMBERS LIMIT ? OFFSET ?, 0, 100, 1
would thus be run as
SELECT * FROM MEMBERS LIMIT 0 OFFSET 100
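By contrast, a hedged sketch of the intended usage, with the two ? placeholders bound to a numeric key range rather than LIMIT/OFFSET (the connection URL, table and column names here are made up):
import java.sql.{DriverManager, ResultSet}
import org.apache.spark.rdd.JdbcRDD

val rdd = new JdbcRDD(
  sc,
  () => DriverManager.getConnection("jdbc:postgresql://dbhost/test"),   // hypothetical URL
  "SELECT * FROM MEMBERS WHERE ID >= ? AND ID <= ?",
  1L,        // lowerBound: smallest ID
  100000L,   // upperBound: largest ID
  4,         // numPartitions: each partition scans its own slice of the ID range
  (r: ResultSet) => r.getString("NAME"))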
it is not related to the data ingestion part.
On Wed, Apr 29, 2015 at 8:35 PM, Cody Koeninger c...@koeninger.org
wrote:
Use lsof to see what files are actually being held open.
That stacktrace looks to me like it's from the driver, not executors.
Where in foreach is it being called
Hadoop version doesn't matter if you're just using cassandra.
On Wed, Apr 29, 2015 at 12:08 PM, Matthew Johnson matt.john...@algomi.com
wrote:
Hi all,
I am new to Spark, but excited to use it with our Cassandra cluster. I
have read in a few places that Spark can interact directly with
The idea of peek vs poll doesn't apply to kafka, because kafka is not a
queue.
There are two ways of doing what you want, either using KafkaRDD or a
direct stream
The Kafka rdd approach would require you to find the beginning and ending
offsets for each partition. For an example of this, see
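Very roughly, the shape of the KafkaRDD approach (a sketch assuming Spark 1.3's KafkaUtils.createRDD; the broker, topic and offsets are placeholders you would have to determine yourself):
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")     // hypothetical broker
val offsetRanges = Array(OffsetRange("mytopic", 0, 0L, 1000L))      // hypothetical topic/partition/offsets

val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
  sc, kafkaParams, offsetRanges)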
Use lsof to see what files are actually being held open.
That stacktrace looks to me like it's from the driver, not executors.
Where in foreach is it being called? The outermost portion of foreachRDD
runs in the driver, the innermost portion runs in the executors. From the
docs:
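Setting the doc quote aside, roughly the shape of that split (dstream, createConnection and send are hypothetical stand-ins for whatever non-serializable resource is involved):
dstream.foreachRDD { rdd =>
  // this outer block runs on the driver, once per batch
  rdd.foreachPartition { records =>
    // this inner block runs on an executor, once per partition
    val conn = createConnection()               // hypothetical resource, built where it is used
    records.foreach(record => conn.send(record))
    conn.close()
  }
}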
As far as I know, createStream doesn't let you specify where receivers are
run.
createDirectStream in 1.3 doesn't use long-running receivers, so it is
likely to give you more even distribution of consumers across your workers.
On Mon, Apr 13, 2015 at 11:31 AM, Laeeq Ahmed
Connection pools aren't serializable, so you generally need to set them up
inside of a closure. Doing that for every item is wasteful, so you
typically want to use mapPartitions or foreachPartition
rdd.mapPartitions { part =>
  val pool = setupPool()   // build the non-serializable pool once per partition
  part.map { ...
See Design Patterns for using foreachRDD in
https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md
The kafka consumers run in the executors.
On Wed, Apr 1, 2015 at 11:18 AM, Neelesh neele...@gmail.com wrote:
With receivers, it was pretty obvious which code ran where - each receiver
occupied a core and ran on the
once, and no way of
refreshing them.
Thanks again!
On Wed, Apr 1, 2015 at 10:01 AM, Cody Koeninger c...@koeninger.org
wrote:
at 11:21 AM, Cody Koeninger c...@koeninger.org
wrote:
If you want to change topics from batch to batch, you can always just
create a KafkaRDD repeatedly.
The streaming code as it stands assumes a consistent set of topics
though. The implementation is private, so you can't subclass it without
This line
at org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.close(
KafkaRDD.scala:158)
is the attempt to close the underlying kafka simple consumer.
We can add a null pointer check, but the underlying issue of the consumer
being null probably indicates a problem earlier. Do you see
Have you tried instantiating the instance inside the closure, rather than
outside of it?
If that works, you may need to switch to use mapPartition /
foreachPartition for efficiency reasons.
On Mon, Mar 23, 2015 at 3:03 PM, Adelbert Chang adelbe...@gmail.com wrote:
Is there no way to pull out
KafkaUtils.createDirectStream, added in spark 1.3, will let you specify a
particular topic and partition
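A hedged sketch of pinning the stream to a single topic/partition via the fromOffsets overload (the broker, topic, partition and starting offset are placeholders):
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")   // hypothetical broker
val fromOffsets = Map(TopicAndPartition("mytopic", 0) -> 0L)      // hypothetical topic/partition/offset

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
  ssc, kafkaParams, fromOffsets,
  (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))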
On Thu, Mar 12, 2015 at 1:07 PM, Colin McQueen
colin.mcqu...@shiftenergy.com wrote:
Thanks! :)
Colin McQueen
*Software Developer*
On Thu, Mar 12, 2015 at 3:05 PM, Jeffrey Jedele
Have you already tried using the Vertica hadoop input format with spark? I
don't know how it's implemented, but I'd hope that it has some notion of
vertica-specific shard locality (which JdbcRDD does not).
If you're really constrained to consuming the result set in a single
thread, whatever
I'm a little confused by your comments regarding LIMIT. There's nothing
about JdbcRDD that depends on limit. You just need to be able to partition
your data in some way such that it has numeric upper and lower bounds.
Primary key range scans, not limit, would ordinarily be the best way to do
batch? To compare with storm
from a message ordering point of view, unless a tuple is fully processed by
the DAG (as defined by spout+bolts), the next tuple does not enter the DAG.
On Thu, Feb 19, 2015 at 9:47 PM, Cody Koeninger c...@koeninger.org
wrote:
Kafka ordering is guaranteed on a per
, it breaks our requirement that messages
be executed in order within a partition.
Thanks!
On Fri, Feb 20, 2015 at 7:03 AM, Cody Koeninger c...@koeninger.org
wrote:
For a given batch, for a given partition, the messages will be processed
in order by the executor that is running that partition
, 'no upper bound' (-1 didn't
work).
On Wed, Feb 18, 2015 at 11:59 PM, Cody Koeninger c...@koeninger.org
wrote:
Look at the definition of JdbcRDD.create:
def create[T](
sc: JavaSparkContext,
connectionFactory: ConnectionFactory,
sql: String,
lowerBound: Long
Kafka ordering is guaranteed on a per-partition basis.
The high-level consumer api as used by the spark kafka streams prior to 1.3
will consume from multiple kafka partitions, thus not giving any ordering
guarantees.
The experimental direct stream in 1.3 uses the simple consumer api, and
there
Take a look at
https://github.com/apache/spark/blob/v1.2.1/core/src/test/java/org/apache/spark/JavaJdbcRDDSuite.java
On Wed, Feb 18, 2015 at 11:14 AM, dgoldenberg dgoldenberg...@gmail.com
wrote:
I'm reading data from a database using JdbcRDD, in Java, and I have an
implementation of
is defined as public static class DbConn extends AbstractFunction0<Connection> implements Serializable
On Wed, Feb 18, 2015 at 1:20 PM, Cody Koeninger c...@koeninger.org
wrote:
That test I linked
https://github.com/apache/spark/blob/v1.2.1/core/src/test/java/org/apache/spark/JavaJdbcRDDSuite.java
-data-into-spark-using-jdbcrdd-in-java/.
It got around any of the compilation issues but then I got the runtime
error where Spark wouldn't recognize the db connection class as a
scala.Function0.
On Wed, Feb 18, 2015 at 12:37 PM, Cody Koeninger c...@koeninger.org
wrote:
Take a look at
https
to refactor out the custom Function classes such as
the one for getting a db connection or mapping ResultSet data to your own
POJO's rather than doing it all inline?
On Wed, Feb 18, 2015 at 1:52 PM, Cody Koeninger c...@koeninger.org
wrote:
Is sc there a SparkContext or a JavaSparkContext
outdata.foreachRDD( rdd => rdd.foreachPartition(rec => {
val writer = new
KafkaOutputService(otherConf(kafkaProducerTopic).toString, propsMap)
writer.output(rec)
}) )
So this is creating a new kafka producer for every new
That PR hasn't been updated since the new kafka streaming stuff (including
KafkaCluster) got merged to master, it will require more changes than
what's in there currently.
On Tue, Feb 10, 2015 at 9:25 AM, Sean Owen so...@cloudera.com wrote:
Yes, did you see the PR for SPARK-2808?
Take a look at the implementation linked from here
https://issues.apache.org/jira/browse/SPARK-4964
see if that would meet your needs
On Wed, Jan 14, 2015 at 9:58 PM, mykidong mykid...@gmail.com wrote:
Hi,
My Spark Streaming Job is doing like kafka etl to HDFS.
For instance, every 10 min.
Look at the method pipe
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD
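For example, a minimal sketch (the external executable name is made up; it would have to exist on every worker node):
// Feed each partition's elements, one per line, to an external process's stdin
// and get its stdout lines back as an RDD[String].
val piped = rdd.map(_.toString).pipe("/opt/bin/process_records")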
On Wed, Jan 14, 2015 at 11:16 PM, umanga bistauma...@gmail.com wrote:
This is a question I originally asked on Quora: http://qr.ae/6qjoI
We have some code written in C++ and Python that
If you don't care about the value that your map produced (because you're
not already collecting or saving it), then is foreach more appropriate to
what you're doing?
On Mon, Jan 12, 2015 at 4:08 AM, kevinkim kevin...@apache.org wrote:
Hi, answer from another Kevin.
I think you may already
You should take a look at
https://issues.apache.org/jira/browse/SPARK-4122
which is implementing writing to kafka in a pretty similar way (make a new
producer inside foreachPartition)
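Roughly the shape of that pattern, not the actual SPARK-4122 code (the old Kafka 0.8 producer API is assumed; broker list and topic are placeholders):
import java.util.Properties
import kafka.producer.{KeyedMessage, Producer, ProducerConfig}

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    // build the (non-serializable) producer on the executor, once per partition
    val props = new Properties()
    props.put("metadata.broker.list", "broker1:9092")                // hypothetical brokers
    props.put("serializer.class", "kafka.serializer.StringEncoder")
    val producer = new Producer[String, String](new ProducerConfig(props))
    records.foreach(msg => producer.send(new KeyedMessage("mytopic", msg)))   // hypothetical topic
    producer.close()
  }
}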
On Mon, Jan 12, 2015 at 5:24 AM, Sean Owen so...@cloudera.com wrote:
Leader-not-found suggests a problem with
At a quick glance, I think you're misunderstanding some basic features.
http://spark.apache.org/docs/latest/programming-guide.html#rdd-operations
Map is a transformation, it is lazy. You're not calling any action on the
result of map.
Also, closing over a mutable variable (like idx or
http://spark.apache.org/docs/latest/monitoring.html
http://spark.apache.org/docs/latest/configuration.html#spark-ui
spark.eventLog.enabled
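If it helps, a sketch of turning it on programmatically (the log directory is a made-up path; the same keys can also go in spark-defaults.conf):
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.dir", "hdfs:///user/spark/applicationHistory")   // hypothetical HDFS path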
On Mon, Jan 12, 2015 at 3:00 PM, ChongTang ct...@virginia.edu wrote:
Is there anybody who can help me with this? Thank you very much!
--
View this
enabled this option, and I saved logs
into Hadoop file system. The problem is, how can I get the duration of an
application? The attached file is the log I copied from HDFS.
On Mon, Jan 12, 2015 at 4:36 PM, Cody Koeninger c...@koeninger.org
wrote:
http://spark.apache.org/docs/latest
the highest priority.
Alex
On Sat, Jan 10, 2015 at 10:11 PM, Cody Koeninger c...@koeninger.org
wrote:
http://spark.apache.org/docs/latest/job-scheduling.html#configuring-pool-properties
Setting a high weight such as 1000 also makes it possible to implement
*priority* between pools—in essence, the weight-1000 pool will always get
to launch tasks first whenever it has jobs active.
On Sat, Jan 10,
But Xuelin already posted in the original message that the code was using
SET spark.sql.parquet.filterPushdown=true
On Wed, Jan 7, 2015 at 12:42 AM, Daniel Haviv danielru...@gmail.com wrote:
Quoting Michael:
Predicate push down into the input format is turned off by default because
there is
General ideas regarding too many open files:
Make sure ulimit is actually being set, especially if you're on mesos
(because of https://issues.apache.org/jira/browse/MESOS-123 ) Find the pid
of the executor process, and cat /proc/<pid>/limits
set spark.shuffle.consolidateFiles = true
try
No, most rdds partition input data appropriately.
On Tue, Jan 6, 2015 at 1:41 PM, Franc Carter franc.car...@rozettatech.com
wrote:
One more question, to clarify. Will every node pull in all the data?
thanks
On Tue, Jan 6, 2015 at 12:56 PM, Cody Koeninger c...@koeninger.org
wrote
No, not all rdds have location information, and in any case tasks may be
scheduled on non-local nodes if there is idle capacity.
see spark.locality.wait
http://spark.apache.org/docs/latest/configuration.html
On Tue, Jan 6, 2015 at 10:17 AM, gtinside gtins...@gmail.com wrote:
Does spark
I haven't tried it with spark specifically, but I've definitely run into
problems trying to depend on multiple versions of akka in one project.
On Sat, Jan 3, 2015 at 11:22 AM, Koert Kuipers ko...@tresata.com wrote:
hey Ted,
i am aware of the upgrade efforts for akka. however if spark 1.2
If you are not co-locating spark executor processes on the same machines
where the data is stored, and using an rdd that knows about which node to
prefer scheduling a task on, yes, the data will be pulled over the network.
Of the options you listed, S3 and DynamoDB cannot have spark running on
That sounds slow to me.
It looks like your sql query is grouping by a column that isn't in the
projections, I'm a little surprised that even works. But you're getting
the same time reducing manually?
Have you looked at the shuffle amounts in the UI for the job? Are you
certain there aren't a
JavaDataBaseConnectivity is, as far as I know, JVM specific. The JdbcRDD
is expecting to deal with Jdbc Connection and ResultSet objects.
I haven't done any python development in over a decade, but if someone
wants to work together on a python equivalent I'd be happy to help out.
The original
I'm not sure exactly what you're trying to do, but take a look at
rdd.toLocalIterator if you haven't already.
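i.e. something like:
// Pulls one partition at a time to the driver instead of collecting the whole RDD at once.
rdd.toLocalIterator.foreach(println)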
On Tue, Dec 30, 2014 at 6:16 AM, Sean Owen so...@cloudera.com wrote:
collect()-ing a partition still implies copying it to the driver, but
you're suggesting you can't collect() the
There are hardware recommendations at
http://spark.apache.org/docs/latest/hardware-provisioning.html but they're
overkill for just testing things out. You should be able to get meaningful
work done with 2 m3.large instances, for instance.
On Sat, Dec 27, 2014 at 8:27 AM, Amy Brown testingwithf...@gmail.com
Do you actually need spark streaming per se for your use case? If you're
just trying to read data out of kafka into hbase, would something like this
non-streaming rdd work for you:
https://github.com/koeninger/spark-1/tree/kafkaRdd/external/kafka/src/main/scala/org/apache/spark/rdd/kafka
Note
For an alternative take on a similar idea, see
https://github.com/koeninger/spark-1/tree/kafkaRdd/external/kafka/src/main/scala/org/apache/spark/rdd/kafka
An advantage of the approach I'm taking is that the lower and upper offsets
of the RDD are known in advance, so it's deterministic.
I