Exactly-once should be output-sink dependent; what sink was being used?

Sent from my iPhone

On Sep 14, 2023, at 4:52 PM, Jerry Peng wrote:

Craig,
Thanks! Please let us know the result!
Best,
Jerry

On Thu, Sep 14, 2023 at 12:22 PM Mich Talebzadeh wrote:

Hi Craig,
Can you please
Code is always distributed for any operations on a DataFrame or RDD. The size
of your code is irrelevant except with respect to JVM memory limits. For most
jobs the entire application jar and all its dependencies are put on the
classpath of every executor.
There are some exceptions but generally you should
There are a few things going on here.
1. Spark is lazy, so nothing happens until a result is collected back to the
driver or data is written to a sink. So the 1 line you see
is most likely just that trigger. Once triggered, all of the work required to
make that final result happen occurs. If
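As a small illustration of that laziness (assuming a spark-shell style SparkSession named spark; paths and column names are hypothetical):

import org.apache.spark.sql.functions.col

val df = spark.read.parquet("/data/input")                        // builds a plan, runs no job
val counts = df.filter(col("value") > 0).groupBy("key").count()   // still no job
counts.write.parquet("/data/output")                              // this one line triggers all of the work above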
This is probably because your data size is well under the broadcast join
threshold, so at the planning phase Spark decides to do a broadcast join
instead of a join that could take advantage of dynamic partition pruning. For
testing like this you can always disable that with
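The truncated sentence above presumably refers to the automatic broadcast join threshold; assuming that, a common way to disable it for a test session is:

// fall back to a regular (e.g. sort-merge) join by turning off auto broadcast
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
// or in SQL: SET spark.sql.autoBroadcastJoinThreshold=-1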
As Sean said, I believe you want to be setting:

spark.ui.retainedJobs (default 1000): How many jobs the Spark UI and status APIs remember before garbage collecting. This is a target maximum, and fewer elements may be retained in some circumstances. (since 1.2.0)
spark.ui.retainedStages (default 1000): How many stages the Spark UI and status APIs remember before garbage collecting.
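A sketch of how those could be raised, either at submit time or in spark-defaults.conf (the values below are just examples):

spark-submit \
  --conf spark.ui.retainedJobs=5000 \
  --conf spark.ui.retainedStages=5000 \
  ...

# or in conf/spark-defaults.conf
spark.ui.retainedJobs     5000
spark.ui.retainedStages   5000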
Could also be a transient object being referenced from within the custom code.
When serialized, the reference shows up as null even though you had set it in
the parent object.
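A small sketch of that failure mode outside of Spark, using plain Java serialization (the class names here are made up):

import java.io._

class ExpensiveClient { def ping(): String = "pong" }              // stand-in for some non-serializable client

class Holder extends Serializable {
  @transient val client: ExpensiveClient = new ExpensiveClient()   // set on the driver, marked transient
}

// roughly what happens when Spark ships the object to an executor
val bos = new ByteArrayOutputStream()
val oos = new ObjectOutputStream(bos)
oos.writeObject(new Holder())
oos.close()
val copy = new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray))
  .readObject().asInstanceOf[Holder]

println(copy.client)   // null: the transient reference was dropped during serialization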
> On Jun 30, 2021, at 4:44 PM, Sean Owen wrote:
>
> The error is in your code, which you don't show. You are almost
Is there any Java/Python code available?
>
> Best Regards,
> Eduardus Hardika Sandy Atmaja
> From: Russell Spitzer
> Sent: Monday, June 28, 2021 8:28 PM
> To: Eduardus Hardika Sandy Atmaja
> Cc: user@spark.apache.org
> Subject: Re: Request for FP-Growth source code
https://github.pie.apple.com/IPR/apache-spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/fpm/FPGrowth.scala
This?
On Mon, Jun 28, 2021 at 5:11 AM Eduardus Hardika Sandy Atmaja
wrote:
> Dear Apache Spark Admin
>
> Hello, my name is Edo. I am a Ph.D. student from India. Now I am
Callable means you tried to treat a field as a function like in the
following example
>>> fun = True
>>> fun()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'bool' object is not callable
My guess is that "isStreaming" is a bool, and in your syntax you used it as
a function
which IP or hostname of data-nodes returns
> from name-node to the Spark? Or can you offer me a debug approach?
>
>> On Farvardin 24, 1400 AP, at 17:45, Russell Spitzer
>> <russell.spit...@gmail.com> wrote:
>>
>> Data locality can only occur if the Spar
Data locality can only occur if the Spark Executor IP address string matches
the preferred location returned by the file system. So this job would only have
local tasks if the datanode replicas for the files in question had the same ip
address as the Spark executors you are using. If they don't
Could be that the driver JVM cannot handle the metadata required to store
the partition information of a 70k partition RDD. I see you say you have a
100GB driver but I'm not sure where you configured that?
Did you set --driver-memory 100G ?
On Thu, Apr 8, 2021 at 8:08 AM Weiand, Markus, NMA-CFD
Yep, this is the behavior for Insert Into; using the other write APIs does
schema matching, I believe.
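A hedged sketch of the difference, based on my reading of the DataFrameWriter docs (table and DataFrame names are placeholders):

// insertInto ignores column names and resolves columns by position
df.write.insertInto("target_table")

// appending with saveAsTable to an existing table resolves columns by name
df.write.mode("append").saveAsTable("target_table")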
> On Mar 3, 2021, at 8:29 AM, Sean Owen wrote:
>
> I don't have any good answer here, but, I seem to recall that this is because
> of SQL semantics, which follows column ordering not naming
The feature you are looking for is called "String Interpolation" and is
available in Python 3.6. It uses a different syntax than Scala's:
https://www.programiz.com/python-programming/string-interpolation
On Mon, Dec 7, 2020 at 7:05 AM Mich Talebzadeh
wrote:
> In Spark/Scala you can use 's'
The general exceptions here mean that components within the Spark cluster
can't communicate. The most common cause for this is failure of the
processes that are supposed to be communicating. I generally see this when
one of the processes goes into a GC storm or is shut down because of an
Well if the system doesn't change, then the data must be different. The
exact exception probably won't be helpful since it only tells us the last
allocation that failed. My guess is that your ingestion changed and there
is either now slightly more data than previously or it's skewed
differently.
--driver-class-path does not move jars, so it is dependent on your Spark
resource manager (master). It is interpreted literally, so if your files do
not exist in the location you provide relative to where the driver is run,
they will not be placed on the classpath.
Since the driver is responsible for
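A sketch of the distinction with hypothetical paths (my wording, not from the thread):

# --driver-class-path only edits the driver's classpath; the file must already exist where the driver runs
# --jars ships the jar and puts it on both the driver and executor classpaths
spark-submit \
  --driver-class-path /local/path/extra.jar \
  --jars /local/path/extra.jar \
  --class com.example.Main app.jar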
A where clause with a PK restriction should be identified by the Connector
and transformed into a single request. This should still be much slower
than doing the request directly but still much much faster than a full scan.
On Wed, Nov 4, 2020 at 12:51 PM Russell Spitzer
wrote:
>
Yes, the "Allow filtering" part isn't actually important other than for
letting the query run in the first place. A where clause that utilizes a
clustering column restriction will perform much better than a full scan,
column pruning as well can be extremely beneficial.
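As an illustration, a sketch using the DataFrame API (assumes the DataStax connector; keyspace, table and column names are made up):

import org.apache.spark.sql.functions.col

val events = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "ks", "table" -> "events"))
  .load()

// partition key + clustering column restrictions can be pushed down to Cassandra,
// and selecting only the needed columns avoids reading the rest of the row
events
  .filter(col("sensor_id") === "abc" && col("event_time") > "2020-11-01")
  .select("event_time", "value")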
On Wed, Nov 4, 2020 at
--jars adds only that jar.
--packages adds the jar and its dependencies as listed in Maven.
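For example (coordinates and paths are placeholders):

# only this jar is added
spark-submit --jars /path/to/lib.jar ...

# the artifact plus its transitive Maven dependencies are resolved and added
spark-submit --packages com.example:lib_2.12:1.0.0 ...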
On Tue, Oct 20, 2020 at 10:50 AM Mich Talebzadeh
wrote:
> Hi,
>
> I have a scenario that I use in Spark submit as follows:
>
> spark-submit --driver-class-path /home/hduser/jars/ddhybrid.jar --jars
>
As long as you don't use Python lambdas in your Spark job there should be
almost no difference between the Scala and Python DataFrame code. Once you
introduce Python lambdas you will hit some significant serialization
penalties, as well as have to run actual work code in Python. As long as no
You can't use df both as the name of the value returned by the Try and as the
name of the match variable in the Success case. You also probably want the
name of the variable bound in the match to line up with what the match returns.
So
val df = Try(spark.read.
format("jdbc").
option("url", jdbcUrl).
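The snippet above is cut off; a sketch of the full pattern with distinct names (options beyond the URL are placeholders):

import scala.util.{Failure, Success, Try}

val maybeDf = Try(
  spark.read
    .format("jdbc")
    .option("url", jdbcUrl)
    .option("dbtable", "some_table")
    .load()
)

maybeDf match {
  case Success(df) => df.show()                                // df is the variable bound by the match
  case Failure(e)  => println(s"read failed: ${e.getMessage}")
}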
> doesn't have to do anything with previous MapPartitionsRDD[7] at map at
> Job.scala:112
>
> Wed, 19 Aug 2020 at 16:01, Russell Spitzer:
>
>> Checkpoint is lazy and needs an action to actually do the work. The
>> method just marks the rdd as noted in the doc you pos
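A small illustration of the checkpoint behaviour described above (the checkpoint directory is a placeholder):

sc.setCheckpointDir("/tmp/checkpoints")

val rdd = sc.parallelize(1 to 100).map(_ * 2)
rdd.checkpoint()   // only marks the RDD for checkpointing
rdd.count()        // the action runs the job, and the checkpoint data is written as part of it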
/core/src/main/scala/org/apache/spark/SparkContext.scala#L1842
which places local jars on the driver-hosted file server and just leaves
remote jars as-is, with the path for executors to access them.
On Thu, Aug 13, 2020 at 11:01 PM Russell Spitzer
wrote:
> The driver hosts a file server wh
The driver hosts a file server which the executors download the jar from.
On Thu, Aug 13, 2020, 5:33 PM James Yu wrote:
> Hi,
>
> When I spark submit a Spark app with my app jar located in S3, obviously
> the Driver will download the jar from the s3 location. What is not clear
> to me is:
> Thank you so much.
>
> Can you elaborate on the differences between structured streaming vs
> dstreams? How would the number of receivers required etc change?
>
> On Sat, 8 Aug, 2020, 10:28 pm Russell Spitzer,
> wrote:
>
>> Note, none of this applies to Direct streaming a
Note, none of this applies to Direct streaming approaches, only receiver
based Dstreams.
You can think of a receiver as a long running task that never finishes.
Each receiver is submitted to an executor slot somewhere, it then runs
indefinitely and internally has a method which passes records
That's a bad error message. Basically you can't make a Spark-native catalog
reference for a DSv2 source. You have to use that datasource's catalog or
use the programmatic API. Both the DSv1 and DSv2 programmatic APIs work (plus
or minus some options)
On Mon, Aug 3, 2020, 7:28 AM Lavelle, Shawn wrote:
You do not need one Spark session per cluster.
Spark SQL with Datasource V1:
http://www.russellspitzer.com/2016/02/16/Multiple-Clusters-SparkSql-Cassandra/
Datasource V2 would require making two catalog references and then copying
between them.
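A sketch of the DSv2 route (catalog names, hosts, keyspace and table are hypothetical; assumes a connector version that ships a CassandraCatalog, roughly SCC 3.0+):

spark.conf.set("spark.sql.catalog.cluster_a", "com.datastax.spark.connector.datasource.CassandraCatalog")
spark.conf.set("spark.sql.catalog.cluster_a.spark.cassandra.connection.host", "host-a")
spark.conf.set("spark.sql.catalog.cluster_b", "com.datastax.spark.connector.datasource.CassandraCatalog")
spark.conf.set("spark.sql.catalog.cluster_b.spark.cassandra.connection.host", "host-b")

spark.sql("INSERT INTO cluster_b.ks.tab SELECT * FROM cluster_a.ks.tab")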
Usually this is just the sign that one of the executors quit unexpectedly,
which explains the dead executors you see in the UI. The next step is
usually to go and look at those executor logs and see if there's any reason
for the termination. If you end up seeing an abrupt truncation of the log
that
High GC is relatively hard to debug in general but I can give you a few
pointers. This basically means that the time spent cleaning up unused
objects is high, which usually means memory is being used and thrown away
rapidly. It can also mean that GC is ineffective, and is being run many
times in an
The code you linked to is very old and I don't think that method works
anymore (HiveContext no longer exists). My latest attempt at trying
this was Spark 2.2 and I ran into the issues I wrote about before.
In DSv2 it's done via a catalog implementation, so you basically can write
a new
Without seeing the rest (and you can confirm this by looking at the DAG
visualization in the Spark UI) I would say your first stage with 6
partitions is:
Stage 1: Read data from kinesis (or read blocks from receiver not sure what
method you are using) and write shuffle files for repartition
Stage
Without your code this is hard to determine but a few notes.
The number of partitions is usually dictated by the input source, see if it
has any configuration which allows you to increase input splits.
I'm not sure why you think some of the code is running on the driver. All
methods on
The last I looked into this, the answer is no. I believe since there is a
Spark Session internal relation cache, the only way to update a session's
information was a full drop and create. That was my experience with a
custom Hive metastore and entries read from it. I could change the entries
in the
Jul 2020 at 12:05, Jeff Evans
> wrote:
>
>> If you can't avoid it, you need to make use of the
>> spark.driver.userClassPathFirst and/or spark.executor.userClassPathFirst
>> properties.
>>
>> On Thu, Jul 16, 2020 at 2:03 PM Russell Spitzer <
>> russell.spit...@gma
I believe the main issue here is that spark.jars is a bit "too late" to
actually prepend things to the class path. For most use cases this value is
not read until after the JVM has already started and the system classloader
has already been created.
The jar argument gets added via the dynamic class
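A sketch of the knobs mentioned in this thread (paths are placeholders):

# child-first classloading: user jars win over Spark's copies inside the running JVM
spark-submit \
  --conf spark.driver.userClassPathFirst=true \
  --conf spark.executor.userClassPathFirst=true \
  --jars /path/to/my-version.jar \
  ...

# alternatively, actually prepend entries to the JVM classpath at launch time
spark-submit \
  --conf spark.driver.extraClassPath=/path/to/my-version.jar \
  --conf spark.executor.extraClassPath=/path/to/my-version.jar \
  ...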
I'm not sure what you're really trying to do here, but it sounds like saving
the data to a parquet file or other temporary store before truncating would
protect you in case of failure.
On Wed, Jul 1, 2020, 9:48 AM Amit Sharma wrote:
> Hi, i have scenario where i have to read certain raw from a
The connector uses Java driver CQL requests under the hood, which means it
responds to a changing database like a normal application would. This
means retries may result in a different set of data than the original
request if the underlying database changed.
On Fri, Jun 26, 2020, 9:42 PM Jungtaek
The error is in the Spark Standalone Worker. It's hitting an OOM while
launching/running an executor process. Specifically it's running out of
memory when parsing the hadoop configuration trying to figure out the
env/command line to run
I do not see any conflict. I'm not sure what the exact worry of
infringement is based on? The Apache license does not restrict anyone from
writing a book about a project.
On Sat, Apr 25, 2020, 10:48 AM Som Lima wrote:
> The text is very clear on the issue of copyright infringement. Ask
>
So one thing to know here is that all Java applications are going to use
many threads, and just because your particular main method doesn't spawn
additional threads doesn't mean a library you access or use won't spawn
additional threads. The other important note is that Spark doesn't actually
equate
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/InputPartition.java
See InputPartition, which has a preferredLocations method you should
override.
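A minimal sketch of what that could look like (class and field names are made up):

import org.apache.spark.sql.connector.read.InputPartition

// one split of the source, pinned to the host that holds the underlying data
case class MyInputPartition(path: String, host: String) extends InputPartition {
  override def preferredLocations(): Array[String] = Array(host)
}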
On Sat, Jan 18, 2020, 1:44 PM kineret M wrote:
> Hi,
> I would like to support data
Is there a chance your data is all even or all odd?
On Tue, Dec 17, 2019 at 11:01 AM Tzahi File wrote:
> I have in my spark sql query a calculated field that gets the value of
> field1 % 3.
>
> I'm using this field as a partition so I expected to get 3 partitions in
> the mentioned case, and I
My guess would be this is a Spark Version mismatch. The option is added
https://github.com/apache/spark/commit/df9a50637e2622a15e9af7d837986a0e868878b1
https://issues.apache.org/jira/browse/SPARK-27453
In 2.4.2
I would make sure your Spark installs are all 2.4.4 and that your code is
compiled
Spark can use the HiveMetastore as a catalog, but it doesn't use the hive
parser or optimization engine. Instead it uses Catalyst, see
https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
On Mon, Jun 10, 2019 at 2:07 PM naresh Goud
wrote:
> Hi Team,
>
> Is
ts of the plan refer to static pieces of
>> data ..."* Could you elaborate a bit more on what does this static
>> piece of data refer to? Are you referring to the 10 records that had
>> already arrived at T1 and are now sitting as old static data in the
>> unbounded t
Dataframes describe the calculation to be done, but the underlying
implementation is an "Incremental Query". That is that the dataframe code
is executed repeatedly with Catalyst adjusting the final execution plan on
each run. Some parts of the plan refer to static pieces of data, others
refer to
2.4.3 Binary is out now and they did change back to 2.11.
https://www.apache.org/dyn/closer.lua/spark/spark-2.4.3/spark-2.4.3-bin-hadoop2.7.tgz
On Mon, May 6, 2019 at 9:21 PM Russell Spitzer
wrote:
> Spark 2.4.2 was incorrectly released with the default package binaries set
> to Scal
Actually I just checked the release; they only changed the PySpark part. So
the download on the website will still be 2.12, so you'll need to build the
Scala 2.11 version of Spark if you want to use the connector. Or submit a
PR for Scala 2.12 support.
On Mon, May 6, 2019 at 9:21 PM Russell Spitzer
>
> On Monday, May 6, 2019, 18:34, Russell Spitzer
> wrote:
>
> Scala version mismatched
>
> Spark is shown at 2.12, the connector only has a 2.11 release
>
>
>
> On Mon, May 6, 2019, 7:59 PM Richard Xin
> wrote:
>
>
> org.apach
Scala version mismatched
Spark is shown at 2.12, the connector only has a 2.11 release
On Mon, May 6, 2019, 7:59 PM Richard Xin
wrote:
>   <dependency>
>     <groupId>org.apache.spark</groupId>
>     <artifactId>spark-core_2.12</artifactId>
>     <version>2.4.0</version>
>     <scope>compile</scope>
>   </dependency>
>   <dependency>
>     <groupId>org.apache.spark</groupId>
>     <artifactId>spark-sql_2.12</artifactId>
>     <version>2.4.0</version>
>   </dependency>
>
Run an "explain" instead of show; I'm betting it's casting tier_id to a
smallint to do the comparison.
On Wed, Feb 6, 2019 at 9:31 AM Artur Sukhenko
wrote:
> Hello guys,
> I am migrating from Spark 1.6 to 2.2 and have this issue:
> I am casting string to short and comparing them with equal .
>
Yes, Scala or Java.
No. Once you have written the implementation it is valid in all DataFrame APIs.
As for examples there are many; check the Kafka source in-tree or one of
the many sources listed on the Spark Packages website.
On Thu, Aug 30, 2018, 8:23 PM Ramaswamy, Muthuraman <
The answer is most likely that when you use cross Java/Python code you
incur a penalty for every object that you transform from a Java object
into a Python object (and then back again into a Java object) when data is
being passed in and out of your functions. A way around this would probably
be
You don't have to go through Hive. It's just Spark SQL. The application is
just a forked Hive Thrift Server.
On Fri, Aug 11, 2017 at 8:53 PM kant kodali wrote:
> @Ryan it looks like if I enable thrift server I need to go through hive. I
> was talking more about having JDBC
The SCC includes the Java driver, which means you could just use Java
driver functions. It also provides a serializable wrapper which has session
and prepared statement pooling. Something like:
val cc = CassandraConnector(sc.getConf)
SomeFunctionWithAnIterator{
it: SomeIterator =>
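The snippet above is truncated; a hedged sketch of how it might continue (the RDD, keyspace, table and query are hypothetical):

import com.datastax.spark.connector.cql.CassandraConnector

val cc = CassandraConnector(sc.getConf)

someRdd.mapPartitions { it =>
  cc.withSessionDo { session =>
    val stmt = session.prepare("SELECT value FROM ks.lookup WHERE id = ?")
    // materialize inside the block so the pooled session is still held while rows are fetched
    it.map(id => session.execute(stmt.bind(id)).one().getString("value")).toVector.iterator
  }
}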
Sorry if this is a double post, wasn't sure if I got through on my
forwarding.
I mentioned this in the RC2 note for 2.2.0 of Spark and I'm seeing it now
on the official release. The Spark Cassandra Connector integration tests
for the SCC now fail whenever trying to do something involving the
CassandraSource being transformed into the DataSourceScanExec SparkPlan.
=> println(it.length))
1
0
1
1
0
0
0
0
If we actually shuffle the data using the hash partitioner (using the
repartition command) we get the expected results.
scala> rdd1.repartition(8).foreachPartition(it => println(it.length))
0
0
0
0
0
0
0
3
On Sat, Jun 24, 2017 at 12:22 PM R
Neither of your code examples invokes a repartitioning. Add in a repartition
command.
On Sat, Jun 24, 2017, 11:53 AM Vikash Pareek
wrote:
> Hi Vadim,
>
> Thank you for your response.
>
> I would like to know how partitioner choose the key, If we look at my
>
Great catch Anastasios!
On Fri, Feb 17, 2017 at 9:59 AM Anastasios Zouzias
wrote:
> Hey,
>
> Can you try with the 2.11 spark-cassandra-connector? You just reported
> that you use spark-cassandra-connector*_2.10*-2.0.0-RC1.jar
>
> Best,
> Anastasios
>
> On Fri, Feb 17, 2017 at
You can treat Oracle as a JDBC source (
http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases)
and skip Sqoop and Hive tables and go straight to queries. Then you can skip
Hive on the way back out (see the same link) and write directly to Oracle.
I'll leave the
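A rough sketch of both directions (connection details and table names are placeholders; assumes the Oracle JDBC driver jar is on the classpath and a Spark 2.x SparkSession named spark):

import org.apache.spark.sql.functions.col

val source = spark.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")
  .option("dbtable", "SCOTT.SOURCE_TABLE")
  .option("user", "scott")
  .option("password", "tiger")
  .load()

val result = source.filter(col("amount") > 0)   // whatever transformations you need

result.write
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")
  .option("dbtable", "SCOTT.TARGET_TABLE")
  .option("user", "scott")
  .option("password", "tiger")
  .mode("append")
  .save()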
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/14_data_frames.md
Has all the information about DataFrames / Spark SQL.
On Fri, Nov 11, 2016 at 8:52 AM kant kodali wrote:
> Wait I cannot create CassandraSQLContext from spark-shell. is this only
> for
any conflicts with cached or older
> versions. I only have SPARK_HOME environment variable set (env variables
> related to Spark and Python).
>
> --
> *From:* Russell Spitzer <russell.spit...@gmail.com>
> *To:* Trivedi Amit <amit_...@yahoo.com>; "user@
Spark 2.0 defaults to Scala 2.11, so if you didn't build it yourself you
need the 2.11 artifact for the Spark Cassandra Connector.
On Wed, Sep 14, 2016 at 7:44 PM Trivedi Amit
wrote:
> Hi,
>
>
>
> I am testing a pyspark program that will read from a csv file and
This would also be a better question for the SCC user list :)
https://groups.google.com/a/lists.datastax.com/forum/#!forum/spark-connector-user
On Sun, Sep 4, 2016 at 9:31 AM Russell Spitzer <russell.spit...@gmail.com>
wrote:
>
> https://github.com/datastax/spark-cassandra-connector
https://github.com/datastax/spark-cassandra-connector/blob/v1.3.1/doc/14_data_frames.md
In Spark 1.3 it was illegal to use "table" as a key in Spark SQL so in that
version of Spark the connector needed to use the option "c_table"
val df = sqlContext.read.
|
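A guess at how that truncated read continued, based on the linked 1.3.1 docs (keyspace and table names are placeholders):

val df = sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("c_table" -> "words", "keyspace" -> "test"))
  .load()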
Cassandra does not differentiate between null and empty, so when reading
from C* all empty values are reported as null. To avoid inserting nulls
(avoiding tombstones) see
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/5_saving.md#globally-treating-all-nulls-as-unset
This
Spark Streaming does not process one event at a time, which is in general, I
think, what people call "streaming." It instead processes groups of events.
Each group is a "MicroBatch" that gets processed at the same time.
Streaming theoretically always has better latency because the event is
processed