Re: Data Duplication Bug Found - Structured Streaming Versions 3.4.1, 3.2.4, and 3.3.2

2023-09-14 Thread Russell Spitzer
Exactly once should be output sink dependent, what sink was being used? Sent from my iPhone. On Sep 14, 2023, at 4:52 PM, Jerry Peng wrote: Craig, Thanks! Please let us know the result! Best, Jerry. On Thu, Sep 14, 2023 at 12:22 PM Mich Talebzadeh wrote: Hi Craig, Can you please

Re: Spark Doubts

2022-06-25 Thread Russell Spitzer
Code is always distributed for any operations on a DataFrame or RDD. The size of your code is irrelevant except to JVM memory limits. For most jobs the entire application jar and all dependencies are put on the classpath of every executor. There are some exceptions but generally you should

Re: Why is spark running multiple stages with the same code line?

2022-04-21 Thread Russell Spitzer
There are a few things going on here. 1. Spark is lazy, so nothing happens until a result is collected back to the driver or data is written to a sink. So the 1 line you see is most likely just that trigger. Once triggered, all of the work required to make that final result happen occurs. If

Re: [Spark CORE][Spark SQL][Advanced]: Why dynamic partition pruning optimization does not work in this scenario?

2021-12-04 Thread Russell Spitzer
This is probably because your data size is well under the broadcastJoin threshold so at the planning phase it decides to do a BroadcastJoin instead of a Join which could take advantage of dynamic partition pruning. For testing like this you can always disable that with
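
For reference, a minimal sketch of disabling the broadcast threshold for this kind of test (the table names below are placeholders, not from the thread):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").appName("dpp-test").getOrCreate()

    // With a broadcast join the planner has no need for dynamic partition pruning,
    // so disable automatic broadcasting to force a shuffle-based join for the test.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

    // Hypothetical partitioned fact table joined to a small dimension table.
    val joined = spark.table("fact").join(spark.table("dim"), Seq("date_key"))
    joined.explain(true)   // look for dynamic pruning / PartitionFilters in the plan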

Re: Possibly a memory leak issue in Spark

2021-09-22 Thread Russell Spitzer
As Sean said I believe you want to be setting spark.ui.retainedJobs (default 1000): how many jobs the Spark UI and status APIs remember before garbage collecting. This is a target maximum, and fewer elements may be retained in some circumstances. (since 1.2.0) spark.ui.retainedStages (default 1000): how many
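
A minimal sketch of setting those properties (the values here are illustrative only):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("ui-retention-example")
      // Cap how many completed jobs and stages the UI/status APIs keep around.
      .config("spark.ui.retainedJobs", "200")
      .config("spark.ui.retainedStages", "200")
      .getOrCreate()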

Re: Spark Null Pointer Exception

2021-06-30 Thread Russell Spitzer
Could also be a transient object being referenced from within the custom code. When serialized, the reference shows up as null even though you had set it in the parent object. > On Jun 30, 2021, at 4:44 PM, Sean Owen wrote: > > The error is in your code, which you don't show. You are almost
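
A minimal sketch of that failure mode, with made-up class names:

    import org.apache.spark.sql.SparkSession

    class Helper { def transform(x: Int): Int = x + 1 }

    class Job extends Serializable {
      // Set on the driver but marked @transient, so it is NOT serialized with the task;
      // after deserialization on the executor this field is null.
      @transient val helper: Helper = new Helper

      def run(): Unit = {
        val spark = SparkSession.builder().master("local[*]").getOrCreate()
        // The closure references this.helper, which is null on the executor -> NPE.
        spark.sparkContext.parallelize(1 to 10).map(x => helper.transform(x)).collect()
      }
    }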

Re: Request for FP-Growth source code

2021-06-28 Thread Russell Spitzer
Is there any Java/Python code available? > > Best Regards, > Eduardus Hardika Sandy Atmaja > From: Russell Spitzer > Sent: Monday, June 28, 2021 8:28 PM > To: Eduardus Hardika Sandy Atmaja > Cc: user@spark.apache.org > Subject: Re: Request for FP-Growth source code

Re: Request for FP-Growth source code

2021-06-28 Thread Russell Spitzer
https://github.pie.apple.com/IPR/apache-spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/fpm/FPGrowth.scala This? On Mon, Jun 28, 2021 at 5:11 AM Eduardus Hardika Sandy Atmaja wrote: > Dear Apache Spark Admin > > Hello, my name is Edo. I am a Ph.D. student from India. Now I am

Re: Spark Structured Streaming 'bool' object is not callable, quitting

2021-04-21 Thread Russell Spitzer
Callable means you tried to treat a field as a function like in the following example >>> fun = True >>> fun() Traceback (most recent call last): File "", line 1, in TypeError: 'bool' object is not callable My guess is that "isStreaming" is a bool, and in your syntax you used it as a function

Re: [Spark Core][Advanced]: Problem with data locality when running Spark query with local nature on apache Hadoop

2021-04-13 Thread Russell Spitzer
which IP or hostname of data-nodes returns > from name-node to the spark? or can you offer me a debug approach? > >> On Farvardin 24, 1400 AP, at 17:45, Russell Spitzer >> mailto:russell.spit...@gmail.com>> wrote: >> >> Data locality can only occur if the Spar

Re: [Spark Core][Advanced]: Problem with data locality when running Spark query with local nature on apache Hadoop

2021-04-13 Thread Russell Spitzer
Data locality can only occur if the Spark Executor IP address string matches the preferred location returned by the file system. So this job would only have local tasks if the datanode replicas for the files in question had the same ip address as the Spark executors you are using. If they don't

Re: possible bug

2021-04-08 Thread Russell Spitzer
Could be that the driver JVM cannot handle the metadata required to store the partition information of a 70k partition RDD. I see you say you have a 100GB driver but I'm not sure where you configured that? Did you set --driver-memory 100G ? On Thu, Apr 8, 2021 at 8:08 AM Weiand, Markus, NMA-CFD

Re: [Spark SQL, intermediate+] possible bug or weird behavior of insertInto

2021-03-03 Thread Russell Spitzer
Yep, this is the behavior for Insert Into; the other write APIs do schema matching by name, I believe. > On Mar 3, 2021, at 8:29 AM, Sean Owen wrote: > > I don't have any good answer here, but, I seem to recall that this is because > of SQL semantics, which follows column ordering not naming
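
A small sketch of the positional behavior (table name and columns are placeholders):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq((1L, "a"), (2L, "b")).toDF("id", "name")

    // insertInto matches columns by position, not by name, so explicitly select
    // the columns in the same order as the target table's schema before writing.
    df.select("id", "name").write.insertInto("target_table")   // hypothetical table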

Re: substitution invocator for a variable in PyCharm sql

2020-12-07 Thread Russell Spitzer
The feature you are looking for is called "String Interpolation" and is available in Python 3.6. It uses a different syntax than Scala's https://www.programiz.com/python-programming/string-interpolation On Mon, Dec 7, 2020 at 7:05 AM Mich Talebzadeh wrote: > In Spark/Scala you can use 's'
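
For the Scala side mentioned in the quoted question, a tiny sketch of the 's' interpolator (table and column names are made up; assumes an existing SparkSession named spark):

    val tableName = "sales"          // placeholder
    val cutoff = "2020-12-01"
    // The s"..." interpolator substitutes Scala variables into the SQL string.
    val query = s"SELECT * FROM $tableName WHERE created_at >= '$cutoff'"
    spark.sql(query).show()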

Re: Spark Exception

2020-11-20 Thread Russell Spitzer
The general exceptions here mean that components within the Spark cluster can't communicate. The most common cause for this is failure of the processes that are supposed to be communicating. I generally see this when one of the processes goes into a GC storm or is shut down because of an

Re: Out of memory issue

2020-11-20 Thread Russell Spitzer
Well if the system doesn't change, then the data must be different. The exact exception probably won't be helpful since it only tells us the last allocation that failed. My guess is that your ingestion changed and there is either now slightly more data than previously or it's skewed differently.

Re: Path of jars added to a Spark Job - spark-submit // // Override jars in spark submit

2020-11-12 Thread Russell Spitzer
--driver-class-path does not move jars, so it is dependent on your Spark resource manager (master). It is interpreted literally, so if your files do not exist in the location you provide relative to where the driver is run, they will not be placed on the classpath. Since the driver is responsible for

Re: Spark reading from cassandra

2020-11-04 Thread Russell Spitzer
A where clause with a PK restriction should be identified by the Connector and transformed into a single request. This should still be much slower than doing the request directly but still much much faster than a full scan. On Wed, Nov 4, 2020 at 12:51 PM Russell Spitzer wrote: >

Re: Spark reading from cassandra

2020-11-04 Thread Russell Spitzer
Yes, the "Allow filtering" part isn't actually important other than for letting the query run in the first place. A where clause that utilizes a clustering column restriction will perform much better than a full scan, column pruning as well can be extremely beneficial. On Wed, Nov 4, 2020 at

Re: Why spark-submit works with package not with jar

2020-10-20 Thread Russell Spitzer
--jars adds only that jar. --packages adds the jar and its dependencies listed in Maven. On Tue, Oct 20, 2020 at 10:50 AM Mich Talebzadeh wrote: > Hi, > > I have a scenario that I use in Spark submit as follows: > > spark-submit --driver-class-path /home/hduser/jars/ddhybrid.jar --jars >

Re: Scala vs Python for ETL with Spark

2020-10-09 Thread Russell Spitzer
any > loss, damage or destruction of data or any other property which may arise > from relying on this email's technical content is explicitly disclaimed. > The author will in no case be liable for any monetary damages arising from > such loss, damage or destruction. > > >

Re: Scala vs Python for ETL with Spark

2020-10-09 Thread Russell Spitzer
As long as you don't use python lambdas in your Spark job there should be almost no difference between the Scala and Python dataframe code. Once you introduce python lambdas you will hit some significant serialization penalties as well as have to run actual work code in python. As long as no

Re: Exception handling in Spark throws recursive value for DF needs type error

2020-10-01 Thread Russell Spitzer
You can't use df as the name of the return from the try and the name of the match variable in success. You also probably want to match the name of the variable in the match with the return from the match. So val df = Try(spark.read. format("jdbc"). option("url", jdbcUrl).
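
A minimal sketch of the shape being suggested (the JDBC URL and table are placeholders):

    import scala.util.{Try, Success, Failure}
    import org.apache.spark.sql.{DataFrame, SparkSession}

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    val jdbcUrl = "jdbc:postgresql://dbhost:5432/mydb"   // placeholder

    // Give the Try result its own name; bind the matched value to df.
    val attempt: Try[DataFrame] = Try(
      spark.read.format("jdbc")
        .option("url", jdbcUrl)
        .option("dbtable", "some_table")
        .load()
    )

    val df: DataFrame = attempt match {
      case Success(frame) => frame
      case Failure(e)     => throw new IllegalStateException("jdbc read failed", e)
    }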

Re: RDD which was checkpointed is not checkpointed

2020-08-19 Thread Russell Spitzer
> doesn't have to do anything with previous MapPartitionsRDD[7] at map at > Job.scala:112 > > Wed, 19 Aug 2020 at 16:01, Russell Spitzer : > >> Checkpoint is lazy and needs an action to actually do the work. The >> method just marks the rdd as noted in the doc you pos
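
A minimal sketch of the lazy checkpoint behavior described above (the checkpoint directory is a placeholder):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    val sc = spark.sparkContext
    sc.setCheckpointDir("/tmp/checkpoints")   // placeholder path

    val rdd = sc.parallelize(1 to 100).map(_ * 2)
    rdd.checkpoint()                // lazy: only marks the RDD for checkpointing
    println(rdd.isCheckpointed)     // false - nothing has run yet
    rdd.count()                     // action: computes the lineage and writes the checkpoint
    println(rdd.isCheckpointed)     // true after the action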

Re: Where do the executors get my app jar from?

2020-08-13 Thread Russell Spitzer
/core/src/main/scala/org/apache/spark/SparkContext.scala#L1842 Which places local jars on the driver hosted file server and just leaves Remote Jars as is with the path for executors to access them On Thu, Aug 13, 2020 at 11:01 PM Russell Spitzer wrote: > The driver hosts a file server wh

Re: Where do the executors get my app jar from?

2020-08-13 Thread Russell Spitzer
The driver hosts a file server which the executors download the jar from. On Thu, Aug 13, 2020, 5:33 PM James Yu wrote: > Hi, > > When I spark submit a Spark app with my app jar located in S3, obviously > the Driver will download the jar from the s3 location. What is not clear > to me is:

Re: Spark streaming receivers

2020-08-10 Thread Russell Spitzer
Thank you so much. > > Can you elaborate on the differences between structured streaming vs > dstreams? How would the number of receivers required etc change? > > On Sat, 8 Aug, 2020, 10:28 pm Russell Spitzer, > wrote: > >> Note, none of this applies to Direct streaming a

Re: Spark streaming receivers

2020-08-08 Thread Russell Spitzer
Note, none of this applies to Direct streaming approaches, only receiver based Dstreams. You can think of a receiver as a long running task that never finishes. Each receiver is submitted to an executor slot somewhere, it then runs indefinitely and internally has a method which passes records

Re: DataSource API v2 & Spark-SQL

2020-08-03 Thread Russell Spitzer
That's a bad error message. Basically you can't make a Spark native catalog reference for a dsv2 source. You have to use that Datasource's catalog or use the programmatic API. Both dsv1 and dsv2 programmatic APIs work (plus or minus some options) On Mon, Aug 3, 2020, 7:28 AM Lavelle, Shawn wrote:

Re: how to copy from one cassandra cluster to another

2020-07-28 Thread Russell Spitzer
You do not need one spark session per cluster. Spark SQL with Datasource v1: http://www.russellspitzer.com/2016/02/16/Multiple-Clusters-SparkSql-Cassandra/ DatasourceV2 would require making two catalog references and then copying between them
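
A rough sketch of what the two-catalog copy could look like with the connector's DataSource V2 catalog (hosts, catalog names, and keyspace/table names are placeholders; check the connector docs for the exact catalog class and options for your version):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .config("spark.sql.catalog.cluster_a", "com.datastax.spark.connector.datasource.CassandraCatalog")
      .config("spark.sql.catalog.cluster_a.spark.cassandra.connection.host", "host-a")
      .config("spark.sql.catalog.cluster_b", "com.datastax.spark.connector.datasource.CassandraCatalog")
      .config("spark.sql.catalog.cluster_b.spark.cassandra.connection.host", "host-b")
      .getOrCreate()

    // Read from one cluster's catalog and append into the other's.
    spark.table("cluster_a.my_keyspace.my_table")
      .writeTo("cluster_b.my_keyspace.my_table")
      .append()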

Re: spark exception

2020-07-24 Thread Russell Spitzer
Usually this is just the sign that one of the executors quit unexpectedly which explains the dead executors you see in the ui. The next step is usually to go and look at those executor logs and see if there's any reason for the termination. if you end up seeing an abrupt truncation of the log that

Re: Garbage collection issue

2020-07-20 Thread Russell Spitzer
High GC is relatively hard to debug in general but I can give you a few pointers. This basically means that the time spent cleaning up unused objects is high, which usually means memory is being used and thrown away rapidly. It can also mean that GC is ineffective, and is being run many times in an

Re: schema changes of custom data source in persistent tables DataSourceV1

2020-07-20 Thread Russell Spitzer
The code you linked to is very old and I don't think that method works anymore (Hive context not existing anymore). My latest attempt at trying this was Spark 2.2 and I ran into the issues I wrote about before. In DSV2 it's done via a catalog implementation, so you basically can write a new

Re: Spark Streaming - Set Parallelism and Optimize driver

2020-07-20 Thread Russell Spitzer
Without seeing the rest (and you can confirm this by looking at the DAG visualization in the Spark UI) I would say your first stage with 6 partitions is: Stage 1: Read data from kinesis (or read blocks from receiver not sure what method you are using) and write shuffle files for repartition Stage

Re: Spark Streaming - Set Parallelism and Optimize driver

2020-07-20 Thread Russell Spitzer
Without your code this is hard to determine but a few notes. The number of partitions is usually dictated by the input source, see if it has any configuration which allows you to increase input splits. I'm not sure why you think some of the code is running on the driver. All methods on

Re: schema changes of custom data source in persistent tables DataSourceV1

2020-07-20 Thread Russell Spitzer
The last I looked into this the answer is no. I believe since there is a Spark Session internal relation cache, the only way to update a sessions information was a full drop and create. That was my experience with a custom hive metastore and entries read from it. I could change the entries in the

Re: Using spark.jars conf to override jars present in spark default classpath

2020-07-16 Thread Russell Spitzer
Jul 2020 at 12:05, Jeff Evans > wrote: > >> If you can't avoid it, you need to make use of the >> spark.driver.userClassPathFirst and/or spark.executor.userClassPathFirst >> properties. >> >> On Thu, Jul 16, 2020 at 2:03 PM Russell Spitzer < >> russell.spit...@gma

Re: Using spark.jars conf to override jars present in spark default classpath

2020-07-16 Thread Russell Spitzer
I believe the main issue here is that spark.jars is a bit "too late" to actually prepend things to the class path. For most use cases this value is not read until after the JVM has already started and the system classloader has already loaded. The jar argument gets added via the dynamic class

Re: Truncate table

2020-07-01 Thread Russell Spitzer
I'm not sure what you're really trying to do here but it sounds like saving the data to a parquet file or other temporary store before truncating would protect you in case of failure. On Wed, Jul 1, 2020, 9:48 AM Amit Sharma wrote: > Hi, I have a scenario where I have to read certain rows from a
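
A rough sketch of that protect-then-truncate flow (paths and table names are placeholders; assumes an existing SparkSession named spark):

    // 1. Snapshot the rows you need to a temporary parquet location first.
    val snapshot = spark.table("source_table")
    snapshot.write.mode("overwrite").parquet("/tmp/truncate_backup")   // placeholder path

    // 2. Only after the snapshot is safely written, truncate and reload.
    spark.sql("TRUNCATE TABLE source_table")
    spark.read.parquet("/tmp/truncate_backup").write.insertInto("source_table")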

Re: [Structured spak streaming] How does cassandra connector readstream deals with deleted record

2020-06-26 Thread Russell Spitzer
The connector uses Java driver cql request under the hood which means it responds to the changing database like a normal application would. This means retries may result in a different set of data than the original request if the underlying database changed. On Fri, Jun 26, 2020, 9:42 PM Jungtaek

Re: java.lang.OutOfMemoryError Spark Worker

2020-05-08 Thread Russell Spitzer
The error is in the Spark Standalone Worker. It's hitting an OOM while launching/running an executor process. Specifically it's running out of memory when parsing the hadoop configuration trying to figure out the env/command line to run

Re: Copyright Infringment

2020-04-25 Thread Russell Spitzer
I do not see any conflict. I'm not sure what the exact worry of infringement is based on? The Apache license does not restrict anyone from writing a book about a project. On Sat, Apr 25, 2020, 10:48 AM Som Lima wrote: > The text is very clear on the issue of copyright infringement. Ask >

Re: Spark driver thread

2020-03-06 Thread Russell Spitzer
So one thing to know here is that all Java applications are going to use many threads, and just because your particular main method doesn't spawn additional threads doesn't mean library you access or use won't spawn additional threads. The other important note is that Spark doesn't actually equate

Re: How to implement "getPreferredLocations" in Data source v2?

2020-01-18 Thread Russell Spitzer
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/InputPartition.java See InputPartition, which has a preferredLocations method you should override. On Sat, Jan 18, 2020, 1:44 PM kineret M wrote: > Hi, > I would like to support data
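
A minimal sketch of overriding it (the partition fields are made up):

    import org.apache.spark.sql.connector.read.InputPartition

    // A custom DSv2 partition that reports which hosts hold its data; Spark's scheduler
    // uses these strings when trying to place the task on a matching executor host.
    case class MyInputPartition(start: Long, end: Long, hosts: Array[String])
        extends InputPartition {
      override def preferredLocations(): Array[String] = hosts
    }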

Re: Issue With mod function in Spark SQL

2019-12-17 Thread Russell Spitzer
Is there a chance your data is all even or all odd? On Tue, Dec 17, 2019 at 11:01 AM Tzahi File wrote: > I have in my spark sql query a calculated field that gets the value of > field1 % 3. > > I'm using this field as a partition so I expected to get 3 partitions in > the mentioned case, and I

Re: error , saving dataframe , LEGACY_PASS_PARTITION_BY_AS_OPTIONS

2019-11-13 Thread Russell Spitzer
My guess would be this is a Spark version mismatch. The option was added in 2.4.2: https://github.com/apache/spark/commit/df9a50637e2622a15e9af7d837986a0e868878b1 https://issues.apache.org/jira/browse/SPARK-27453 I would make sure your Spark installs are all 2.4.4 and that your code is compiled

Re: Spark SQL

2019-06-10 Thread Russell Spitzer
Spark can use the HiveMetastore as a catalog, but it doesn't use the hive parser or optimization engine. Instead it uses Catalyst, see https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html On Mon, Jun 10, 2019 at 2:07 PM naresh Goud wrote: > Hi Team, > > Is

Re: Are Spark Dataframes mutable in Structured Streaming?

2019-05-16 Thread Russell Spitzer
parts of the plan refer to static pieces of >> data ..."* Could you elaborate a bit more on what this static >> piece of data refers to? Are you referring to the 10 records that had >> already arrived at T1 and are now sitting as old static data in the >> unbounded t

Re: Are Spark Dataframes mutable in Structured Streaming?

2019-05-15 Thread Russell Spitzer
Dataframes describe the calculation to be done, but the underlying implementation is an "Incremental Query". That is that the dataframe code is executed repeatedly with Catalyst adjusting the final execution plan on each run. Some parts of the plan refer to static pieces of data, others refer to

Re: spark-cassandra-connector_2.1 caused java.lang.NoClassDefFoundError under Spark 2.4.2?

2019-05-09 Thread Russell Spitzer
2.4.3 Binary is out now and they did change back to 2.11. https://www.apache.org/dyn/closer.lua/spark/spark-2.4.3/spark-2.4.3-bin-hadoop2.7.tgz On Mon, May 6, 2019 at 9:21 PM Russell Spitzer wrote: > Spark 2.4.2 was incorrectly released with the default package binaries set > to Scal

Re: spark-cassandra-connector_2.1 caused java.lang.NoClassDefFoundError under Spark 2.4.2?

2019-05-06 Thread Russell Spitzer
Actually i just checked the release, they only changed the pyspark part. So the download on the website will still be 2.12 so you'll need to build the scala 2.11 version of Spark if you want to use the connector. Or Submit a PR for scala 2.12 support On Mon, May 6, 2019 at 9:21 PM Russell Spitzer

Re: spark-cassandra-connector_2.1 caused java.lang.NoClassDefFoundError under Spark 2.4.2?

2019-05-06 Thread Russell Spitzer
oo.com/?.src=iOS> > > On Monday, May 6, 2019, 18:34, Russell Spitzer > wrote: > > Scala version mismatched > > Spark is shown at 2.12, the connector only has a 2.11 release > > > > On Mon, May 6, 2019, 7:59 PM Richard Xin > wrote: > > > org.apach

Re: spark-cassandra-connector_2.1 caused java.lang.NoClassDefFoundError under Spark 2.4.2?

2019-05-06 Thread Russell Spitzer
Scala version mismatch. Spark is shown at 2.12; the connector only has a 2.11 release. On Mon, May 6, 2019, 7:59 PM Richard Xin wrote: > > org.apache.spark > spark-core_2.12 > 2.4.0 > compile > > > org.apache.spark > spark-sql_2.12 > 2.4.0 > > >

Re: 3 equalTo "3.15" = true

2019-02-06 Thread Russell Spitzer
Run an "explain" instead of show, i'm betting it's casting tier_id to a small_int to do the comparison On Wed, Feb 6, 2019 at 9:31 AM Artur Sukhenko wrote: > Hello guys, > I am migrating from Spark 1.6 to 2.2 and have this issue: > I am casting string to short and comparing them with equal . >

Re: Structured Streaming : Custom Source and Sink Development and PySpark.

2018-08-30 Thread Russell Spitzer
Yes, Scala or Java. No. Once you have written the implementation it is valid in all df apis. As for examples there are many, check the Kafka source in tree or one of the many sources listed on the spark packages website. On Thu, Aug 30, 2018, 8:23 PM Ramaswamy, Muthuraman <

Re: [Spark2.1] SparkStreaming to Cassandra performance problem

2018-05-21 Thread Russell Spitzer
The answer is most likely that when you use cross Java - Python code you incur a penalty for every object that you transform from a Java object into a Python object (and then back again into a Java object) when data is being passed in and out of your functions. A way around this would probably be

Re: Does Spark SQL uses Calcite?

2017-08-12 Thread Russell Spitzer
You don't have to go through hive. It's just spark sql. The application is just a forked hive thrift server. On Fri, Aug 11, 2017 at 8:53 PM kant kodali wrote: > @Ryan it looks like if I enable thrift server I need to go through hive. I > was talking more about having JDBC

Re: Spark (SQL / Structured Streaming) Cassandra - PreparedStatement

2017-07-21 Thread Russell Spitzer
The SCC includes the Java driver, which means you could just use Java driver functions. It also provides a serializable wrapper which has session and prepared statement pooling. Something like val cc = CassandraConnector(sc.getConf) SomeFunctionWithAnIterator{ it: SomeIterator =>
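
A rough sketch of using that wrapper with a prepared statement (keyspace, table, and column names are placeholders; the exact driver Session type depends on the connector version, and the connector's connection settings are assumed to be configured):

    import com.datastax.spark.connector.cql.CassandraConnector
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    val connector = CassandraConnector(spark.sparkContext.getConf)

    val results = spark.sparkContext.parallelize(Seq(1, 2, 3)).mapPartitions { ids =>
      connector.withSessionDo { session =>
        // Sessions and prepared statements are pooled/cached by the wrapper.
        val prepared = session.prepare("SELECT value FROM my_ks.my_table WHERE id = ?")
        // Materialize inside the block so the session isn't released mid-iteration.
        ids.map { id =>
          val row = session.execute(prepared.bind(Int.box(id))).one()
          id -> Option(row).map(_.getString("value"))
        }.toList
      }.iterator
    }.collect()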

None.get on Redact in DataSourceScanExec

2017-07-13 Thread Russell Spitzer
Sorry if this is a double post, wasn't sure if I got through on my forwarding. I mentioned this in the RC2 note for 2.2.0 of Spark and I'm seeing it now on the official release. Running the Spark Cassandra Connector integration tests for the SCC now fails whenever trying to do something involving

Fwd: None.get on Redact in DataSourceScanExec

2017-07-13 Thread Russell Spitzer
I mentioned this in the RC2 note for 2.2.0 of Spark and I'm seeing it now on the official release. Running the Spark Cassandra Connector integration tests for the SCC now fails whenever trying to do something involving the CassandraSource being transformed into the DataSourceScanExec SparkPlan.

Re: How does HashPartitioner distribute data in Spark?

2017-06-25 Thread Russell Spitzer
=> println(it.length)) 1 0 1 1 0 0 0 0 If we actually shuffle the data using the hash partitioner (using the repartition command) we get the expected results. scala> rdd1.repartition(8).foreachPartition(it => println(it.length)) 0 0 0 0 0 0 0 3 On Sat, Jun 24, 2017 at 12:22 PM R

Re: How does HashPartitioner distribute data in Spark?

2017-06-24 Thread Russell Spitzer
Neither of your code examples invokes a repartitioning. Add in a repartition command. On Sat, Jun 24, 2017, 11:53 AM Vikash Pareek wrote: > Hi Vadim, > > Thank you for your response. > > I would like to know how partitioner choose the key, If we look at my >
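
A minimal sketch of adding the shuffle (it mirrors the numbers shown in the earlier message of this thread):

    import org.apache.spark.HashPartitioner
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val rdd1 = sc.parallelize(Seq(1, 2, 3), 8)

    // No shuffle: elements stay wherever parallelize placed them.
    rdd1.foreachPartition(it => println(it.length))

    // repartition (or partitionBy on a pair RDD) actually applies the partitioner.
    rdd1.repartition(8).foreachPartition(it => println(it.length))
    rdd1.map(x => (x, x)).partitionBy(new HashPartitioner(8)).foreachPartition(it => println(it.length))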

Re: I am not sure why I am getting java.lang.NoClassDefFoundError

2017-02-17 Thread Russell Spitzer
Great catch Anastasios! On Fri, Feb 17, 2017 at 9:59 AM Anastasios Zouzias wrote: > Hey, > > Can you try with the 2.11 spark-cassandra-connector? You just reported > that you use spark-cassandra-connector*_2.10*-2.0.0-RC1.jar > > Best, > Anastasios > > On Fri, Feb 17, 2017 at

Re: spark architecture question -- Pleas Read

2017-01-27 Thread Russell Spitzer
You can treat Oracle as a JDBC source ( http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases) and skip Sqoop, HiveTables and go straight to Queries. Then you can skip hive on the way back out (see the same link) and write directly to Oracle. I'll leave the
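
A rough sketch of that JDBC round trip (URL, credentials, and table names are placeholders; assumes an existing SparkSession named spark and the Oracle JDBC driver on the classpath):

    val oracleUrl = "jdbc:oracle:thin:@//dbhost:1521/ORCL"   // placeholder

    // Read straight from Oracle as a JDBC source - no Sqoop or Hive staging tables.
    val source = spark.read.format("jdbc")
      .option("url", oracleUrl)
      .option("dbtable", "MYSCHEMA.SOURCE_TABLE")
      .option("user", "etl_user")
      .option("password", "etl_password")
      .load()

    val transformed = source.filter("AMOUNT > 0")   // whatever processing you need

    // Write the result directly back to Oracle, again skipping Hive.
    transformed.write.format("jdbc")
      .option("url", oracleUrl)
      .option("dbtable", "MYSCHEMA.TARGET_TABLE")
      .option("user", "etl_user")
      .option("password", "etl_password")
      .mode("append")
      .save()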

Re: How to use Spark SQL to connect to Cassandra from Spark-Shell?

2016-11-11 Thread Russell Spitzer
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/14_data_frames.md Has all the information about Dataframes/ SparkSql On Fri, Nov 11, 2016 at 8:52 AM kant kodali wrote: > Wait I cannot create CassandraSQLContext from spark-shell. is this only > for
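
From the spark-shell (where spark already exists), the DataFrame read described in that doc looks roughly like this sketch (keyspace, table, and column names are placeholders):

    val df = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "my_ks", "table" -> "my_table"))
      .load()

    // Register it and query with plain Spark SQL instead of the old CassandraSQLContext.
    df.createOrReplaceTempView("my_table")
    spark.sql("SELECT * FROM my_table WHERE some_column = 'x'").show()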

Re: Write to Cassandra table from pyspark fails with scala reflect error

2016-09-15 Thread Russell Spitzer
any conflicts with cached or older > versions. I only have SPARK_HOME environment variable set (env variables > related to Spark and Python). > > -- > *From:* Russell Spitzer <russell.spit...@gmail.com> > *To:* Trivedi Amit <amit_...@yahoo.com>; "user@

Re: Write to Cassandra table from pyspark fails with scala reflect error

2016-09-14 Thread Russell Spitzer
Spark 2.0 defaults to Scala 2.11, so if you didn't build it yourself you need the 2.11 artifact for the Spark Cassandra Connector. On Wed, Sep 14, 2016 at 7:44 PM Trivedi Amit wrote: > Hi, > > > > I am testing a pyspark program that will read from a csv file and

Re: spark cassandra issue

2016-09-04 Thread Russell Spitzer
This would also be a better question for the SCC user list :) https://groups.google.com/a/lists.datastax.com/forum/#!forum/spark-connector-user On Sun, Sep 4, 2016 at 9:31 AM Russell Spitzer <russell.spit...@gmail.com> wrote: > > https://github.com/datastax/spark-cassandra-connector

Re: spark cassandra issue

2016-09-04 Thread Russell Spitzer
https://github.com/datastax/spark-cassandra-connector/blob/v1.3.1/doc/14_data_frames.md In Spark 1.3 it was illegal to use "table" as a key in Spark SQL so in that version of Spark the connector needed to use the option "c_table" val df = sqlContext.read. |

Re: Insert non-null values from dataframe

2016-08-26 Thread Russell Spitzer
Cassandra does not differentiate between null and empty, so when reading from C* all empty values are reported as null. To avoid inserting nulls (avoiding tombstones) see https://github.com/datastax/spark-cassandra-connector/blob/master/doc/5_saving.md#globally-treating-all-nulls-as-unset This
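
A sketch of the setting referenced in that doc (keyspace/table are placeholders; confirm the exact property name for your connector version in the linked page; assumes an existing SparkSession named spark and a DataFrame df):

    // Treat every null in the DataFrame as "unset" so no tombstones are written.
    spark.conf.set("spark.cassandra.output.ignoreNulls", "true")

    df.write
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "my_ks", "table" -> "my_table"))
      .mode("append")
      .save()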

Re: Is "spark streaming" streaming or mini-batch?

2016-08-23 Thread Russell Spitzer
Spark streaming does not process 1 event at a time, which is in general what I think people call "streaming." It instead processes groups of events. Each group is a "MicroBatch" that gets processed at the same time. Streaming theoretically always has better latency because the event is processed