Re: How to programmatically pause and resume Spark/Kafka structured streaming?

2019-08-06 Thread Gourav Sengupta
Hi
There is a way to run a streaming query only once in Spark; I use it for
reading files via streaming. Maybe you can try that.
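
If it helps, a rough sketch of what I mean, using the run-once trigger in
Structured Streaming (the schema and paths below are just placeholders):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types.{StringType, StructType}

val spark = SparkSession.builder().appName("run-once-sketch").getOrCreate()

// File streams need an explicit schema; this one is purely illustrative.
val schema = new StructType().add("value", StringType)

val df = spark.readStream
  .schema(schema)
  .json("/data/incoming")                      // placeholder input directory

val query = df.writeStream
  .format("parquet")
  .option("path", "/data/out")                 // placeholder output path
  .option("checkpointLocation", "/data/chk")   // placeholder checkpoint path
  .trigger(Trigger.Once())                     // process available data once, then stop
  .start()

query.awaitTermination()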
Regards,
Gourav

On Tue, 6 Aug 2019, 21:50 kant kodali,  wrote:

> If I stop and start while processing a batch, what will happen? Will that
> batch get canceled and reprocessed again when I start it? Does that mean I
> need to worry about duplicates downstream? Kafka consumers have pause and
> resume and they work just fine, so I am not sure why Spark doesn't expose
> that.
>
>
> On Mon, Aug 5, 2019 at 10:54 PM Gourav Sengupta 
> wrote:
>
>> Hi,
>>
>> Exactly my question; I was also looking for ways to gracefully exit Spark
>> Structured Streaming.
>>
>>
>> Regards,
>> Gourav
>>
>> On Tue, Aug 6, 2019 at 3:43 AM kant kodali  wrote:
>>
>>> Hi All,
>>>
>>> I am trying to see if there is a way to pause a Spark stream that
>>> processes data from Kafka, so that my application can take some actions
>>> while the stream is paused and resume when the application completes
>>> those actions.
>>>
>>> Thanks!
>>>
>>


unsubscribe

2019-08-06 Thread Information Technologies
unsubscribe




Re: Announcing Delta Lake 0.3.0

2019-08-06 Thread Nicolas Paris
>   • Scala/Java APIs for DML commands - You can now modify data in Delta Lake
> tables using programmatic APIs for Delete, Update and Merge. These APIs
> mirror the syntax and semantics of their corresponding SQL commands and are
> great for many workloads, e.g., Slowly Changing Dimension (SCD) operations,
> merging change data for replication, and upserts from streaming queries.
> See the documentation for more details.

just tested the merge feature on a large table: awesome
- fast to build
- fast to query afterward
- robust (version history is an incredible feature)
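
For anyone who has not tried it yet, the call is roughly of this shape (a
rough sketch based on the 0.3.0 docs, not my actual code; the paths, the join
key and the spark session are placeholders):

import io.delta.tables.DeltaTable

// assumes an active SparkSession named spark (e.g. in spark-shell)
val target  = DeltaTable.forPath(spark, "/delta/events")        // placeholder path
val updates = spark.read.parquet("/staging/events_updates")     // placeholder source

target.as("t")
  .merge(updates.as("u"), "t.eventId = u.eventId")              // placeholder join key
  .whenMatched().updateAll()     // rewrite matching rows with the new values
  .whenNotMatched().insertAll()  // append rows that do not exist yet
  .execute()

// the version history mentioned above is exposed the same way
target.history(10).show(false)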


thanks


On Thu, Aug 01, 2019 at 06:44:30PM -0700, Tathagata Das wrote:
> Hello everyone, 
> 
> We are excited to announce the availability of Delta Lake 0.3.0 which
> introduces new programmatic APIs for manipulating and managing data in Delta
> Lake tables.
> 
> 
> Here are the main features: 
> 
> 
>   • Scala/Java APIs for DML commands - You can now modify data in Delta Lake
> tables using programmatic APIs for Delete, Update and Merge. These APIs
> mirror the syntax and semantics of their corresponding SQL commands and are
> great for many workloads, e.g., Slowly Changing Dimension (SCD) operations,
> merging change data for replication, and upserts from streaming queries.
> See the documentation for more details.
> 
> 
>   • Scala/Java APIs for query commit history - You can now query a table’s
> commit history to see what operations modified the table. This enables you
> to audit data changes, time travel queries on specific versions, debug and
> recover data from accidental deletions, etc. See the documentation for more
> details.
> 
> 
>   • Scala/Java APIs for vacuuming old files - Delta Lake uses MVCC to enable
> snapshot isolation and time travel. However, keeping all versions of a
> table forever can be prohibitively expensive. Stale snapshots (as well as
> other uncommitted files from aborted transactions) can be garbage collected
> by vacuuming the table. See the documentation for more details.
> 
> 
> To try out Delta Lake 0.3.0, please follow the Delta Lake Quickstart:
> https://docs.delta.io/0.3.0/quick-start.html
> 
> To view the release notes:
> https://github.com/delta-io/delta/releases/tag/v0.3.0
> 
> We would like to thank all the community members for contributing to this
> release.
> 
> TD

-- 
nicolas




How to read configuration file parameters in Spark without mapping each parameter

2019-08-06 Thread Mich Talebzadeh
Hi,

Assume that I have a configuration file as below, with static parameters
(some Strings, Integers and Doubles):

md_AerospikeAerospike {
dbHost = "rhes75"
dbPort = "3000"
dbConnection = "trading_user_RW"
namespace = "trading"
dbSetRead = "MARKETDATAAEROSPIKEBATCH"
dbSetWrite = "MARKETDATAAEROSPIKESPEED"
dbPassword = "aerospike"
bootstrapServers = "rhes75:9092, rhes75:9093, rhes75:9094, rhes564:9092,
rhes564:9093, rhes564:9094, rhes76:9092, rhes76:9093, rhes76:9094"
schemaRegistryURL = "http://rhes75:8081"
zookeeperConnect = "rhes75:2181, rhes564:2181, rhes76:2181"
zookeeperConnectionTimeoutMs = "1"
rebalanceBackoffMS = "15000"
zookeeperSessionTimeOutMs = "15000"
autoCommitIntervalMS = "12000"
topics = "md"
memorySet = "F"
enableHiveSupport = "true"
sparkStreamingReceiverMaxRate = "0"
checkpointdir = "/checkpoint"
hbaseHost = "rhes75"
zookeeperHost = "rhes75"
zooKeeperClientPort = "2181"
batchInterval = 2
tickerWatch = "VOD"
priceWatch = 300.0
op_type = 1
}

Now in Scala code I do the following to read these parameters in:

  val dbHost = conf.getString("dbHost")
  val dbPort = conf.getString("dbPort")
  val dbConnection = conf.getString("dbConnection")
  val namespace = conf.getString("namespace")
  val dbSetRead = conf.getString("dbSetRead")
  val dbSetWrite = conf.getString("dbSetWrite")
  val dbPassword = conf.getString("dbPassword")
  val bootstrapServers = conf.getString("bootstrapServers")
  val schemaRegistryURL = conf.getString("schemaRegistryURL")
  val zookeeperConnect = conf.getString("zookeeperConnect")
  val zookeeperConnectionTimeoutMs =
conf.getString("zookeeperConnectionTimeoutMs")
  val rebalanceBackoffMS = conf.getString("rebalanceBackoffMS")
  val zookeeperSessionTimeOutMs =
conf.getString("zookeeperSessionTimeOutMs")
  val autoCommitIntervalMS = conf.getString("autoCommitIntervalMS")
  val topics =  conf.getString("topics")
  val memorySet = conf.getString("memorySet")
  val enableHiveSupport = conf.getString("enableHiveSupport")
  val sparkStreamingReceiverMaxRate =
conf.getString("sparkStreamingReceiverMaxRate")
  val checkpointdir = conf.getString("checkpointdir")
  val hbaseHost = conf.getString("hbaseHost")
  val zookeeperHost = conf.getString("zookeeperHost")
  val zooKeeperClientPort = conf.getString("zooKeeperClientPort")
  val batchInterval = conf.getInt("batchInterval")
  val tickerWatch = conf.getString("tickerWatch")
  val priceWatch= conf.getDouble("priceWatch")
  val op_type = conf.getInt("op_type")

This is obviously tedious and error-prone. Is there any way of reading these
parameters generically, through a mapping or some other approach?
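
For example, would something along these lines be a reasonable route? This is
only a rough sketch, assuming the file is Typesafe Config (HOCON) and that
conf is the same Config object scoped to the md_AerospikeAerospike block as in
the code above: flatten the block into a Map once and look values up by key.

import com.typesafe.config.Config
import scala.collection.JavaConverters._

// conf: the same Config object used above for the md_AerospikeAerospike block
val params: Map[String, Any] =
  conf.entrySet().asScala.map(e => e.getKey -> e.getValue.unwrapped()).toMap

// unwrapped() returns the plain value behind each entry (String, Integer,
// Double), so values are looked up by key and cast once where needed
val dbHost        = params("dbHost").toString
val batchInterval = params("batchInterval").asInstanceOf[Int]
val priceWatch    = params("priceWatch").asInstanceOf[Double]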

Thanks





unsubscribe

2019-08-06 Thread Peter Willis
unsubscribe





Re: How to programmatically pause and resume Spark/Kafka structured streaming?

2019-08-06 Thread kant kodali
If I stop and start while processing a batch, what will happen? Will that
batch get canceled and reprocessed again when I start it? Does that mean I
need to worry about duplicates downstream? Kafka consumers have pause and
resume and they work just fine, so I am not sure why Spark doesn't expose
that.
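
For concreteness, this is the stop/restart pattern I am asking about (a rough
sketch only; the servers, topic and paths are placeholders, and spark is an
existing SparkSession):

val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")  // placeholder
  .option("subscribe", "my_topic")                       // placeholder
  .load()

def startQuery() = df.writeStream
  .format("parquet")
  .option("path", "/data/out")                 // placeholder sink path
  .option("checkpointLocation", "/data/chk")   // offsets are committed here
  .start()

val query = startQuery()

// "pause": stop the running query, possibly mid micro-batch
query.stop()

// ... the application does its other work here ...

// "resume": start a new query against the same checkpoint location so it
// continues from the last committed offsets
val resumed = startQuery()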


On Mon, Aug 5, 2019 at 10:54 PM Gourav Sengupta 
wrote:

> Hi,
>
> Exactly my question; I was also looking for ways to gracefully exit Spark
> Structured Streaming.
>
>
> Regards,
> Gourav
>
> On Tue, Aug 6, 2019 at 3:43 AM kant kodali  wrote:
>
>> Hi All,
>>
>> I am trying to see if there is a way to pause a Spark stream that processes
>> data from Kafka, so that my application can take some actions while the
>> stream is paused and resume when the application completes those actions.
>>
>> Thanks!
>>
>


CVE-2019-10099: Apache Spark unencrypted data on local disk

2019-08-06 Thread Imran Rashid
Severity: Important

Vendor: The Apache Software Foundation

Versions affected:
All Spark 1.x, Spark 2.0.x, Spark 2.1.x, and 2.2.x versions
Spark 2.3.0 to 2.3.2


Description:
Prior to Spark 2.3.3, in certain situations Spark would write user data to
local disk unencrypted, even if spark.io.encryption.enabled=true. This
includes cached blocks that are fetched to disk (controlled by
spark.maxRemoteBlockSizeFetchToMem); in SparkR, using parallelize; in
PySpark, using broadcast and parallelize; and use of Python UDFs.


Mitigation:
Users of 1.x, 2.0.x, 2.1.x, 2.2.x, and 2.3.x should upgrade to 2.3.3 or
newer, including 2.4.x.
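
For reference, the flag mentioned in the description is the standard I/O
encryption setting; a minimal sketch of enabling it in application code
(illustrative only):

import org.apache.spark.sql.SparkSession

// encrypts data Spark spills and shuffles to local disk
val spark = SparkSession.builder()
  .appName("io-encryption-example")                // illustrative name
  .config("spark.io.encryption.enabled", "true")
  .getOrCreate()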

Credit:
This issue was reported by Thomas Graves of NVIDIA.

References:
https://spark.apache.org/security.html
https://issues.apache.org/jira/browse/SPARK-28626


Re: Hive external table not working in sparkSQL when subdirectories are present

2019-08-06 Thread Mich Talebzadeh
Which versions of Spark and Hive are you using?

What happens if you use Parquet tables instead?

HTH

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com






On Tue, 6 Aug 2019 at 07:58, Rishikesh Gawade 
wrote:

> Hi.
> I have built a Hive external table on top of a directory 'A' which has
> data stored in ORC format. This directory has several subdirectories inside
> it, each of which contains the actual ORC files.
> These subdirectories are actually created by Spark jobs which ingest data
> from other sources and write it into this directory.
> I tried creating the table and setting its table properties to
> *hive.mapred.supports.subdirectories=TRUE* and
> *mapred.input.dir.recursive=TRUE*.
> As a result of this, when I fire the simplest query, *select count(*) from
> ExtTable*, via the Hive CLI, it successfully gives me the expected count of
> records in the table.
> However, when I fire the same query via Spark SQL, I get count = 0.
>
> I think Spark SQL isn't able to descend into the subdirectories to read
> the data, while Hive is able to do so.
> Are there any configurations that need to be set on the Spark side so that
> this works as it does via the Hive CLI?
> I am using Spark on YARN.
>
> Thanks,
> Rishikesh
>
> Tags: subdirectories, subdirectory, recursive, recursion, hive external
> table, orc, sparksql, yarn
>
>


Hive external table not working in sparkSQL when subdirectories are present

2019-08-06 Thread Rishikesh Gawade
Hi.
I have built a Hive external table on top of a directory 'A' which has data
stored in ORC format. This directory has several subdirectories inside it,
each of which contains the actual ORC files.
These subdirectories are actually created by Spark jobs which ingest data
from other sources and write it into this directory.
I tried creating the table and setting its table properties to
*hive.mapred.supports.subdirectories=TRUE* and
*mapred.input.dir.recursive=TRUE*.
As a result of this, when I fire the simplest query, *select count(*) from
ExtTable*, via the Hive CLI, it successfully gives me the expected count of
records in the table.
However, when I fire the same query via Spark SQL, I get count = 0.
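
For clarity, the setup is roughly like this (a sketch only; the column names,
the location and the Scala wrapper are illustrative, not the exact DDL our
jobs create, and spark is a SparkSession with Hive support):

spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS ExtTable (id STRING, payload STRING)
  STORED AS ORC
  LOCATION '/data/A'
  TBLPROPERTIES (
    'hive.mapred.supports.subdirectories'='TRUE',
    'mapred.input.dir.recursive'='TRUE'
  )
""")

// the same count that is correct via the Hive CLI comes back as 0 here
spark.sql("SELECT COUNT(*) FROM ExtTable").show()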

I think Spark SQL isn't able to descend into the subdirectories to read the
data, while Hive is able to do so.
Are there any configurations that need to be set on the Spark side so that
this works as it does via the Hive CLI?
I am using Spark on YARN.

Thanks,
Rishikesh

Tags: subdirectories, subdirectory, recursive, recursion, hive external
table, orc, sparksql, yarn