OData compliant API for Spark
All, We have been thinking about exposing our analytics platform through an OData server (for its ease of compliance with third-party BI tools like Tableau, etc.), so Livy is not in the picture right now. Has there been any effort in this regard? Is there any interest, or has there been any discussion someone can point me towards? We want to expose this connection over an API, so the JDBC -> Thrift server -> Spark route is not being considered right now. - Affan
[ANNOUNCE] Apache Bahir 2.3.1 Released
Apache Bahir provides extensions to multiple distributed analytic platforms, extending their reach with a diversity of streaming connectors and SQL data sources. The Apache Bahir community is pleased to announce the release of Apache Bahir 2.3.1, which provides the following extensions for Apache Spark 2.3.1:

- Apache CouchDB/Cloudant SQL data source
- Apache CouchDB/Cloudant Streaming connector
- Akka Streaming connector
- Akka Structured Streaming data source
- Google Cloud Pub/Sub Streaming connector
- Cloud PubNub Streaming connector (new)
- MQTT Streaming connector
- MQTT Structured Streaming data source (new sink)
- Twitter Streaming connector
- ZeroMQ Streaming connector (new enhanced implementation)

For more information about Apache Bahir and to download the latest release, go to: http://bahir.apache.org
For more details on how to use Apache Bahir extensions in your application, please visit our documentation page: http://bahir.apache.org/docs/spark/overview/

The Apache Bahir PMC

-- Luciano Resende
http://twitter.com/lresende1975
http://lresende.blogspot.com/

- To unsubscribe e-mail: user-unsubscr...@spark.apache.org
[ANNOUNCE] Apache Bahir 2.3.2 Released
Apache Bahir provides extensions to multiple distributed analytic platforms, extending their reach with a diversity of streaming connectors and SQL data sources. The Apache Bahir community is pleased to announce the release of Apache Bahir 2.3.2, which provides the following extensions for Apache Spark 2.3.2:

- Apache CouchDB/Cloudant SQL data source
- Apache CouchDB/Cloudant Streaming connector
- Akka Streaming connector
- Akka Structured Streaming data source
- Google Cloud Pub/Sub Streaming connector
- Cloud PubNub Streaming connector (new)
- MQTT Streaming connector
- MQTT Structured Streaming data source (new sink)
- Twitter Streaming connector
- ZeroMQ Streaming connector (new enhanced implementation)

For more information about Apache Bahir and to download the latest release, go to: http://bahir.apache.org
For more details on how to use Apache Bahir extensions in your application, please visit our documentation page: http://bahir.apache.org/docs/spark/overview/

The Apache Bahir PMC

-- Luciano Resende
http://twitter.com/lresende1975
http://lresende.blogspot.com/
[ANNOUNCE] Apache Bahir 2.3.0 Released
Apache Bahir provides extensions to multiple distributed analytic platforms, extending their reach with a diversity of streaming connectors and SQL data sources. The Apache Bahir community is pleased to announce the release of Apache Bahir 2.3.0, which provides the following extensions for Apache Spark 2.3.0:

- Apache CouchDB/Cloudant SQL data source
- Apache CouchDB/Cloudant Streaming connector
- Akka Streaming connector
- Akka Structured Streaming data source
- Google Cloud Pub/Sub Streaming connector
- Cloud PubNub Streaming connector (new)
- MQTT Streaming connector
- MQTT Structured Streaming data source (new sink)
- Twitter Streaming connector
- ZeroMQ Streaming connector (new enhanced implementation)

For more information about Apache Bahir and to download the latest release, go to: http://bahir.apache.org
For more details on how to use Apache Bahir extensions in your application, please visit our documentation page: http://bahir.apache.org/docs/spark/overview/

The Apache Bahir PMC

-- Luciano Resende
http://twitter.com/lresende1975
http://lresende.blogspot.com/
[Spark Structured Streaming] Dynamically changing maxOffsetsPerTrigger
Is there a way to dynamically change the value of *maxOffsetsPerTrigger*?
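As far as I know, maxOffsetsPerTrigger is read when the query starts and cannot be changed while the query is running; the usual workaround is to stop the query and restart it from the same checkpoint with a new value. A minimal pure-Python sketch of the rate decision such an outer control loop might make (the function name and catch-up heuristic are hypothetical, not a Spark API):

```python
def choose_max_offsets(backlog, steady_rate, catchup_factor=4):
    """Pick a per-trigger offset cap before (re)starting the query.

    backlog: offsets not yet processed (e.g. from consumer-lag metrics)
    steady_rate: the cap you want under normal load
    catchup_factor: hypothetical multiplier applied while a backlog exists
    """
    if backlog > steady_rate:
        # Behind: allow bigger batches until the backlog drains.
        return steady_rate * catchup_factor
    return steady_rate

# The chosen value would then be supplied as the maxOffsetsPerTrigger option
# when restarting the streaming query from its checkpoint.
```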
Recommended Node Usage
Dear Spark community, is it recommended (and why) to load a hardware node at 100% (placing one or more vcores on every hardware core), instead of having Spark use the node at about 95% system load to gain better system stability? Thank you very much for your contribution, Hans
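A common rule of thumb behind leaving headroom is to reserve a core (and some memory) per node for the OS and Hadoop/YARN daemons rather than allocating 100%. A small sketch of that sizing arithmetic (the helper function and the 5-cores-per-executor default are illustrative assumptions, not Spark settings):

```python
def executor_plan(cores_per_node, mem_gb_per_node,
                  cores_per_executor=5, reserved_cores=1, reserved_mem_gb=1):
    """Size executors while leaving headroom for OS/daemon processes."""
    usable_cores = cores_per_node - reserved_cores
    executors_per_node = usable_cores // cores_per_executor
    mem_per_executor_gb = (mem_gb_per_node - reserved_mem_gb) // executors_per_node
    return executors_per_node, mem_per_executor_gb

# e.g. a 16-core, 64 GB node: 15 usable cores -> 3 executors of 5 cores each,
# with 21 GB per executor.
```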
Unsubscribe
Unsubscribe
Sent from Tianchu (Alex) iPhone

On Dec 4, 2018, at 00:00, Nirmal Manoharan wrote:

I am trying to deduplicate streaming data using the dropDuplicates function with a watermark. The problem I am currently facing is that I have two timestamps for a given record:

1. The event timestamp - the timestamp of the record's creation at the source.
2. A transfer timestamp - the timestamp from an intermediate process that is responsible for streaming the data.

The duplicates are introduced during the intermediate stage, so for a given record's duplicate, the event timestamp is the same but the transfer timestamp is different. For the watermark, I'd like to use the transfer timestamp, because I know duplicates can't occur more than 3 minutes apart in transfer. But I can't use it within dropDuplicates, because it won't capture the duplicates, as the duplicates have different transfer timestamps. Here is an example:

Event 1:
{ "EventString": "example1", "Eventtimestamp": "2018-11-29T10:00:00.00", "TransferTimestamp": "2018-11-29T10:05:00.00" }

Event 2 (duplicate):
{ "EventString": "example1", "Eventtimestamp": "2018-11-29T10:00:00.00", "TransferTimestamp": "2018-11-29T10:08:00.00" }

In this case, the duplicate was created during transfer, 3 minutes after the original event. My code is like this:

  streamDataset
    .withWatermark("transferTimestamp", "4 minutes")
    .dropDuplicates("eventstring", "transferTimestamp");

The above code won't drop the duplicates, as transferTimestamp is unique for the event and its duplicate. But currently this is the only way, as Spark forces me to include the watermark column in the dropDuplicates call. I would really like to see a dropDuplicates implementation like the one below, which would be a valid case for any at-least-once stream, where I don't have to use the watermark field in dropDuplicates and the watermark-based state eviction is still honored:

  streamDataset
    .withWatermark("transferTimestamp", "4 minutes")
    .dropDuplicates("eventstring");

If anyone has an alternate solution for this, please let me know. I can't use the event timestamp, as it is not ordered and its time range varies drastically (delayed events and junk events). Thanks in advance.
-Nirmal
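Spark's dropDuplicates does not support this today, but the behaviour described above can be modelled, and in Spark it could likely be implemented with (flat)MapGroupsWithState and a state timeout. A pure-Python sketch of the desired semantics, deduplicating on EventString only while evicting state by TransferTimestamp (the function and its field handling are illustrative assumptions, not Spark code):

```python
from datetime import datetime, timedelta

def dedup_with_watermark(events, delay=timedelta(minutes=4)):
    """Keep the first record per EventString; forget keys whose last-seen
    TransferTimestamp falls behind the watermark (max transfer time - delay)."""
    seen = {}              # EventString -> TransferTimestamp when first seen
    max_ts = datetime.min  # running maximum transfer timestamp
    kept = []
    for event in events:
        ts = datetime.fromisoformat(event["TransferTimestamp"])
        max_ts = max(max_ts, ts)
        watermark = max_ts - delay
        # State eviction: drop keys older than the watermark.
        seen = {k: v for k, v in seen.items() if v >= watermark}
        key = event["EventString"]
        if key not in seen:  # first time we see this key -> keep the record
            seen[key] = ts
            kept.append(event)
    return kept
```

With the two example events above (transfer timestamps 3 minutes apart and a 4-minute delay), the duplicate is dropped; a duplicate arriving after the watermark has passed would slip through because its state was already evicted, which is exactly the trade-off the watermark encodes.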
Re: Job hangs in blocked task in final parquet write stage
Yeah, probably increasing the memory or increasing the number of output partitions would help. However, increasing the memory available to each executor would add expense, and I want to keep the number of partitions low so that each parquet file turns out to be around 128 MB, which is best practice for long-term storage and use with other systems like Presto.

This feels like a bug due to the flaky nature of the failure. Also, usually when memory gets too low the executor is killed or errors out and I get one of the typical Spark OOM error codes. When I run the same job with the same resources, sometimes it succeeds and sometimes it fails.

On Mon, Dec 3, 2018 at 5:19 PM Christopher Petrino <christopher.petr...@gmail.com> wrote:
> Depending on the size of your data set and how many resources you have (num-executors, executor instances, number of nodes), I'm inclined to suspect the issue is related to the reduction of partitions from thousands to 96. I could be misguided, but given the details I have, I would consider testing whether the final stage behaves differently at a different number of partitions.

On Mon, Dec 3, 2018 at 2:48 AM Conrad Lee wrote:
> Thanks for the thoughts. While the beginning of the job deals with lots of files in the first stage, they're first coalesced down into just a few thousand partitions. The part of the job that's failing is the reduce side of a dataframe.sort() that writes output to HDFS. This last stage has only 96 tasks and the partitions are well balanced. I'm not using a `partitionBy` option on the dataframe writer.

On Fri, Nov 30, 2018 at 8:14 PM Christopher Petrino <christopher.petr...@gmail.com> wrote:
> The reason I ask is because I've had some unreliability caused by over-stressing HDFS. Do you know the number of partitions when these actions are run? E.g., if you have 1,000,000 files being read, you may have 1,000,000 partitions, which may cause HDFS stress. Alternatively, if you have one large file, say 100 GB, you may have one partition, which would not fit in memory and may cause writes to disk. I imagine it may be flaky because you are doing some action like a groupBy somewhere, and depending on how the data was read, certain groups will be in certain partitions; I'm not sure whether reads on files are deterministic, and I suspect they are not.

On Fri, Nov 30, 2018 at 2:08 PM Conrad Lee wrote:
> I'm loading the data using the dataframe reader from parquet files stored on local HDFS. The stage of the job that fails is not the stage that does this. The stage that fails is one that reads a sorted dataframe from the last shuffle and performs the final write to parquet on local HDFS.

On Fri, Nov 30, 2018 at 4:02 PM Christopher Petrino <christopher.petr...@gmail.com> wrote:
> How are you loading the data?

On Fri, Nov 30, 2018 at 2:26 AM Conrad Lee wrote:
> Thanks for the suggestions. Here's an update that responds to some of the suggestions/ideas in-line:
>
> "I ran into problems using 5.19 so I referred to 5.17 and it resolved my issues."
> I tried EMR 5.17.0 and the problem still sometimes occurs.
>
> "Try running a coalesce. Your data may have grown and is defaulting to a number of partitions that causes unnecessary overhead."
> Well, I don't think it's that, because this problem occurs flakily. That is, if the job hangs I can kill it and re-run it and it works fine (on the same hardware and with the same memory settings). I'm not getting any OOM errors.
>
> On a related note: the job is spilling to disk. I see messages like this:
>
>   18/11/29 21:40:06 INFO UnsafeExternalSorter: Thread 156 spilling sort data of 912.0 MB to disk (3 times so far)
>
> This occurs in both successful and unsuccessful runs, though. I've checked the disks of an executor that's running a hanging job and its disks have plenty of space, so it doesn't seem to be an out-of-disk-space issue. This also doesn't seem to be where it hangs; the logs move on and describe the parquet commit.

On Thu, Nov 29, 2018 at 4:06 PM Christopher Petrino <christopher.petr...@gmail.com> wrote:
> If not, try running a coalesce. Your data may have grown and is defaulting to a number of partitions that causes unnecessary overhead.

On Thu, Nov 29, 2018 at 3:02 AM Conrad Lee wrote:
> Thanks, I'll try using 5.17.0. For anyone trying to debug this problem in the future: in other jobs that hang in the same manner, the thread dump didn't have any blocked threads, so that might be a red herring.
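The 96-task constraint in this thread comes from the 128 MB file-size target, and the arithmetic for picking a partition count from that target can be sketched as follows (the helper and the compression-ratio knob are illustrative assumptions; as an alternative, Spark's spark.sql.files.maxRecordsPerFile option can bound file sizes without lowering write parallelism):

```python
import math

def partitions_for_target_file_size(total_bytes, target_file_mb=128,
                                    compression_ratio=1.0):
    """Rough partition count so each output file lands near target_file_mb.

    compression_ratio: hypothetical on-disk/in-memory size ratio; tune per
    dataset, since parquet encoding and compression change the output size.
    """
    target_bytes = target_file_mb * 1024 * 1024
    return max(1, math.ceil(total_bytes * compression_ratio / target_bytes))

# e.g. ~12 GiB of output at ~128 MB per file -> 96 partitions, matching the
# 96 tasks in the failing final stage.
```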