OData compliant API for Spark

2018-12-04 Thread Affan Syed
All,

We have been thinking about exposing our analytics platform as an OData
server (for ease of integration with 3rd-party BI tools like Tableau,
etc.) -- so Livy is not in the picture right now.

Has there been any effort in this regard? Is there any interest, or has
there been any discussion that someone can point towards?

We want to expose this connection over an API, so the
JDBC -> Thrift Server -> Spark route is not being considered right now.
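
To make the idea concrete, here is a very rough sketch (purely illustrative,
not an existing project) of what we mean: a thin HTTP endpoint in front of
Spark SQL that answers OData-style entity-set requests as JSON, so BI tools
can pull data without going through Livy or the Thrift server. Only the $top
query option is handled; the metadata document, $filter, paging, auth, etc.
are omitted, and the table name, paths and port are placeholders:

import com.sun.net.httpserver.{HttpExchange, HttpHandler, HttpServer}
import java.net.InetSocketAddress
import org.apache.spark.sql.SparkSession

object ODataSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("odata-sketch").getOrCreate()
    // Placeholder dataset registered as the entity set we want to expose.
    spark.read.parquet("hdfs:///warehouse/sales").createOrReplaceTempView("sales")

    val server = HttpServer.create(new InetSocketAddress(8080), 0)
    server.createContext("/odata/Sales", new HttpHandler {
      override def handle(ex: HttpExchange): Unit = {
        // Honour only the OData $top system query option, defaulting to 100 rows.
        val query = Option(ex.getRequestURI.getQuery).getOrElse("")
        val top = query.split("&").collectFirst {
          case kv if kv.startsWith("$top=") => kv.stripPrefix("$top=").toInt
        }.getOrElse(100)

        // Run the query on Spark and wrap the rows in a minimal OData JSON envelope.
        val rows = spark.sql(s"SELECT * FROM sales LIMIT $top").toJSON.collect()
        val body = s"""{"value":[${rows.mkString(",")}]}"""
        ex.getResponseHeaders.add("Content-Type", "application/json")
        ex.sendResponseHeaders(200, body.getBytes("UTF-8").length)
        ex.getResponseBody.write(body.getBytes("UTF-8"))
        ex.getResponseBody.close()
      }
    })
    server.start()
  }
}

In practice we would probably build on something like Apache Olingo rather
than hand-rolling the protocol, but this captures the shape of what we are
after.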


- Affan


[ANNOUNCE] Apache Bahir 2.3.1 Released

2018-12-04 Thread Luciano Resende
Apache Bahir provides extensions to multiple distributed analytic
platforms, extending their reach with a diversity of streaming
connectors and SQL data sources.
The Apache Bahir community is pleased to announce the release of
Apache Bahir 2.3.1 which provides the following extensions for Apache
Spark 2.3.1:

   - Apache CouchDB/Cloudant SQL data source
   - Apache CouchDB/Cloudant Streaming connector
   - Akka Streaming connector
   - Akka Structured Streaming data source
   - Google Cloud Pub/Sub Streaming connector
   - Cloud PubNub Streaming connector (new)
   - MQTT Streaming connector
   - MQTT Structured Streaming data source (new sink)
   - Twitter Streaming connector
   - ZeroMQ Streaming connector (new enhanced implementation)

For more information about Apache Bahir and to download the
latest release go to:

http://bahir.apache.org

For more details on how to use Apache Bahir extensions in your
application please visit our documentation page

   http://bahir.apache.org/docs/spark/overview/
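
As a quick, illustrative sketch (not part of the release notes -- the
artifact coordinates, broker URL and topic below are placeholders, and the
exact provider class and options should be checked against the
documentation page above), the MQTT Structured Streaming data source can be
wired into a Spark 2.3.x application roughly like this:

// Submit with the Bahir artifact on the classpath, for example:
//   spark-shell --packages org.apache.bahir:spark-sql-streaming-mqtt_2.11:2.3.1

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("bahir-mqtt-example").getOrCreate()

// Read an MQTT topic as a streaming DataFrame (broker URL and topic are placeholders).
val lines = spark.readStream
  .format("org.apache.bahir.sql.streaming.mqtt.MQTTStreamSourceProvider")
  .option("topic", "sensors/temperature")
  .load("tcp://localhost:1883")

// Echo incoming payloads to the console until the query is stopped.
lines.writeStream
  .format("console")
  .start()
  .awaitTermination()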

The Apache Bahir PMC

-- 
Luciano Resende
http://twitter.com/lresende1975
http://lresende.blogspot.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



[ANNOUNCE] Apache Bahir 2.3.2 Released

2018-12-04 Thread Luciano Resende
Apache Bahir provides extensions to multiple distributed analytic
platforms, extending their reach with a diversity of streaming
connectors and SQL data sources.
The Apache Bahir community is pleased to announce the release of
Apache Bahir 2.3.2 which provides the following extensions for Apache
Spark 2.3.2:

   - Apache CouchDB/Cloudant SQL data source
   - Apache CouchDB/Cloudant Streaming connector
   - Akka Streaming connector
   - Akka Structured Streaming data source
   - Google Cloud Pub/Sub Streaming connector
   - Cloud PubNub Streaming connector (new)
   - MQTT Streaming connector
   - MQTT Structured Streaming data source (new sink)
   - Twitter Streaming connector
   - ZeroMQ Streaming connector (new enhanced implementation)

For more information about Apache Bahir and to download the
latest release go to:

http://bahir.apache.org

For more details on how to use Apache Bahir extensions in your
application please visit our documentation page

   http://bahir.apache.org/docs/spark/overview/

The Apache Bahir PMC

-- 
Luciano Resende
http://twitter.com/lresende1975
http://lresende.blogspot.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



[ANNOUNCE] Apache Bahir 2.3.0 Released

2018-12-04 Thread Luciano Resende
Apache Bahir provides extensions to multiple distributed analytic
platforms, extending their reach with a diversity of streaming
connectors and SQL data sources.
The Apache Bahir community is pleased to announce the release of
Apache Bahir 2.3.0 which provides the following extensions for Apache
Spark 2.3.0:

   - Apache CouchDB/Cloudant SQL data source
   - Apache CouchDB/Cloudant Streaming connector
   - Akka Streaming connector
   - Akka Structured Streaming data source
   - Google Cloud Pub/Sub Streaming connector
   - Cloud PubNub Streaming connector (new)
   - MQTT Streaming connector
   - MQTT Structured Streaming data source (new sink)
   - Twitter Streaming connector
   - ZeroMQ Streaming connector (new enhanced implementation)

For more information about Apache Bahir and to download the
latest release go to:

http://bahir.apache.org

For more details on how to use Apache Bahir extensions in your
application please visit our documentation page

   http://bahir.apache.org/docs/spark/overview/

The Apache Bahir PMC

-- 
Luciano Resende
http://twitter.com/lresende1975
http://lresende.blogspot.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



[Spark Structured Streaming] Dynamically changing maxOffsetsPerTrigger

2018-12-04 Thread subramgr
Is there a way to dynamically change the value of *maxOffsetsPerTrigger* ?
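
For context, here is a sketch of where that option normally lives in my
setup (Kafka source assumed; broker and topic names are placeholders):

// maxOffsetsPerTrigger is passed as a source option when the query is defined.
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "events")
  .option("maxOffsetsPerTrigger", "10000")  // cap on records consumed per micro-batch
  .load()

As far as I know the value is only read when the query starts, so changing
it would mean stopping the query and restarting it with a new value -- happy
to be corrected if there is a better way.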





--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Recommended Node Usage

2018-12-04 Thread Hans Fischer
Dear Spark-Community,

is it recommended (and why) to load a hardware node at 100% (placing one or
more vcores onto every hardware core), instead of having Spark use the node
at about 95% system load in order to gain better system stability?

Thank you very much for your contribution,

Hans



Unsubscribe

2018-12-04 Thread GmailLiang
Unsubscribe 

Sent from Tianchu(Alex) iPhone

On Dec 4, 2018, at 00:00, Nirmal Manoharan  wrote:

I am trying to deduplicate streaming data using the dropDuplicates function 
with a watermark. The problem I am currently facing is that I have two 
timestamps for a given record:
1. The event timestamp - the timestamp of the record's creation at the source.
2. A transfer timestamp - the timestamp from an intermediate process that is 
responsible for streaming the data.
The duplicates are introduced during the intermediate stage, so for a given 
record and its duplicate the event timestamp is the same but the transfer 
timestamp is different.

For the watermark, I would like to use the transferTimestamp because I know 
the duplicates can't occur more than 3 minutes apart in transfer. But I can't 
use it within dropDuplicates, because it won't catch the duplicates: the 
duplicates have different transfer timestamps.

Here is an example:
Event 1: {"EventString": "example1", "Eventtimestamp": "2018-11-29T10:00:00.00", "TransferTimestamp": "2018-11-29T10:05:00.00"}
Event 2 (duplicate): {"EventString": "example1", "Eventtimestamp": "2018-11-29T10:00:00.00", "TransferTimestamp": "2018-11-29T10:08:00.00"}

In this case, the duplicate was created during transfer, 3 minutes after the 
original event.

My code is like below:

streamDataset
  .withWatermark("transferTimestamp", "4 minutes")
  .dropDuplicates("eventstring", "transferTimestamp");

The above code won't drop the duplicates, as transferTimestamp is unique for 
the event and its duplicate. But currently this is the only way, as Spark 
forces me to include the watermark column in the dropDuplicates call.

I would really like to see a dropDuplicates implementation like the one below, 
which would be a valid case for any at-least-once stream, where I don't have 
to use the watermark field in dropDuplicates and the watermark-based state 
eviction is still honored:

streamDataset
  .withWatermark("transferTimestamp", "4 minutes")
  .dropDuplicates("eventstring");

If anyone has an alternate solution for this, please let me know. I can't use 
the eventtimestamp, as it is not ordered and its time range varies drastically 
(delayed events and junk events).
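
One direction I have been considering (a rough, untested sketch only -- the
case class and field names are assumptions based on the example above, and
it assumes import spark.implicits._ is in scope) is to drop down from
dropDuplicates to explicit keyed state with flatMapGroupsWithState, so that
the key is only eventstring while state eviction is still driven by the
transferTimestamp watermark:

import org.apache.spark.sql.streaming.{GroupStateTimeout, OutputMode}

case class Event(eventstring: String,
                 eventtimestamp: java.sql.Timestamp,
                 transferTimestamp: java.sql.Timestamp)

val deduped = streamDataset.as[Event]
  .withWatermark("transferTimestamp", "4 minutes")
  .groupByKey(_.eventstring)
  .flatMapGroupsWithState[Boolean, Event](
      OutputMode.Append, GroupStateTimeout.EventTimeTimeout) {
    (key, events, state) =>
      if (state.hasTimedOut) {
        // The watermark has passed this key's timeout: forget it so state doesn't grow forever.
        state.remove()
        Iterator.empty
      } else if (state.exists) {
        // Key already seen within the window: everything arriving now is a duplicate.
        Iterator.empty
      } else {
        // First occurrence of this eventstring: remember it and emit a single record.
        val first = events.next()
        state.update(true)
        state.setTimeoutTimestamp(first.transferTimestamp.getTime + 4 * 60 * 1000)
        Iterator(first)
      }
  }

I haven't validated the state-size or late-data behavior of this, so I would
still prefer a first-class dropDuplicates variant.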

Thanks in advance
-Nirmal


Re: Job hangs in blocked task in final parquet write stage

2018-12-04 Thread Conrad Lee
Yeah, probably increasing the memory or increasing the number of output
partitions would help. However, increasing the memory available to each
executor would add expense, and I want to keep the number of partitions low
so that each parquet file comes out at around 128 MB, which is best practice
for long-term storage and for use with other systems like Presto.

This feels like a bug, given the flaky nature of the failure -- also, usually
when memory gets too low the executor is killed or errors out and I get one
of the typical Spark OOM error codes. When I run the same job with the same
resources, sometimes the job succeeds and sometimes it fails.
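
For reference, the final stage is shaped roughly like the sketch below
(column name and output path are placeholders); the point is that the reduce
side of the sort, and therefore the number and size of the output parquet
files, is governed by the shuffle partition setting:

// 96 reduce tasks for the sort => 96 output files, sized to land near the ~128 MB target.
spark.conf.set("spark.sql.shuffle.partitions", "96")

df.sort("sort_key")                  // wide shuffle; its reduce side is the stage that hangs
  .write
  .mode("overwrite")
  .parquet("hdfs:///output/table")   // placeholder HDFS output path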

On Mon, Dec 3, 2018 at 5:19 PM Christopher Petrino <
christopher.petr...@gmail.com> wrote:

> Depending on the size of your data set and how many resources you have
> (num-executors, executor instances, number of nodes), I'm inclined to
> suspect that the issue is related to the reduction of partitions from
> thousands to 96; I could be misguided, but given the details I have, I
> would consider testing an approach to understand the behavior when the
> final stage operates at a different number of partitions.
>
> On Mon, Dec 3, 2018 at 2:48 AM Conrad Lee  wrote:
>
>> Thanks for the thoughts.  While the beginning of the job deals with lots
>> of files in the first stage, they're first coalesced down into just a few
>> thousand partitions.  The part of the job that's failing is the reduce-side
>> of a dataframe.sort() that writes output to HDFS.  This last stage has only
>> 96 tasks and the partitions are well balanced.  I'm not using a
>> `partitionBy` option on the dataframe writer.
>>
>> On Fri, Nov 30, 2018 at 8:14 PM Christopher Petrino <
>> christopher.petr...@gmail.com> wrote:
>>
>>> The reason I ask is that I've had some unreliability caused by
>>> over-stressing the HDFS. Do you know the number of partitions when these
>>> actions are being performed? E.g. if you have 1,000,000 files being read,
>>> you may have 1,000,000 partitions, which may cause HDFS stress.
>>> Alternatively, if you have 1 large file, say 100 GB, you may have 1
>>> partition, which would not fit in memory and may cause writes to disk. I
>>> imagine it may be flaky because you are doing some action like a groupBy
>>> somewhere, and depending on how the data was read, certain groups will be
>>> in certain partitions; I'm not sure if reads on files are deterministic,
>>> I suspect they are not.
>>>
>>> On Fri, Nov 30, 2018 at 2:08 PM Conrad Lee  wrote:
>>>
 I'm loading the data using the dataframe reader from parquet files
 stored on local HDFS.  The stage of the job that fails is not the stage
 that does this.  The stage of the job that fails is one that reads a sorted
 dataframe from the last shuffle and performs the final write to parquet on
 local HDFS.

 On Fri, Nov 30, 2018 at 4:02 PM Christopher Petrino <
 christopher.petr...@gmail.com> wrote:

> How are you loading the data?
>
> On Fri, Nov 30, 2018 at 2:26 AM Conrad Lee  wrote:
>
>> Thanks for the suggestions.  Here's an update that responds to some
>> of the suggestions/ideas in-line:
>>
>> I ran into problems using 5.19 so I referred to 5.17 and it resolved
>>> my issues.
>>
>>
>> I tried EMR 5.17.0 and the problem still sometimes occurs.
>>
>>  try running a coalesce. Your data may have grown and is defaulting
>>> to a number of partitions that is causing unnecessary overhead
>>>
>> Well I don't think it's that because this problem occurs flakily.
>> That is, if the job hangs I can kill it and re-run it and it works fine 
>> (on
>> the same hardware and with the same memory settings).  I'm not getting 
>> any
>> OOM errors.
>>
>> On a related note: the job is spilling to disk. I see messages like
>> this:
>>
>> 18/11/29 21:40:06 INFO UnsafeExternalSorter: Thread 156 spilling sort
>>> data of 912.0 MB to disk (3  times so far)
>>
>>
>>  This occurs in both successful and unsuccessful runs though.  I've
>> checked the disks of an executor that's running a hanging job and its
>> disks have plenty of space, so it doesn't seem to be an out of disk space
>> issue. This also doesn't seem to be where it hangs--the logs move on and
>> describe the parquet commit.
>>
>> On Thu, Nov 29, 2018 at 4:06 PM Christopher Petrino <
>> christopher.petr...@gmail.com> wrote:
>>
>>> If not, try running a coalesce. Your data may have grown and is
>>> defaulting to a number of partitions that is causing unnecessary overhead
>>>
>>> On Thu, Nov 29, 2018 at 3:02 AM Conrad Lee 
>>> wrote:
>>>
 Thanks, I'll try using 5.17.0.

 For anyone trying to debug this problem in the future: In other
 jobs that hang in the same manner, the thread dump didn't have any 
 blocked
 threads, so that might be a red herring.

unsubscribe

2018-12-04 Thread Junior Alvarez