[jira] [Commented] (BAHIR-8) Investigate whether we can use org.apache.bahir packages on Apache Spark extensions

2016-06-15 Thread Steve Loughran (JIRA)

[ https://issues.apache.org/jira/browse/BAHIR-8?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15331953#comment-15331953 ]

Steve Loughran commented on BAHIR-8:


I think this is something to raise with the Spark team: if something is needed 
for these plugins outside Spark, then either those internals need to be opened 
up, or the code stays stuck in the org.apache.spark packages. Some things which 
go deep into the code will inevitably end up having org.apache.spark names; I 
don't think we should give up quite so easily.
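
To make the restriction concrete, here is a minimal, hypothetical Scala sketch 
(not Bahir or Spark code; the package and class names are made up) of how 
private[spark] visibility plays out for an extension living under 
org.apache.bahir:

{code:scala}
// Hypothetical sketch of Scala's package-qualified visibility.
// Anything declared private[spark] is only visible to code whose own
// package sits under org.apache.spark.

package org.apache.spark.demo {                  // made-up package for illustration
  private[spark] class InternalHook {            // visible only beneath org.apache.spark
    def register(): Unit = println("registered")
  }

  object SparkSideCaller {
    def use(): Unit = new InternalHook().register()   // compiles: enclosing package is org.apache.spark
  }
}

package org.apache.bahir.demo {                  // made-up package for illustration
  object BahirSideCaller {
    // def use(): Unit = new org.apache.spark.demo.InternalHook().register()
    // does not compile: InternalHook is private[spark], so extensions living
    // under org.apache.bahir cannot reach it unless the API is opened up.
  }
}
{code}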

One thing to think of: existing entry points for other apps may need to be 
retained; that includes classes named on the CLI. Always a PITA.

> Investigate whether we can use org.apache.bahir packages on Apache Spark 
> extensions
> ---
>
> Key: BAHIR-8
> URL: https://issues.apache.org/jira/browse/BAHIR-8
> Project: Bahir
>  Issue Type: Task
>  Components: Spark Streaming Connectors
>Affects Versions: 2.0.0
>Reporter: Luciano Resende
>Assignee: Christian Kadner
>
> Experiment whether we can move the current extensions to the org.apache.bahir 
> package. We might see issues related to the private[spark] (and similar) 
> access restrictions that Scala imposes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (BAHIR-67) WebHDFS Data Source for Spark SQL

2016-11-18 Thread Steve Loughran (JIRA)

[ https://issues.apache.org/jira/browse/BAHIR-67?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15676859#comment-15676859 ]

Steve Loughran commented on BAHIR-67:
-

This is what confuses me: all you should need to do for WebHDFS access is use 
the right path, one that begins {{webhdfs://}}; the implementation is in the 
hadoop-hdfs/hadoop-hdfs-client JAR and it should just work. What would be useful 
is integration tests; a MiniDFSCluster can be brought up with WebHDFS enabled 
for that testing.
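
As a rough illustration of that point (the namenode host, port and path below 
are placeholders, and this assumes the hadoop-hdfs-client classes are already 
on Spark's classpath):

{code:scala}
// Sketch: WebHDFS access should only need a webhdfs:// URI once the
// WebHdfsFileSystem implementation (hadoop-hdfs / hadoop-hdfs-client) is on the classpath.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("webhdfs-read-sketch")   // made-up application name
  .getOrCreate()

// namenode host, HTTP port and path are placeholders
val df = spark.read.text("webhdfs://namenode.example.com:50070/data/sample.txt")
df.show()
{code}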

> WebHDFS Data Source for Spark SQL
> -
>
> Key: BAHIR-67
> URL: https://issues.apache.org/jira/browse/BAHIR-67
> Project: Bahir
>  Issue Type: New Feature
>  Components: Spark SQL Data Sources
>Reporter: Sourav Mazumder
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Ability to read/write data in Spark from/to HDFS of a remote Hadoop cluster.
> In today's world of analytics, many use cases need the capability to access 
> data from multiple remote data sources in Spark. Though Spark integrates well 
> with a local Hadoop cluster, it is weak at connecting to a remote Hadoop 
> cluster. In reality, however, not all enterprise data lives in Hadoop, and 
> running the Spark cluster co-located with the Hadoop cluster is not always a 
> solution.
> In this improvement we propose to create a connector for accessing data (read 
> and write) from/to HDFS of a remote Hadoop cluster from Spark, using the 
> WebHDFS API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (BAHIR-67) Ability to read/write data in Spark from/to HDFS of a remote Hadoop Cluster

2016-10-13 Thread Steve Loughran (JIRA)

[ https://issues.apache.org/jira/browse/BAHIR-67?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15572371#comment-15572371 ]

Steve Loughran commented on BAHIR-67:
-

Is this really just a matter of getting the Hadoop WebHDFS classes on the classpath?
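
If so, a build-level sketch may be all that's needed; the versions below are 
placeholders, and hadoop-hdfs-client only exists as a separate artifact from 
Hadoop 2.8 onwards (earlier releases ship these classes in hadoop-hdfs):

{code:scala}
// build.sbt sketch (illustrative versions only): put the WebHDFS client classes
// on the classpath alongside the provided Spark dependency.
libraryDependencies ++= Seq(
  "org.apache.spark"  %% "spark-sql"          % "2.0.0" % "provided",
  "org.apache.hadoop" %  "hadoop-hdfs-client" % "2.8.0"   // use hadoop-hdfs for Hadoop < 2.8
)
{code}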

> Ability to read/write data in Spark from/to HDFS of a remote Hadoop Cluster
> ---
>
> Key: BAHIR-67
> URL: https://issues.apache.org/jira/browse/BAHIR-67
> Project: Bahir
>  Issue Type: Improvement
>  Components: Spark SQL Data Sources
>Affects Versions: Not Applicable
>Reporter: Sourav Mazumder
> Fix For: Spark-2.0.0
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> In today's world of analytics, many use cases need the capability to access 
> data from multiple remote data sources in Spark. Though Spark integrates well 
> with a local Hadoop cluster, it is weak at connecting to a remote Hadoop 
> cluster. In reality, however, not all enterprise data lives in Hadoop, and 
> running the Spark cluster co-located with the Hadoop cluster is not always a 
> solution.
> In this improvement we propose to create a connector for accessing data (read 
> and write) from/to HDFS of a remote Hadoop cluster from Spark, using the 
> WebHDFS API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (BAHIR-67) Ability to read/write data in Spark from/to HDFS of a remote Hadoop Cluster

2016-10-13 Thread Steve Loughran (JIRA)

[ https://issues.apache.org/jira/browse/BAHIR-67?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15573088#comment-15573088 ]

Steve Loughran commented on BAHIR-67:
-

This is very much a sibling of the SPARK-7481 patch, where I've been trying to 
add a module for dependencies and tests. Ignoring the problem of getting a 
WebHDFS JAR into SPARK_HOME/jars, the tests in that module should cover what's 
needed, both in terms of operations (basic IO) and the more minimal 
classpath/config checking.

I think you could bring up a MiniDFSCluster in WebHDFS mode, and so have a 
functional test of things.
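
A rough sketch of such a functional test, assuming the hadoop-hdfs test 
artifact is on the test classpath (the config key shown is the Hadoop 2.x 
name):

{code:scala}
// Sketch only: bring up a MiniDFSCluster with WebHDFS enabled, derive a
// webhdfs:// URI from the namenode's HTTP address, then run IO against it.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hdfs.{DFSConfigKeys, MiniDFSCluster}

val conf = new Configuration()
conf.setBoolean(DFSConfigKeys.DFS_WEBHDFS_ENABLED_KEY, true)   // "dfs.webhdfs.enabled"

val cluster = new MiniDFSCluster.Builder(conf).numDataNodes(1).build()
try {
  cluster.waitActive()
  val httpPort = cluster.getNameNode.getHttpAddress.getPort
  val webhdfsRoot = s"webhdfs://localhost:$httpPort/"
  // ... point the Spark read/write tests at webhdfsRoot here ...
} finally {
  cluster.shutdown()
}
{code}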

> Ability to read/write data in Spark from/to HDFS of a remote Hadoop Cluster
> ---
>
> Key: BAHIR-67
> URL: https://issues.apache.org/jira/browse/BAHIR-67
> Project: Bahir
>  Issue Type: Improvement
>  Components: Spark SQL Data Sources
>Reporter: Sourav Mazumder
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> In today's world of analytics, many use cases need the capability to access 
> data from multiple remote data sources in Spark. Though Spark integrates well 
> with a local Hadoop cluster, it is weak at connecting to a remote Hadoop 
> cluster. In reality, however, not all enterprise data lives in Hadoop, and 
> running the Spark cluster co-located with the Hadoop cluster is not always a 
> solution.
> In this improvement we propose to create a connector for accessing data (read 
> and write) from/to HDFS of a remote Hadoop cluster from Spark, using the 
> WebHDFS API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (BAHIR-213) Faster S3 file Source for Structured Streaming with SQS

2019-07-30 Thread Steve Loughran (JIRA)


[ https://issues.apache.org/jira/browse/BAHIR-213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16895999#comment-16895999 ]

Steve Loughran commented on BAHIR-213:
--

BTW, because of the delay between an S3 change and the event being processed, 
there's a risk of changes in the store happening before the stream handler sees 
them:

1. POST path
2. event #1 queued
3. DELETE path
4. event #2 queued
5. event #1 received
6. FNFE when querying the file

Also: double update

1. POST path
2. event #1 queued
3. POST path
4. event #2 queued
5. event #1 received
6. contents of path are at state (3)
7. event #2 received even though the state hasn't changed

There are also two other issues:
* the risk of events arriving out of order.
* the risk of a previous state of the file (contents or tombstone) being seen 
when processing event #1.

What does that mean? I think it means that you need to handle
* the file potentially missing when you receive the event... but you still need 
to handle the possibility that a tombstone was cached before the POST #1 
operation, so you may want to spin a bit awaiting its arrival.
* the file details at processing time differing from those in the event data.

The best thing to do here is demand that every file uploaded MUST have a unique 
name, while making sure that the new stream source is resilient to changes (i.e. 
downgrades if the source file isn't there...), without offering any guarantees 
of correctness.
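
For illustration, a hedged sketch of that "downgrade rather than fail" 
handling; the helper name and retry policy below are made up and not part of 
the proposed source:

{code:scala}
// Sketch: when an SQS event arrives, the object may not be visible yet
// (tombstone / delayed visibility) or may already have been deleted.
// Retry briefly on FileNotFoundException, then skip rather than fail.
import java.io.FileNotFoundException
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

def statusForEvent(fs: FileSystem, path: Path, attempts: Int = 3): Option[FileStatus] = {
  var remaining = attempts
  while (remaining > 0) {
    try {
      return Some(fs.getFileStatus(path))   // use the *current* status, which may
    } catch {                               // differ from the metadata in the event
      case _: FileNotFoundException =>
        remaining -= 1
        if (remaining > 0) Thread.sleep(1000L)   // spin a bit awaiting arrival
    }
  }
  None   // still missing: treat as deleted and skip the event
}
{code}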






> Faster S3 file Source for Structured Streaming with SQS
> ---
>
> Key: BAHIR-213
> URL: https://issues.apache.org/jira/browse/BAHIR-213
> Project: Bahir
>  Issue Type: New Feature
>  Components: Spark Structured Streaming Connectors
>Affects Versions: Spark-2.4.0
>Reporter: Abhishek Dixit
>Priority: Major
>
> Using FileStreamSource to read files from an S3 bucket has problems both in 
> terms of costs and latency:
>  * *Latency:* Listing all the files in S3 buckets every microbatch can be 
> both slow and resource intensive.
>  * *Costs:* Making List API requests to S3 every microbatch can be costly.
> The solution is to use Amazon Simple Queue Service (SQS) which lets you find 
> new files written to S3 bucket without the need to list all the files every 
> microbatch.
> S3 buckets can be configured to send notifications to an Amazon SQS queue on 
> Object Create / Object Delete events. For details, see the AWS documentation: 
> [Configuring S3 Event 
> Notifications|https://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html]
>  
> Spark can leverage this to find new files written to S3 bucket by reading 
> notifications from SQS queue instead of listing files every microbatch.
> I hope to contribute changes proposed in [this pull 
> request|https://github.com/apache/spark/pull/24934] to Apache Bahir as 
> suggested by @[gaborgsomogyi|https://github.com/gaborgsomogyi]  
> [here|https://github.com/apache/spark/pull/24934#issuecomment-511389130]



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (BAHIR-213) Faster S3 file Source for Structured Streaming with SQS

2019-07-26 Thread Steve Loughran (JIRA)


[ https://issues.apache.org/jira/browse/BAHIR-213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16893718#comment-16893718 ]

Steve Loughran commented on BAHIR-213:
--

I've thought about doing this for a while. Note that the event queue is itself 
inconsistent; you still need to scan for changes sporadically.
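
As a rough illustration of that reconciliation idea (all names and the cadence 
below are hypothetical, not the proposed connector's API):

{code:scala}
// Sketch: treat SQS events as the fast path, and fall back to a full listing
// every so often, since events can be delayed, duplicated or lost entirely.
def filesForBatch(batchId: Long,
                  eventedPaths: Seq[String],
                  listAll: () => Seq[String],
                  reconcileEvery: Int = 100): Seq[String] = {
  if (reconcileEvery > 0 && batchId % reconcileEvery == 0) {
    listAll()          // sporadic full scan catches anything the queue missed
  } else {
    eventedPaths       // normal case: only what the SQS notifications reported
  }
}
{code}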

The other cloud infras provide similar event queues (at least Azure does); it 
would be good to design this to be somewhat independent of the specific store.

Testing will be fun.

> Faster S3 file Source for Structured Streaming with SQS
> ---
>
> Key: BAHIR-213
> URL: https://issues.apache.org/jira/browse/BAHIR-213
> Project: Bahir
>  Issue Type: New Feature
>  Components: Spark Structured Streaming Connectors
>Affects Versions: Spark-2.3.0, Spark-2.4.0
>Reporter: Abhishek Dixit
>Priority: Major
>
> Using FileStreamSource to read files from an S3 bucket has problems both in 
> terms of costs and latency:
>  * *Latency:* Listing all the files in S3 buckets every microbatch can be 
> both slow and resource intensive.
>  * *Costs:* Making List API requests to S3 every microbatch can be costly.
> The solution is to use Amazon Simple Queue Service (SQS) which lets you find 
> new files written to S3 bucket without the need to list all the files every 
> microbatch.
> S3 buckets can be configured to send notifications to an Amazon SQS queue on 
> Object Create / Object Delete events. For details, see the AWS documentation: 
> [Configuring S3 Event 
> Notifications|https://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html]
>  
> Spark can leverage this to find new files written to S3 bucket by reading 
> notifications from SQS queue instead of listing files every microbatch.
> I hope to contribute changes proposed in [this pull 
> request|https://github.com/apache/spark/pull/24934] to Apache Bahir as 
> suggested by @[gaborgsomogyi|https://github.com/gaborgsomogyi]  
> [here|https://github.com/apache/spark/pull/24934#issuecomment-511389130]



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)