[jira] [Commented] (FLINK-17961) Create an Elasticsearch source

2020-06-04 Thread Etienne Chauchot (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17125944#comment-17125944
 ] 

Etienne Chauchot commented on FLINK-17961:
--

I just commented in the existing design discussion thread: 
[https://lists.apache.org/thread.html/r33cd907cecfd125ab1164ddc8a4d8e45d6bd3afd332fbb034881b1ff%40%3Cdev.flink.apache.org%3E]

> Create an Elasticsearch source
> --
>
> Key: FLINK-17961
> URL: https://issues.apache.org/jira/browse/FLINK-17961
> Project: Flink
> Issue Type: New Feature
> Components: Connectors / ElasticSearch
> Reporter: Etienne Chauchot
> Priority: Minor
>
> There is only an Elasticsearch sink available. There are open-source GitHub 
> repos such as [this 
> one|https://github.com/mnubo/flink-elasticsearch-source-connector]. Also, 
> the Apache Bahir project does not provide an Elasticsearch source connector 
> for Flink either. IMHO the project would benefit from having a bundled 
> source connector for ES alongside the available sink connector.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-17961) Create an Elasticsearch source

2020-06-04 Thread Etienne Chauchot (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17125806#comment-17125806
 ] 

Etienne Chauchot commented on FLINK-17961:
--

[~chesnay] An ES source can definitely hide the overall complexity from the 
user. As an example, in Apache Beam ([available 
here|https://github.com/apache/beam/blob/e1963c11f9a853564d62f83993dec08ed8a9321f/sdks/java/io/elasticsearch/src/main/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIO.java#L156])
we use sliced scroll to split the input collection for parallel reading, and 
we apply it to the user's ES query or to a default _select * from index_ when 
no query is provided. Thus, the user API remains simple: 
_ESIO.read().from(index).withQuery(query)_. My worries here relate more to the 
streaming and failover capabilities raised by Aljoscha. Even though ES is a 
main source (not an enrichment one, IMO), it does not meet some of Flink's 
expectations (cf. comments above). So the question reduces to: is it still 
worth investing time in an ES source? Regarding the thread on an ES table 
source, I'll read it and comment if I have anything useful to say.
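
For illustration only, here is a minimal Python sketch of how per-slice request 
bodies for a sliced scroll could be built. The function name 
sliced_scroll_bodies is hypothetical and this is not the Beam code; it assumes 
each body is sent by one parallel reader as a scroll search on the index, and 
that, as far as I know, Elasticsearch rejects a slice whose max is 1, so the 
slice clause is omitted in that case.

```python
def sliced_scroll_bodies(num_slices, user_query=None):
    """Build one search request body per slice (hypothetical helper)."""
    if num_slices < 1:
        raise ValueError("num_slices must be at least 1")
    # Fall back to match_all, the ES equivalent of "select * from index".
    query = user_query if user_query is not None else {"match_all": {}}
    bodies = []
    for slice_id in range(num_slices):
        body = {"query": query}
        # Assumption: a slice clause is only valid when max is greater than 1.
        if num_slices > 1:
            body["slice"] = {"id": slice_id, "max": num_slices}
        bodies.append(body)
    return bodies


if __name__ == "__main__":
    for body in sliced_scroll_bodies(2):
        print(body)
```

Each reader would then scroll through its own slice independently, so the 
user only has to provide the index and, optionally, a query.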



[jira] [Commented] (FLINK-17961) Create an Elasticsearch source

2020-06-03 Thread Chesnay Schepler (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17124925#comment-17124925
 ] 

Chesnay Schepler commented on FLINK-17961:
--

There's an ongoing discussion on the mailing list about adding an 
Elasticsearch source, with a design document; it is primarily aimed at the 
Table API though.

Overall I'm somewhat skeptical given how many different ways there are to read 
data from Elasticsearch: basic queries return only a limited number of 
values, and pagination/scrolling works slightly differently in each case and 
partially relies on the initial query (after_key).
Unless there's a client that abstracts all these differences, it seems 
difficult to implement in a way where users don't have to configure a lot 
themselves.
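
To make the after_key point concrete, here is a small Python sketch of that 
style of pagination, where each request must carry the key returned by the 
previous response. fetch_page is a hypothetical in-memory stand-in for one 
request, not a real client call, and the response shape is simplified.

```python
import bisect


def fetch_page(sorted_keys, size, after_key=None):
    """Hypothetical stand-in for one paginated request over sorted keys."""
    # Resume strictly after the key returned by the previous page.
    start = 0 if after_key is None else bisect.bisect_right(sorted_keys, after_key)
    page = sorted_keys[start:start + size]
    return {"buckets": page, "after_key": page[-1] if page else None}


def read_all(sorted_keys, size):
    """Follow after_key from response to request until a page comes back empty."""
    collected, after_key = [], None
    while True:
        response = fetch_page(sorted_keys, size, after_key)
        if not response["buckets"]:
            return collected
        collected.extend(response["buckets"])
        after_key = response["after_key"]
```

The paging state lives in the responses rather than in a fixed offset, which 
is one reason a generic reader is harder to write than it looks.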



[jira] [Commented] (FLINK-17961) Create an Elasticsearch source

2020-06-03 Thread Aljoscha Krettek (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17124918#comment-17124918
 ] 

Aljoscha Krettek commented on FLINK-17961:
--

I think that's a good idea; I see ES more as a "side source" than a main 
source. People would probably use it to enrich a main stream of data? 
Unfortunately we don't have a general API for that currently, so yes, people 
would have to manually implement such lookups.
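
As a rough illustration of such a manual lookup, here is a Python sketch of an 
enrichment function with a small cache. The names make_enricher, lookup, 
user_id and profile are all hypothetical; lookup stands for whatever call 
would query ES for one key.

```python
def make_enricher(lookup, max_entries=1000):
    """Return a function that adds looked-up data to each event (sketch)."""
    cache = {}

    def enrich(event):
        key = event["user_id"]
        if key not in cache:
            if len(cache) >= max_entries:
                # Evict the oldest inserted entry to bound memory use.
                cache.pop(next(iter(cache)))
            cache[key] = lookup(key)
        # Return a new event rather than mutating the input.
        return {**event, "profile": cache[key]}

    return enrich
```

The cache avoids one ES request per record, at the cost of possibly serving 
stale data when the index changes.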



[jira] [Commented] (FLINK-17961) Create an Elasticsearch source

2020-05-29 Thread Etienne Chauchot (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119635#comment-17119635
 ] 

Etienne Chauchot commented on FLINK-17961:
--

Thanks Aljoscha for commenting. ES has a data streams feature, but only for 
time series data; the aim of this source is to read all kinds of data. Apart 
from data streams, it behaves like a database: you read the content of an 
index (similar to a table) matching the given query (similar to SQL). So, 
regarding streaming changes, if there are changes between two read requests, 
the second request reads the whole index (including the change) again. 
Regarding failover: I guess exactly-once semantics cannot be guaranteed, only 
at-least-once, since there is no ack mechanism for already-read data. Under 
those circumstances, I guess an ES source cannot get into Flink. So what 
should a user do to read from ES? Should they send ES requests manually from 
a Map?



[jira] [Commented] (FLINK-17961) Create an Elasticsearch source

2020-05-28 Thread Aljoscha Krettek (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17118872#comment-17118872
 ] 

Aljoscha Krettek commented on FLINK-17961:
--

I think this one is tricky. We should only add a source if we can also support 
it for streaming programs, meaning it has to be fault-tolerant and support 
reading updates. Is it possible to get, for example, a stream of changes from 
ES?



[jira] [Commented] (FLINK-17961) Create an Elasticsearch source

2020-05-28 Thread Etienne Chauchot (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17118740#comment-17118740
 ] 

Etienne Chauchot commented on FLINK-17961:
--

A question: I'm wondering whether the source needs to implement 
_CheckpointedFunction_ or a related interface. The ES source will be of batch 
type: it executes an ES query, stores the results, and serves them one by one 
via context.collect. For such a source, is it important to store the elements 
read so far in checkpointed state in case of failure, or should we just ignore 
the checkpointing mechanism and re-read from ES in case of error?
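
To make the trade-off concrete, here is a simplified Python model (not Flink 
code) comparing the two options. BatchSource and run_with_failure are 
hypothetical names; the model assumes the query returns results in a stable 
order, and the output list stands for a downstream consumer that is not 
rolled back on failure.

```python
class BatchSource:
    """Serves precomputed query results from a given offset (sketch)."""

    def __init__(self, results, offset=0):
        self.results = results
        self.offset = offset

    def emit(self, count, out):
        # Emit up to `count` records and advance the read position.
        end = min(self.offset + count, len(self.results))
        out.extend(self.results[self.offset:end])
        self.offset = end

    def snapshot(self):
        # The checkpointed state is just the read position.
        return self.offset


def run_with_failure(results, checkpoint_at, fail_at, restore):
    """Emit records, fail once after `fail_at` records, then recover."""
    out = []
    source = BatchSource(results)
    source.emit(checkpoint_at, out)
    saved_offset = source.snapshot()
    source.emit(fail_at - checkpoint_at, out)
    # Failure: resume from the checkpoint, or re-read everything from ES.
    source = BatchSource(results, saved_offset if restore else 0)
    source.emit(len(results), out)
    return out


if __name__ == "__main__":
    data = list(range(10))
    print(len(run_with_failure(data, 4, 7, restore=True)))
    print(len(run_with_failure(data, 4, 7, restore=False)))
```

In this model, restoring the offset replays only the records emitted after 
the checkpoint, whereas ignoring checkpointing replays everything emitted 
before the failure.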



[jira] [Commented] (FLINK-17961) Create an Elasticsearch source

2020-05-28 Thread Etienne Chauchot (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17118405#comment-17118405
 ] 

Etienne Chauchot commented on FLINK-17961:
--

[~aljoscha] can you assign me this ticket?



[jira] [Commented] (FLINK-17961) Create an Elasticsearch source

2020-05-27 Thread Etienne Chauchot (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17117590#comment-17117590
 ] 

Etienne Chauchot commented on FLINK-17961:
--

I was thinking about implementing it as a _RichParallelSourceFunction_ with 
an architecture similar to the simple _IntegerSource_. Is this the correct way 
to go?

Also, to connect to ES there is the low-level REST client, which supports all 
versions of ES but exposes a very low-level, string-based REST API, and there 
is the high-level REST client, which has a version for each ES backend. As the 
existing ES sink uses the high-level REST client and as packages are already 
divided into ES-version-specific modules, it is better to keep this 
organization and use the high-level client.
