Flink Query Optimizer

2018-07-14 Thread Albert Jonathan
Hello,

I am just wondering, does Flink use Apache Calcite's query optimizer to
generate an optimal logical plan for stream queries, or does it have its
own independent query optimizer?
From what I have observed so far, Flink's query optimizer only groups
operators together without changing their order (e.g., reordering joins).
Did I miss anything?

I am thinking of extending Flink to apply query optimizations as done in the
DBMS context, either by integrating it with Calcite or by implementing a new
module.
Any feedback or guidelines will be highly appreciated.

Thank you,
Albert


Re: Flink session on Yarn - ClassNotFoundException

2017-09-01 Thread Albert Giménez
Thanks for the replies :)

I managed to get it working following the instructions here
(https://ci.apache.org/projects/flink/flink-docs-release-1.3/setup/building.html#vendor-specific-versions),
but I found a few issues that I guess were specific to HDInsight, or at least
to the HDP version it uses. Trying to summarize:

Hadoop version
After running "hadoop version", the result was "2.7.3.2.6.1.3-4".
However, when building I was getting errors that some dependencies from the
Hortonworks repo were not found, for instance ZooKeeper "3.4.6.2.6.1.3-4".
I browsed the Hortonworks repo
(http://repo.hortonworks.com/content/repositories/releases/org/apache/zookeeper/zookeeper/)
to find a suitable version, so I ended up using 2.7.3.2.6.1.31-3 instead.

Scala version
I also had issues with dependencies when I tried using Scala version 2.11.11,
so I compiled against 2.11.7.

So, the Maven command I used was this:
mvn install -DskipTests -Dscala.version=2.11.7 -Pvendor-repos -Dhadoop.version=2.7.3.2.6.1.31-3
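
For what it's worth, the build puts the distribution under build-target/ in
the source tree, so a quick sanity check (assuming the usual Flink source
layout) is:

ls build-target/lib    # the jars Flink will put on its classpath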

Azure Jars
With all that, I still had "class not found" errors when trying to start
my Flink session, for instance "java.lang.ClassNotFoundException: Class
org.apache.hadoop.fs.adl.HdiAdlFileSystem".
To fix that, I had to find the Azure-specific jars I needed to use. For that,
I checked which jars Spark was using and copied / symlinked them into Flink's
"lib" directory:
/usr/hdp/current/spark2-client/jars/*azure*.jar
/usr/lib/hdinsight-datalake/adls2-oauth2-token-provider.jar
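
In case it helps, the linking step was roughly this (a sketch; FLINK_HOME is
my own shorthand for wherever your Flink distribution lives):

export FLINK_HOME=/opt/flink    # adjust to your install
# Symlink the Azure filesystem jars that Spark already ships into Flink's lib
ln -s /usr/hdp/current/spark2-client/jars/*azure*.jar "$FLINK_HOME/lib/"
ln -s /usr/lib/hdinsight-datalake/adls2-oauth2-token-provider.jar "$FLINK_HOME/lib/"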

Guava Jar
Finally, for some reason my jobs were failing: the Cassandra driver was
complaining about the Guava version being too old, although I had the right
version in my assembled jar.
I just downloaded the version I needed (in my case, 23.0:
http://central.maven.org/maven2/com/google/guava/guava/23.0/guava-23.0.jar)
and also put that into Flink's lib directory.
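
That last step boils down to (same FLINK_HOME shorthand as above):

# Fetch the Guava version the Cassandra driver expects into Flink's lib
wget -P "$FLINK_HOME/lib" http://central.maven.org/maven2/com/google/guava/guava/23.0/guava-23.0.jar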

I hope it helps other people trying to run Flink on Azure HDInsight :)

Kind regards,

Albert

> On Aug 31, 2017, at 8:18 PM, Banias H <banias4sp...@gmail.com> wrote:
> 
> We had the same issue. Get the HDP version from, for example,
> /usr/hdp/current/hadoop-client/hadoop-common-<version>.jar. Then
> rebuild Flink from source:
> mvn clean install -DskipTests -Pvendor-repos -Dhadoop.version=<version>
> 
> for example: mvn clean install -DskipTests -Pvendor-repos -Dhadoop.version=2.7.3.2.6.1.0-129
> 
> Copy build-target/ to the cluster and set it up. Export HADOOP_CONF_DIR and
> YARN_CONF_DIR according to your env. You should have no problem starting the
> session.
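>
> For example (the /etc/hadoop/conf path below is just the usual location;
> adjust it to your env):
> export HADOOP_CONF_DIR=/etc/hadoop/conf
> export YARN_CONF_DIR=/etc/hadoop/conf
> ./bin/yarn-session.sh -n 2    # starts a session with 2 TaskManagers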
> 
> 
> On Wed, Aug 30, 2017 at 6:45 AM, Federico D'Ambrosio <fedex...@gmail.com> wrote:
> Hi,
> What is your "hadoop version" output? I'm asking because you said your Hadoop
> distribution is in /usr/hdp, so it looks like you're using Hortonworks HDP,
> just like me. That makes it a third-party distribution, so you'd need to
> build Flink from source according to this:
> https://ci.apache.org/projects/flink/flink-docs-release-1.3/setup/building.html#vendor-specific-versions
> 
> Federico D'Ambrosio
> 
> On 30 Aug 2017 13:33, "albert" <alb...@datacamp.com> wrote:
> Hi Chesnay,
> 
> Thanks for your reply. I did download the binaries matching my Hadoop
> version (2.7), that's why I was wondering if the issue had something to do
> with the exact Hadoop version Flink is compiled against, or if there might
> be things that are missing in my environment.
> 
> 
> 
> 



Re: Flink session on Yarn - ClassNotFoundException

2017-08-30 Thread albert
Hi Chesnay,

Thanks for your reply. I did download the binaries matching my Hadoop
version (2.7), that's why I was wondering if the issue had something to do
with the exact Hadoop version Flink is compiled against, or if there might
be things that are missing in my environment.





Flink session on Yarn - ClassNotFoundException

2017-08-29 Thread Albert Giménez
Hi,

I'm trying to start a Flink (1.3.2) session as explained in the docs
(https://ci.apache.org/projects/flink/flink-docs-release-1.3/setup/yarn_setup.html#start-a-session),
but I keep getting the "ClassNotFoundException" you can see below.

I’m running an HDInsight cluster on Azure (HDFS and YARN v2.7.3), and I’m 
exporting my HADOOP_CONF_DIR before running the yarn-session command.
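
For reference, this is roughly what I run (a sketch; the config path and
memory sizes are from my own setup, not something the docs mandate):

export HADOOP_CONF_DIR=/etc/hadoop/conf
./bin/yarn-session.sh -n 2 -jm 1024 -tm 2048    # 2 TaskManagers; JM/TM memory in MB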

Error while deploying YARN cluster: Couldn't deploy Yarn cluster
java.lang.RuntimeException: Couldn't deploy Yarn cluster
    at org.apache.flink.yarn.AbstractYarnClusterDescriptor.deploy(AbstractYarnClusterDescriptor.java:443)
    at org.apache.flink.yarn.cli.FlinkYarnSessionCli.run(FlinkYarnSessionCli.java:630)
    at org.apache.flink.yarn.cli.FlinkYarnSessionCli$1.call(FlinkYarnSessionCli.java:486)
    at org.apache.flink.yarn.cli.FlinkYarnSessionCli$1.call(FlinkYarnSessionCli.java:483)
    at org.apache.flink.runtime.security.HadoopSecurityContext$1.run(HadoopSecurityContext.java:43)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:40)
    at org.apache.flink.yarn.cli.FlinkYarnSessionCli.main(FlinkYarnSessionCli.java:483)
Caused by: java.lang.RuntimeException: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.yarn.client.RequestHedgingRMFailoverProxyProvider not found
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2227)
    at org.apache.hadoop.yarn.client.RMProxy.createRMFailoverProxyProvider(RMProxy.java:161)
    at org.apache.hadoop.yarn.client.RMProxy.createRMProxy(RMProxy.java:94)
    at org.apache.hadoop.yarn.client.ClientRMProxy.createRMProxy(ClientRMProxy.java:72)
    at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceStart(YarnClientImpl.java:187)
    at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
    at org.apache.flink.yarn.AbstractYarnClusterDescriptor.getYarnClient(AbstractYarnClusterDescriptor.java:381)
    at org.apache.flink.yarn.AbstractYarnClusterDescriptor.deployInternal(AbstractYarnClusterDescriptor.java:458)
    at org.apache.flink.yarn.AbstractYarnClusterDescriptor.deploy(AbstractYarnClusterDescriptor.java:441)
    ... 9 more
Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.yarn.client.RequestHedgingRMFailoverProxyProvider not found
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2219)
    ... 17 more
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.yarn.client.RequestHedgingRMFailoverProxyProvider not found
    at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2101)
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
    ... 18 more


So far, I've checked that the YARN classpath is correct (it is). I also tried
manually linking the "hadoop-yarn-common" jar from my /usr/hdp directories into
Flink's "lib" directory. In that case, I get an "IllegalAccessError" instead:

Exception in thread "main" java.lang.IllegalAccessError: tried to access method org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider.getProxyInternal()Ljava/lang/Object; from class org.apache.hadoop.yarn.client.RequestHedgingRMFailoverProxyProvider
    at org.apache.hadoop.yarn.client.RequestHedgingRMFailoverProxyProvider.init(RequestHedgingRMFailoverProxyProvider.java:75)
    at org.apache.hadoop.yarn.client.RMProxy.createRMFailoverProxyProvider(RMProxy.java:163)
    at org.apache.hadoop.yarn.client.RMProxy.createRMProxy(RMProxy.java:94)
    at org.apache.hadoop.yarn.client.ClientRMProxy.createRMProxy(ClientRMProxy.java:72)
    at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceStart(YarnClientImpl.java:187)
    at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
    at org.apache.flink.yarn.AbstractYarnClusterDescriptor.getYarnClient(AbstractYarnClusterDescriptor.java:381)
    at org.apache.flink.yarn.AbstractYarnClusterDescriptor.deployInternal(AbstractYarnClusterDescriptor.java:458)
    at org.apache.flink.yarn.AbstractYarnClusterDescriptor.deploy(AbstractYarnClusterDescriptor.java:441)
    at org.apache.flink.yarn.cli.FlinkYarnSessionCli.run(FlinkYarnSessionCli.java:630)
    at org.apache.flink.yarn.cli.FlinkYarnSessionCli$1.call(FlinkYarnSessionCli.java:486)
    at

Re: Multiple consumers on a subpartition

2017-04-26 Thread Albert Jonathan
Thank you for your responses and suggestions. I appreciate it.

Albert

On Wed, Apr 26, 2017 at 4:19 AM, Ufuk Celebi <u...@apache.org> wrote:

> Adding to what Zhijiang said: I think the way to go would be to create
> multiple "read views" over the pipelined subpartition. You would have
> to make sure that the initial reference count of the partition buffers
> is incremented accordingly. The producer will be back pressured by
> both consumers now. This could be undesired in some scenarios.
> Currently, consumers are kept independent of each other by creating
> a separate partition (with its own subpartitions) for each consumer.
>
>
>
> On Wed, Apr 26, 2017 at 5:58 AM, Zhijiang(wangzhijiang999)
> <wangzhijiang...@aliyun.com> wrote:
> > Hi albert,
> >
> > As far as I know, if the upstream data will be consumed by multiple
> > consumers, Flink generates multiple subpartitions, and each subpartition
> > corresponds to one input channel (consumer).
> > So there is a one-to-one correspondence among subpartition ->
> > subpartition view -> input channel.
> >
> > cheers,
> > zhijiang
> >
> > --
> > From: albertjonathan <alb...@cs.umn.edu>
> > Sent: Wednesday, 26 April 2017, 02:37
> > To: user <user@flink.apache.org>
> > Subject: Multiple consumers on a subpartition
> >
> > Hello,
> >
> > Is there a way for Flink to allow a (pipelined) subpartition to be
> > consumed by multiple consumers? If not, would it make more sense to
> > implement this as multiple input channels for a single subpartition, or
> > as multiple subpartition views, one for each input channel?
> >
> > Any suggestion is appreciated.
> >
> > Thanks
> >
> >
> >
>