If your join predicates are like "x < y" or "x <= y" you do not need to build the full Cartesian product in the Map function. Instead you can sort the broadcasted set or even build an index to reduce the number of predicate evaluations.
2015-02-21 11:40 GMT+01:00 Fabian Hueske <[email protected]>: > Hi, > > non-equi joins are only supported by building the cross product. > This is essentially the nested-loop join strategy, that a conventional > database system would chose. However, such joins are prohibitively > expensive when applied to large data sets. > If you have one small and another large data set, you can do the join by > broadcasting the smaller side to a MapFunction (withBroadcastSet() [1]) > that has the larger data set as regular input and evaluate the join > condition in the MapFunction. > > The problem with the Any key-selector is, that Flink needs to know the > types when the program is optimized because it generates type specific > serializers. I think an Any type does not work as join key. > > Best, Fabian > > > [1] > http://flink.apache.org/docs/0.8/programming_guide.html#broadcast-variables > > 2015-02-21 10:28 GMT+01:00 Vinh June <[email protected]>: > >> Hello, >> >> I have some questions concerning Join: >> >> 1. I would like to make join with different conditions, is there any way >> to >> create a Join with conditions different to "equalTo", for example, how >> would >> I make a join with > or >= >> >> 2. I have a DataSet[Map[String, Any]]. Is it possible to specify >> KeySelector >> using a map key? I tried to use below Scala code but it doesn't work >> >> Set1.join(Set2).where(_.get("key")).equalTo(_.get("key")) >> >> >> >> -- >> View this message in context: >> http://apache-flink-incubator-user-mailing-list-archive.2336050.n4.nabble.com/Some-questions-about-Join-tp780.html >> Sent from the Apache Flink (Incubator) User Mailing List archive. mailing >> list archive at Nabble.com. >> > >
