I'll attempt to answer a few of your questions. Spark places no limit on the number of dimension or lookup tables; as long as you have disk space, you should have no problem. Obviously, joins across dozens or hundreds of tables may take a while, since it's unlikely you can cache all of them. You may, however, be able to cache the (temporary) lookup tables, which makes joins against the fact table(s) a lot faster.
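For example, here is a minimal sketch of caching a small lookup table before joining it to a fact table (Spark 1.6-style API; the names dim_customer, fact_sales and the join columns are made up for illustration):

    // Cache the small dimension table so repeated joins read it from memory
    val dim = sqlContext.table("dim_customer")
    dim.cache()

    val facts = sqlContext.table("fact_sales")
    val joined = facts.join(dim, facts("customer_id") === dim("id"))
    joined.show()

The cache is populated lazily, on the first action that touches the table, so an explicit dim.count() is a common way to warm it up front.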
This also means that there is no additional direct cost for Spark itself. You may need more hardware to cover the storage requirements, and perhaps more RAM to handle additional cached tables and higher concurrency. With Spark you can at least choose to persist tables in memory and spill to disk only when necessary; MapReduce, by contrast, is 100% disk-based.

Windowing functions are supported both in HiveQL (i.e. via SQLContext.sql or HiveContext.sql; in Spark 2.0 these will share a single entry point) and via API functions: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$ In that API you'll also find the other functions you're looking for. It's also worth checking the Hive documentation, because HiveQL is available too. For instance, Spark 1.6 has no native API equivalent of Hive's LATERAL VIEW OUTER, but since you can reach it through the sql() method, that's not a real limitation; there just is no native method in the API.

Technically there are no limitations on joins either, although joins of bigger tables will take longer, and caching really helps you out there. Nested queries are no problem: you can always go through SQLContext.sql or HiveContext.sql, which gives you a normal SQL interface.

Spark has APIs for Scala, Java, Python and R. By the way, I assume you mean 'a billion rows'. Most of your questions are answered in the official Spark documentation, so please have a look there too. A few sketches of the points above follow.
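To illustrate the in-memory-with-spill behaviour mentioned above, a minimal sketch (the table name dim_product is made up; persist and StorageLevel are the real Spark APIs):

    import org.apache.spark.storage.StorageLevel

    val dim = sqlContext.table("dim_product")
    // MEMORY_AND_DISK keeps partitions in memory and spills them to disk
    // only when they don't fit, instead of recomputing them later
    dim.persist(StorageLevel.MEMORY_AND_DISK)

Note that cache() on an RDD is shorthand for persist(MEMORY_ONLY), while for DataFrames the default storage level is already MEMORY_AND_DISK.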
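For the windowing functions, here is a sketch of both routes in 1.6, assuming a HiveContext named hiveContext and an illustrative sales table with region and amount columns:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._

    val sales = hiveContext.table("sales")

    // API route: rank rows within each region by amount, descending
    val w = Window.partitionBy("region").orderBy(desc("amount"))
    val ranked = sales.withColumn("rank", rank().over(w))

    // HiveQL route: the same query through sql()
    val ranked2 = hiveContext.sql(
      "SELECT *, rank() OVER (PARTITION BY region ORDER BY amount DESC) AS rank FROM sales")

In 1.6 window functions still require a HiveContext; from 2.0 on the unified entry point covers both.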
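And the LATERAL VIEW OUTER case, which has no native DataFrame method in 1.6 but works fine through sql() (the orders table with an items array column is made up for illustration):

    val exploded = hiveContext.sql("""
      SELECT id, item
      FROM orders
      LATERAL VIEW OUTER explode(items) t AS item
    """)

The OUTER keyword keeps rows whose array is empty or NULL, which the plain explode() function in the 1.6 DataFrame API would drop.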
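Finally, a sketch of a nested query through the plain SQL interface (again with made-up table and column names):

    val top = sqlContext.sql("""
      SELECT region, total
      FROM (SELECT region, SUM(amount) AS total
            FROM sales
            GROUP BY region) agg
      WHERE total > 1000
    """)

Subqueries in the FROM clause like this one (note the required alias, agg) work directly; you can of course build the same thing step by step with the DataFrame API if you prefer.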