I'll attempt to answer a few of your questions. Spark places no limit on the number of dimension or lookup tables; as long as you have disk space, you should have no problem. Obviously, joins across dozens or hundreds of tables may take a while, since it's unlikely you can cache all of them. You may, however, be able to cache the (temporary) lookup tables, which makes joins against the fact table(s) a lot faster.
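For example, here is a minimal sketch of caching a small lookup table before joining it to a fact table (Spark 1.6-style API; the names dim_customer, fact_sales and the join columns are made up for illustration):

    // Cache the small dimension table so repeated joins read it from memory
    val dim = sqlContext.table("dim_customer")
    dim.cache()

    val facts = sqlContext.table("fact_sales")
    val joined = facts.join(dim, facts("customer_id") === dim("id"))
    joined.show()

The cache is populated lazily, on the first action that touches the table, so an explicit dim.count() is a common way to warm it up front.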
This also means that there is no additional direct cost for Spark itself. You may need more hardware to cover the storage requirements, and perhaps more RAM to handle additional cached tables and higher concurrency. With Spark you can at least choose to persist tables in memory and spill to disk only when necessary; MapReduce, by contrast, is 100% disk-based.

Windowing functions are supported both in HiveQL (i.e. via SQLContext.sql or HiveContext.sql; in Spark 2.0 these will share a single entry point) and via API functions: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$ In that API you'll also find the other functions you're looking for. It's also worth checking the Hive documentation, because HiveQL is available too. For instance, Spark 1.6 has no native API equivalent of Hive's LATERAL VIEW OUTER, but since you can reach it through the sql() method, that's not a real limitation; there just is no native method in the API.

Technically there are no limitations on joins either, although joins of bigger tables will take longer, and caching really helps you out there. Nested queries are no problem: you can always go through SQLContext.sql or HiveContext.sql, which gives you a normal SQL interface.

Spark has APIs for Scala, Java, Python and R. By the way, I assume you mean 'a billion rows'. Most of your questions are answered in the official Spark documentation, so please have a look there too. A few sketches of the points above follow.
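To illustrate the in-memory-with-spill behaviour mentioned above, a minimal sketch (the table name dim_product is made up; persist and StorageLevel are the real Spark APIs):

    import org.apache.spark.storage.StorageLevel

    val dim = sqlContext.table("dim_product")
    // MEMORY_AND_DISK keeps partitions in memory and spills them to disk
    // only when they don't fit, instead of recomputing them later
    dim.persist(StorageLevel.MEMORY_AND_DISK)

Note that cache() on an RDD is shorthand for persist(MEMORY_ONLY), while for DataFrames the default storage level is already MEMORY_AND_DISK.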
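For the windowing functions, here is a sketch of both routes in 1.6, assuming a HiveContext named hiveContext and an illustrative sales table with region and amount columns:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._

    val sales = hiveContext.table("sales")

    // API route: rank rows within each region by amount, descending
    val w = Window.partitionBy("region").orderBy(desc("amount"))
    val ranked = sales.withColumn("rank", rank().over(w))

    // HiveQL route: the same query through sql()
    val ranked2 = hiveContext.sql(
      "SELECT *, rank() OVER (PARTITION BY region ORDER BY amount DESC) AS rank FROM sales")

In 1.6 window functions still require a HiveContext; from 2.0 on the unified entry point covers both.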
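And the LATERAL VIEW OUTER case, which has no native DataFrame method in 1.6 but works fine through sql() (the orders table with an items array column is made up for illustration):

    val exploded = hiveContext.sql("""
      SELECT id, item
      FROM orders
      LATERAL VIEW OUTER explode(items) t AS item
    """)

The OUTER keyword keeps rows whose array is empty or NULL, which the plain explode() function in the 1.6 DataFrame API would drop.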
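Finally, a sketch of a nested query through the plain SQL interface (again with made-up table and column names):

    val top = sqlContext.sql("""
      SELECT region, total
      FROM (SELECT region, SUM(amount) AS total
            FROM sales
            GROUP BY region) agg
      WHERE total > 1000
    """)

Subqueries in the FROM clause like this one (note the required alias, agg) work directly; you can of course build the same thing step by step with the DataFrame API if you prefer.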