Re: List of questions about spark

2016-05-30 Thread Ian
No, the limit is determined by your setup. If you run Spark on a YARN cluster,
the number of concurrent jobs is really limited by the resources allocated to
each job and by how the YARN queues are configured. For instance, with the
FIFO scheduler (the default), the first job can take up all the resources and
all the others have to wait until it is done. With the FAIR scheduler, on the
other hand, the number of jobs that run concurrently is limited only by the
resources available on the cluster.
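
As a rough sketch (the queue name and resource figures below are invented, not
from your setup), this is how a job can be pinned to a specific YARN queue with
an explicit resource cap, and how Spark's own FAIR scheduling mode can be
switched on:

  import org.apache.spark.{SparkConf, SparkContext}

  // Illustrative settings only: "analytics" is a hypothetical YARN queue name.
  val conf = new SparkConf()
    .setAppName("scheduling-example")
    .set("spark.yarn.queue", "analytics")      // submit to a dedicated YARN queue
    .set("spark.executor.instances", "10")     // cap the resources this job can grab
    .set("spark.executor.memory", "4g")
    .set("spark.scheduler.mode", "FAIR")       // fair scheduling of jobs inside this app
  val sc = new SparkContext(conf)

Note that spark.scheduler.mode controls how jobs are scheduled within a single
Spark application; how resources are shared between applications is governed by
the YARN queue configuration.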






Re: List of questions about spark

2016-05-26 Thread Ian
I'll attempt to answer a few of your questions:

Spark places no limit on the number of dimension or lookup tables. As long as
you have the disk space, you should have no problem. Obviously, joins across
dozens or hundreds of tables may take a while, since it's unlikely that you can
cache all of them. You may, however, be able to cache the (temporary) lookup
tables, which makes joins to the fact table(s) a lot faster.
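
To make that concrete, here is a minimal sketch in Scala (table and column
names such as dim_customer, fact_sales and customer_id are invented for the
example; sqlContext is the one the Spark shell provides):

  import org.apache.spark.sql.functions.broadcast

  // Cache the small lookup table so repeated joins don't re-read it from disk;
  // broadcast() additionally ships it to every executor and avoids a shuffle.
  val dim  = sqlContext.table("dim_customer").cache()
  val fact = sqlContext.table("fact_sales")

  val joined = fact.join(broadcast(dim), Seq("customer_id"))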

This also means there is no additional direct cost for Spark itself. You may
need more hardware because of the storage requirements, and perhaps more RAM to
handle more cached tables and more concurrency. With Spark you can at least
choose to keep tables in memory and spill to disk only when necessary;
MapReduce, by comparison, is entirely disk-based.
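
For example (sketch only; the table name is hypothetical and sqlContext is
again the one from the shell), you can pick the storage level explicitly so
that a cached table spills to disk instead of failing when memory runs short:

  import org.apache.spark.storage.StorageLevel

  val lookups = sqlContext.table("dim_product")
  lookups.persist(StorageLevel.MEMORY_AND_DISK)  // keep in RAM, spill to disk if needed
  lookups.count()                                // action to materialize the cache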

Window functions are supported either through HiveQL (i.e. via SQLContext.sql
or HiveContext.sql; in Spark 2.0 these will share the same entry point) or
through the API functions:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$
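
A minimal DataFrame-API sketch of a window function (the employees DataFrame
and its department/salary columns are made up for the example):

  import org.apache.spark.sql.expressions.Window
  import org.apache.spark.sql.functions.{col, rank}

  // Rank employees by salary within each department.
  val w = Window.partitionBy("department").orderBy(col("salary").desc)
  val ranked = employees.withColumn("salary_rank", rank().over(w))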

In the API you'll also find the other functions you're looking for. It's also
worth checking the Hive documentation, because that functionality is available
to you as well. For instance, in 1.6 there is no native DataFrame equivalent of
Hive's LATERAL VIEW OUTER, but since you have access to it through the sql()
method, that's not a real limitation; there just is no native method in the API.
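
As a sketch (the articles table and its tags array column are invented, and
hiveContext is assumed to be an existing HiveContext), this is how it looks
through the SQL interface:

  // LATERAL VIEW OUTER keeps rows whose tags array is empty or NULL.
  val exploded = hiveContext.sql("""
    SELECT id, tag
    FROM articles
    LATERAL VIEW OUTER explode(tags) t AS tag
  """)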

Technically there are no limitations on joins, although they will take longer
on bigger tables; caching really helps here. Nested queries are no problem
either: you can always use SQLContext.sql or HiveContext.sql, which gives you a
normal SQL interface.
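
A quick sketch of a nested query through that SQL interface (the customers and
orders DataFrames and their columns are invented; registerTempTable is the 1.x
API):

  customers.registerTempTable("customers")
  orders.registerTempTable("orders")

  // Subquery in the FROM clause: total spend per customer, then filter.
  val topCustomers = sqlContext.sql("""
    SELECT c.name, t.total
    FROM customers c
    JOIN (SELECT customer_id, SUM(amount) AS total
          FROM orders
          GROUP BY customer_id) t
      ON c.id = t.customer_id
    WHERE t.total > 1000
  """)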

Spark has APIs for Scala, Java, Python and R.

By the way, I assume you mean 'a billion rows'. Most of your questions are
answered in the official Spark documentation, so please have a look there too.


