Re: Cache sparkSql data without uncompressing it in memory

2014-11-13 Thread Sadhan Sood
spark.sql.inMemoryColumnarStorage.compressed to true. This property is already set to true by default in master branch and branch-1.2. On 11/13/14 7:16 AM, Sadhan Sood wrote: We noticed while caching data from our hive tables which contain data in compressed sequence file format that it gets uncompressed in memory when

Too many failed collects when trying to cache a table in SparkSQL

2014-11-12 Thread Sadhan Sood
We are running Spark on YARN with 1 TB of combined memory. When we try to cache a table partition (which is 100G), we see many failed collect stages in the UI and the caching never succeeds. Because of the failed collects, the mapPartitions stages seem to keep getting resubmitted. We have more than

Re: Too many failed collects when trying to cache a table in SparkSQL

2014-11-12 Thread Sadhan Sood
(Logging.scala:logError(75)) - Asked to remove non-existent executor 372
2014-11-12 19:11:21,655 INFO scheduler.DAGScheduler (Logging.scala:logInfo(59)) - Executor lost: 372 (epoch 3)
On Wed, Nov 12, 2014 at 12:31 PM, Sadhan Sood sadhan.s...@gmail.com wrote: We are running spark on yarn with combined

Re: Too many failed collects when trying to cache a table in SparkSQL

2014-11-12 Thread Sadhan Sood
an output location for shuffle 0
The data is an LZO-compressed sequence file with a compressed size of ~26G. Is there a way to understand why the shuffle keeps failing for one partition? I believe we have enough memory to hold the uncompressed data. On Wed, Nov 12, 2014 at 2:50 PM, Sadhan Sood sadhan.s

Cache sparkSql data without uncompressing it in memory

2014-11-12 Thread Sadhan Sood
While caching data from our Hive tables, which are stored in compressed sequence file format, we noticed that the data gets uncompressed in memory when it is cached. Is there a way to turn this off and cache the compressed data as is?
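The reply in this thread points to the spark.sql.inMemoryColumnarStorage.compressed property. A minimal sketch of how that might be applied in a Spark SQL session follows; the table and partition names are borrowed from the "Partition caching taking too long" post and are only illustrative. Note the assumption here: this setting compresses the in-memory columnar cache, which is not the same as keeping the original sequence-file compression untouched.

```sql
-- Assumption: needed only on builds where the property is not already
-- enabled by default (the reply says master and branch-1.2 default to true).
SET spark.sql.inMemoryColumnarStorage.compressed=true;

-- Illustrative names (xyz, date_prefix) taken from a related post in this listing.
CACHE TABLE xyz_20141029 AS SELECT * FROM xyz WHERE date_prefix = 20141029;
```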

Re: thrift jdbc server probably running queries as hive query

2014-11-11 Thread Sadhan Sood
didn't start successfully because the HiveServer2 occupied the port, and your Beeline session was probably linked against HiveServer2. Cheng On 11/11/14 8:29 AM, Sadhan Sood wrote: I was testing out the spark thrift jdbc server by running a simple query in the beeline client. The spark

Partition caching taking too long

2014-11-11 Thread Sadhan Sood
While testing SparkSQL on top of our Hive metastore, we were trying to cache the data for one partition of the table in memory like this:
    CACHE TABLE xyz_20141029 AS SELECT * FROM xyz WHERE date_prefix = 20141029
Table xyz is a Hive table which is partitioned by date_prefix. The data is

getting exception when trying to build spark from master

2014-11-10 Thread Sadhan Sood
Getting an exception while trying to build spark in spark-core:
[ERROR] while compiling: /Users/dev/tellapart_spark/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala
during phase: typer
library version: version 2.10.4
compiler version: version 2.10.4

Re: getting exception when trying to build spark from master

2014-11-10 Thread Sadhan Sood
are broken, too. Based on the Jenkins logs, I think that this pull request may have broken things (although I'm not sure why): https://github.com/apache/spark/pull/3030#issuecomment-62436181 On Mon, Nov 10, 2014 at 1:42 PM, Sadhan Sood sadhan.s...@gmail.com wrote: Getting

thrift jdbc server probably running queries as hive query

2014-11-10 Thread Sadhan Sood
I was testing the Spark Thrift JDBC server by running a simple query in the Beeline client. Spark itself is running on a YARN cluster. However, when I run a query in Beeline, I see no running jobs in the Spark UI (it is completely empty), and the YARN UI seems to indicate that the submitted query

Sharing spark context across multiple spark sql cli initializations

2014-10-22 Thread Sadhan Sood
We want to run multiple instances of the Spark SQL CLI on our YARN cluster, each used by a different user. This would be non-optimal if each user brings up a separate CLI, given how Spark works on YARN by running executor processes (and hence consuming resources) on worker

Fwd: Sharing spark context across multiple spark sql cli initializations

2014-10-22 Thread Sadhan Sood
We want to run multiple instances of the Spark SQL CLI on our YARN cluster, each used by a different user. This looks non-optimal if each user brings up a separate CLI, given how Spark works on YARN by running executor processes (and hence consuming resources) on worker