Re: Spark cluster multi tenancy

2015-08-26 Thread Jerrick Hoang
Would be interested to know the answer too.

On Wed, Aug 26, 2015 at 11:45 AM, Sadhan Sood  wrote:

> Interestingly, if there is nothing running on dev spark-shell, it recovers
> successfully and regains the lost executors. Attaching the log for that.
> Notice the "Registering block manager .." statements at the very end, after
> all executors were lost.
>
> On Wed, Aug 26, 2015 at 11:27 AM, Sadhan Sood 
> wrote:
>
>> Attaching log for when the dev job gets stuck (once all its executors are
>> lost due to preemption). This is a spark-shell job running in yarn-client
>> mode.
>>
>> On Wed, Aug 26, 2015 at 10:45 AM, Sadhan Sood 
>> wrote:
>>
>>> Hi All,
>>>
>>> We've set up our Spark cluster on AWS running on YARN (Hadoop 2.3) with fair
>>> scheduling and preemption turned on. The cluster is shared between prod and dev
>>> work, where prod runs with a higher fair share and can preempt dev jobs if there
>>> are not enough resources available for it.
>>> It appears that dev jobs which get preempted often become unstable after losing
>>> some executors: the whole job gets stuck (without making any progress) or ends
>>> up getting restarted (and hence loses all the work done). Has someone encountered
>>> this before? Is the solution just to set spark.task.maxFailures to a really high
>>> value to recover from task failures in such scenarios? Are there other approaches
>>> to Spark multi-tenancy that work better in this scenario?
>>>
>>> Thanks,
>>> Sadhan
>>>
>>
>>
>
>
>
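For reference, a minimal sketch of the failure-tolerance knobs mentioned above for an
application that is expected to lose executors to preemption. The property names exist in
Spark 1.x (the second one is YARN-specific), but the values are illustrative assumptions,
not tested recommendations.

```scala
// Sketch only: raise failure tolerances so a dev application has a chance to survive
// executors killed by YARN preemption. Values below are arbitrary.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("dev-adhoc-shell")
  .set("spark.task.maxFailures", "32")            // retry tasks lost with preempted executors
  .set("spark.yarn.max.executor.failures", "64")  // don't abort the app after repeated executor loss

val sc = new SparkContext(conf)
```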


Re: Spark Sql behaves strangely with tables with a lot of partitions

2015-08-24 Thread Jerrick Hoang
@Michael: would listStatus calls read the actual parquet footers within the
folders?

On Mon, Aug 24, 2015 at 11:36 AM, Sereday, Scott 
wrote:

> Can you please remove me from this distribution list?
>
>
>
> (Filling up my inbox too fast)
>
>
>
> *From:* Michael Armbrust [mailto:mich...@databricks.com]
> *Sent:* Monday, August 24, 2015 2:13 PM
> *To:* Philip Weaver 
> *Cc:* Jerrick Hoang ; Raghavendra Pandey <
> raghavendra.pan...@gmail.com>; User ; Cheng, Hao <
> hao.ch...@intel.com>
>
> *Subject:* Re: Spark Sql behaves strangely with tables with a lot of
> partitions
>
>
>
> I think we are mostly bottlenecked at this point by how fast we can make
> listStatus calls to discover the folders.  That said, we are happy to
> accept suggestions or PRs to make this faster.  Perhaps you can describe
> how your home grown partitioning works?
>
>
>
> On Sun, Aug 23, 2015 at 7:38 PM, Philip Weaver 
> wrote:
>
> 1 minute to discover 1000s of partitions -- yes, that is what I have
> observed. And I would assert that is very slow.
>
>
>
> On Sun, Aug 23, 2015 at 7:16 PM, Michael Armbrust 
> wrote:
>
> We should not be actually scanning all of the data of all of the
> partitions, but we do need to at least list all of the available
> directories so that we can apply your predicates to the actual values that
> are present when we are deciding which files need to be read in a given
> spark job.  While this is a somewhat expensive operation, we do it in
> parallel and we cache this information when you access the same relation
> more than once.
>
>
>
> Can you provide a little more detail about how exactly you are accessing
> the parquet data (are you using sqlContext.read or creating persistent
> tables in the metastore?), and how long it is taking?  It would also be
> good to know how many partitions we are talking about and how much data is
> in each.  Finally, I'd like to see the stacktrace where it is hanging to
> make sure my above assertions are correct.
>
>
>
> We have several tables internally that have 1000s of partitions and while
> it takes ~1 minute initially to discover the metadata, after that we are
> able to query the data interactively.
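For concreteness, a small sketch of the ad-hoc access path Michael asks about; the S3 path
is an assumption, and the partition columns are taken from later messages in this thread.
read.parquet discovers the partition directories once, caches that layout for the relation,
and exposes the partition columns for filtering.

```scala
// Sketch only: partition columns discovered from the directory layout
// (date_prefix=.../hour=...) become ordinary columns of the DataFrame.
val t = sqlContext.read.parquet("s3n://some-bucket/test_table")
t.registerTempTable("test_table_adhoc")

// Discovery results are cached for this relation, so repeated queries don't re-list S3.
sqlContext.sql(
  "select count(*) from test_table_adhoc where date_prefix = '20150819' and hour = '00'").show()
```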
>
>
>
>
>
>
>
> On Sun, Aug 23, 2015 at 2:00 AM, Jerrick Hoang 
> wrote:
>
> Does anybody have any suggestions?
>
>
>
> On Fri, Aug 21, 2015 at 3:14 PM, Jerrick Hoang 
> wrote:
>
> Is there a workaround without updating Hadoop? I would really appreciate it if
> someone could explain what Spark is trying to do here and whether there is an
> easy way to turn this off. Thanks all!
>
>
>
> On Fri, Aug 21, 2015 at 11:09 AM, Raghavendra Pandey <
> raghavendra.pan...@gmail.com> wrote:
>
> Did you try with Hadoop version 2.7.1? It is known that s3a, which is available
> in 2.7, works really well with Parquet. They fixed a lot of issues related to
> metadata reading there...
>
> On Aug 21, 2015 11:24 PM, "Jerrick Hoang"  wrote:
>
> @Cheng, Hao : Physical plans show that it got stuck on scanning S3!
>
>
>
> (table is partitioned by date_prefix and hour)
>
> explain select count(*) from test_table where date_prefix='20150819' and
> hour='00';
>
>
>
> TungstenAggregate(key=[], value=[(count(1),mode=Final,isDistinct=false)]
>
>  TungstenExchange SinglePartition
>
>   TungstenAggregate(key=[],
> value=[(count(1),mode=Partial,isDistinct=false)]
>
>Scan ParquetRelation[ ..  ]
>
>
>
> Why does Spark have to scan all partitions when the query only concerns one
> partition? Doesn't that defeat the purpose of partitioning?
>
>
>
> Thanks!
>
>
>
> On Thu, Aug 20, 2015 at 4:12 PM, Philip Weaver 
> wrote:
>
> I hadn't heard of spark.sql.sources.partitionDiscovery.enabled before, and
> I couldn't find much information about it online. What does it mean exactly
> to disable it? Are there any negative consequences to disabling it?
>
>
>
> On Wed, Aug 19, 2015 at 10:53 PM, Cheng, Hao  wrote:
>
> Can you do some more profiling? I am wondering if the driver is busy
> scanning HDFS / S3.
>
> Like jstack 
>
>
>
> Also, it would be great if you could paste the physical plan for the
> simple query.
>
>
>
> *From:* Jerrick Hoang [mailto:jerrickho...@gmail.com]
> *Sent:* Thursday, August 20, 2015 1:46 PM
> *To:* Cheng, Hao
> *Cc:* Philip Weaver; user
> *Subject:* Re: Spark Sql behaves strangely with tables with a lot of
> partitions
>
>
>
> I cloned from TOT after the 1.5.0 cut-off. I noticed there were a couple of
> CLs trying to speed up Spark SQL with tables that have a huge number of
> partitions; I've made sure those CLs are included, but it's still very slow.
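Pulling together the two workarounds discussed in this thread, a rough sketch; the paths and
partition values are assumptions.

```scala
// Sketch only.
// 1) Turn off automatic partition discovery for file-based data sources
//    (the flag discussed later in the thread).
sqlContext.setConf("spark.sql.sources.partitionDiscovery.enabled", "false")

// 2) Or side-step discovery entirely by pointing the reader at just the partition
//    directory the query needs.
val onePartition = sqlContext.read.parquet(
  "s3n://some-bucket/test_table/date_prefix=20150819/hour=00")
println(onePartition.count())
```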

Re: Spark Sql behaves strangely with tables with a lot of partitions

2015-08-23 Thread Jerrick Hoang
Does anybody have any suggestions?

On Fri, Aug 21, 2015 at 3:14 PM, Jerrick Hoang 
wrote:

> Is there a workaround without updating Hadoop? I would really appreciate it if
> someone could explain what Spark is trying to do here and whether there is an
> easy way to turn this off. Thanks all!
>
> On Fri, Aug 21, 2015 at 11:09 AM, Raghavendra Pandey <
> raghavendra.pan...@gmail.com> wrote:
>
>> Did you try with Hadoop version 2.7.1? It is known that s3a, which is
>> available in 2.7, works really well with Parquet. They fixed a lot of
>> issues related to metadata reading there...
>> On Aug 21, 2015 11:24 PM, "Jerrick Hoang"  wrote:
>>
>>> @Cheng, Hao : Physical plans show that it got stuck on scanning S3!
>>>
>>> (table is partitioned by date_prefix and hour)
>>> explain select count(*) from test_table where date_prefix='20150819' and
>>> hour='00';
>>>
>>> TungstenAggregate(key=[], value=[(count(1),mode=Final,isDistinct=false)]
>>>  TungstenExchange SinglePartition
>>>   TungstenAggregate(key=[],
>>> value=[(count(1),mode=Partial,isDistinct=false)]
>>>Scan ParquetRelation[ ..  ]
>>>
>>> Why does Spark have to scan all partitions when the query only concerns one
>>> partition? Doesn't that defeat the purpose of partitioning?
>>>
>>> Thanks!
>>>
>>> On Thu, Aug 20, 2015 at 4:12 PM, Philip Weaver 
>>> wrote:
>>>
>>>> I hadn't heard of spark.sql.sources.partitionDiscovery.enabled before,
>>>> and I couldn't find much information about it online. What does it mean
>>>> exactly to disable it? Are there any negative consequences to disabling it?
>>>>
>>>> On Wed, Aug 19, 2015 at 10:53 PM, Cheng, Hao 
>>>> wrote:
>>>>
>>>>> Can you do some more profiling? I am wondering if the driver is busy
>>>>> scanning HDFS / S3.
>>>>>
>>>>> Like jstack 
>>>>>
>>>>>
>>>>>
>>>>> Also, it would be great if you could paste the physical plan for
>>>>> the simple query.
>>>>>
>>>>>
>>>>>
>>>>> *From:* Jerrick Hoang [mailto:jerrickho...@gmail.com]
>>>>> *Sent:* Thursday, August 20, 2015 1:46 PM
>>>>> *To:* Cheng, Hao
>>>>> *Cc:* Philip Weaver; user
>>>>> *Subject:* Re: Spark Sql behaves strangely with tables with a lot of
>>>>> partitions
>>>>>
>>>>>
>>>>>
>>>>> I cloned from TOT after the 1.5.0 cut-off. I noticed there were a couple
>>>>> of CLs trying to speed up Spark SQL with tables that have a huge number of
>>>>> partitions; I've made sure those CLs are included, but it's still very
>>>>> slow.
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Aug 19, 2015 at 10:43 PM, Cheng, Hao 
>>>>> wrote:
>>>>>
>>>>> Yes, you can try setting spark.sql.sources.partitionDiscovery.enabled
>>>>> to false.
>>>>>
>>>>>
>>>>>
>>>>> BTW, which version are you using?
>>>>>
>>>>>
>>>>>
>>>>> Hao
>>>>>
>>>>>
>>>>>
>>>>> *From:* Jerrick Hoang [mailto:jerrickho...@gmail.com]
>>>>> *Sent:* Thursday, August 20, 2015 12:16 PM
>>>>> *To:* Philip Weaver
>>>>> *Cc:* user
>>>>> *Subject:* Re: Spark Sql behaves strangely with tables with a lot of
>>>>> partitions
>>>>>
>>>>>
>>>>>
>>>>> I guess the question is why does spark have to do partition discovery
>>>>> with all partitions when the query only needs to look at one partition? Is
>>>>> there a conf flag to turn this off?
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Aug 19, 2015 at 9:02 PM, Philip Weaver <
>>>>> philip.wea...@gmail.com> wrote:
>>>>>
>>>>> I've had the same problem. It turns out that Spark (specifically
>>>>> parquet) is very slow at partition discovery. It got better in 1.5 (not 
>>>>> yet
>>>>> released), but was still unacceptably slow. Sadly, we ended up reading
>>>>> parquet files manually in Python (via C++) and had to abandon Spark SQL
>>>>> because of this problem.
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Aug 19, 2015 at 7:51 PM, Jerrick Hoang 
>>>>> wrote:
>>>>>
>>>>> Hi all,
>>>>>
>>>>>
>>>>>
>>>>> I did a simple experiment with Spark SQL. I created a partitioned
>>>>> parquet table with only one partition (date=20140701). A simple `select
>>>>> count(*) from table where date=20140701` would run very fast (0.1 
>>>>> seconds).
>>>>> However, as I added more partitions the query takes longer and longer. 
>>>>> When
>>>>> I added about 10,000 partitions, the query took way too long. I feel like
>>>>> querying for a single partition should not be affected by having more
>>>>> partitions. Is this a known behaviour? What does spark try to do here?
>>>>>
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Jerrick
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>


Re: Spark Sql behaves strangely with tables with a lot of partitions

2015-08-21 Thread Jerrick Hoang
Is there a workaround without updating Hadoop? I would really appreciate it if
someone could explain what Spark is trying to do here and whether there is an
easy way to turn this off. Thanks all!

On Fri, Aug 21, 2015 at 11:09 AM, Raghavendra Pandey <
raghavendra.pan...@gmail.com> wrote:

> Did you try with Hadoop version 2.7.1? It is known that s3a, which is available
> in 2.7, works really well with Parquet. They fixed a lot of issues related to
> metadata reading there...
> On Aug 21, 2015 11:24 PM, "Jerrick Hoang"  wrote:
>
>> @Cheng, Hao : Physical plans show that it got stuck on scanning S3!
>>
>> (table is partitioned by date_prefix and hour)
>> explain select count(*) from test_table where date_prefix='20150819' and
>> hour='00';
>>
>> TungstenAggregate(key=[], value=[(count(1),mode=Final,isDistinct=false)]
>>  TungstenExchange SinglePartition
>>   TungstenAggregate(key=[],
>> value=[(count(1),mode=Partial,isDistinct=false)]
>>Scan ParquetRelation[ ..  ]
>>
>> Why does Spark have to scan all partitions when the query only concerns one
>> partition? Doesn't that defeat the purpose of partitioning?
>>
>> Thanks!
>>
>> On Thu, Aug 20, 2015 at 4:12 PM, Philip Weaver 
>> wrote:
>>
>>> I hadn't heard of spark.sql.sources.partitionDiscovery.enabled before,
>>> and I couldn't find much information about it online. What does it mean
>>> exactly to disable it? Are there any negative consequences to disabling it?
>>>
>>> On Wed, Aug 19, 2015 at 10:53 PM, Cheng, Hao 
>>> wrote:
>>>
>>>> Can you do some more profiling? I am wondering if the driver is busy
>>>> scanning HDFS / S3.
>>>>
>>>> Like jstack 
>>>>
>>>>
>>>>
>>>> Also, it would be great if you could paste the physical plan for the
>>>> simple query.
>>>>
>>>>
>>>>
>>>> *From:* Jerrick Hoang [mailto:jerrickho...@gmail.com]
>>>> *Sent:* Thursday, August 20, 2015 1:46 PM
>>>> *To:* Cheng, Hao
>>>> *Cc:* Philip Weaver; user
>>>> *Subject:* Re: Spark Sql behaves strangely with tables with a lot of
>>>> partitions
>>>>
>>>>
>>>>
>>>> I cloned from TOT after the 1.5.0 cut-off. I noticed there were a couple of
>>>> CLs trying to speed up Spark SQL with tables that have a huge number of
>>>> partitions; I've made sure those CLs are included, but it's still very
>>>> slow.
>>>>
>>>>
>>>>
>>>> On Wed, Aug 19, 2015 at 10:43 PM, Cheng, Hao 
>>>> wrote:
>>>>
>>>> Yes, you can try setting spark.sql.sources.partitionDiscovery.enabled
>>>> to false.
>>>>
>>>>
>>>>
>>>> BTW, which version are you using?
>>>>
>>>>
>>>>
>>>> Hao
>>>>
>>>>
>>>>
>>>> *From:* Jerrick Hoang [mailto:jerrickho...@gmail.com]
>>>> *Sent:* Thursday, August 20, 2015 12:16 PM
>>>> *To:* Philip Weaver
>>>> *Cc:* user
>>>> *Subject:* Re: Spark Sql behaves strangely with tables with a lot of
>>>> partitions
>>>>
>>>>
>>>>
>>>> I guess the question is why does spark have to do partition discovery
>>>> with all partitions when the query only needs to look at one partition? Is
>>>> there a conf flag to turn this off?
>>>>
>>>>
>>>>
>>>> On Wed, Aug 19, 2015 at 9:02 PM, Philip Weaver 
>>>> wrote:
>>>>
>>>> I've had the same problem. It turns out that Spark (specifically
>>>> parquet) is very slow at partition discovery. It got better in 1.5 (not yet
>>>> released), but was still unacceptably slow. Sadly, we ended up reading
>>>> parquet files manually in Python (via C++) and had to abandon Spark SQL
>>>> because of this problem.
>>>>
>>>>
>>>>
>>>> On Wed, Aug 19, 2015 at 7:51 PM, Jerrick Hoang 
>>>> wrote:
>>>>
>>>> Hi all,
>>>>
>>>>
>>>>
>>>> I did a simple experiment with Spark SQL. I created a partitioned
>>>> parquet table with only one partition (date=20140701). A simple `select
>>>> count(*) from table where date=20140701` would run very fast (0.1 seconds).
>>>> However, as I added more partitions the query takes longer and longer. When
>>>> I added about 10,000 partitions, the query took way too long. I feel like
>>>> querying for a single partition should not be affected by having more
>>>> partitions. Is this a known behaviour? What does spark try to do here?
>>>>
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Jerrick
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>


Re: Spark Sql behaves strangely with tables with a lot of partitions

2015-08-21 Thread Jerrick Hoang
@Cheng, Hao : Physical plans show that it got stuck on scanning S3!

(table is partitioned by date_prefix and hour)
explain select count(*) from test_table where date_prefix='20150819' and
hour='00';

TungstenAggregate(key=[], value=[(count(1),mode=Final,isDistinct=false)]
 TungstenExchange SinglePartition
  TungstenAggregate(key=[], value=[(count(1),mode=Partial,isDistinct=false)]
   Scan ParquetRelation[ ..  ]

Why does Spark have to scan all partitions when the query only concerns one
partition? Doesn't that defeat the purpose of partitioning?

Thanks!

On Thu, Aug 20, 2015 at 4:12 PM, Philip Weaver 
wrote:

> I hadn't heard of spark.sql.sources.partitionDiscovery.enabled before, and
> I couldn't find much information about it online. What does it mean exactly
> to disable it? Are there any negative consequences to disabling it?
>
> On Wed, Aug 19, 2015 at 10:53 PM, Cheng, Hao  wrote:
>
>> Can you do some more profiling? I am wondering if the driver is busy
>> scanning HDFS / S3.
>>
>> Like jstack 
>>
>>
>>
>> Also, it would be great if you could paste the physical plan for the
>> simple query.
>>
>>
>>
>> *From:* Jerrick Hoang [mailto:jerrickho...@gmail.com]
>> *Sent:* Thursday, August 20, 2015 1:46 PM
>> *To:* Cheng, Hao
>> *Cc:* Philip Weaver; user
>> *Subject:* Re: Spark Sql behaves strangely with tables with a lot of
>> partitions
>>
>>
>>
>> I cloned from TOT after the 1.5.0 cut-off. I noticed there were a couple of
>> CLs trying to speed up Spark SQL with tables that have a huge number of
>> partitions; I've made sure those CLs are included, but it's still very
>> slow.
>>
>>
>>
>> On Wed, Aug 19, 2015 at 10:43 PM, Cheng, Hao  wrote:
>>
>> Yes, you can try setting spark.sql.sources.partitionDiscovery.enabled to
>> false.
>>
>>
>>
>> BTW, which version are you using?
>>
>>
>>
>> Hao
>>
>>
>>
>> *From:* Jerrick Hoang [mailto:jerrickho...@gmail.com]
>> *Sent:* Thursday, August 20, 2015 12:16 PM
>> *To:* Philip Weaver
>> *Cc:* user
>> *Subject:* Re: Spark Sql behaves strangely with tables with a lot of
>> partitions
>>
>>
>>
>> I guess the question is why does spark have to do partition discovery
>> with all partitions when the query only needs to look at one partition? Is
>> there a conf flag to turn this off?
>>
>>
>>
>> On Wed, Aug 19, 2015 at 9:02 PM, Philip Weaver 
>> wrote:
>>
>> I've had the same problem. It turns out that Spark (specifically parquet)
>> is very slow at partition discovery. It got better in 1.5 (not yet
>> released), but was still unacceptably slow. Sadly, we ended up reading
>> parquet files manually in Python (via C++) and had to abandon Spark SQL
>> because of this problem.
>>
>>
>>
>> On Wed, Aug 19, 2015 at 7:51 PM, Jerrick Hoang 
>> wrote:
>>
>> Hi all,
>>
>>
>>
>> I did a simple experiment with Spark SQL. I created a partitioned parquet
>> table with only one partition (date=20140701). A simple `select count(*)
>> from table where date=20140701` would run very fast (0.1 seconds). However,
>> as I added more partitions the query takes longer and longer. When I added
>> about 10,000 partitions, the query took way too long. I feel like querying
>> for a single partition should not be affected by having more partitions. Is
>> this a known behaviour? What does spark try to do here?
>>
>>
>>
>> Thanks,
>>
>> Jerrick
>>
>>
>>
>>
>>
>>
>>
>
>


Re: Spark Sql behaves strangely with tables with a lot of partitions

2015-08-19 Thread Jerrick Hoang
I cloned from TOT after the 1.5.0 cut-off. I noticed there were a couple of CLs
trying to speed up Spark SQL with tables that have a huge number of partitions;
I've made sure those CLs are included, but it's still very slow.

On Wed, Aug 19, 2015 at 10:43 PM, Cheng, Hao  wrote:

> Yes, you can try setting spark.sql.sources.partitionDiscovery.enabled to
> false.
>
>
>
> BTW, which version are you using?
>
>
>
> Hao
>
>
>
> *From:* Jerrick Hoang [mailto:jerrickho...@gmail.com]
> *Sent:* Thursday, August 20, 2015 12:16 PM
> *To:* Philip Weaver
> *Cc:* user
> *Subject:* Re: Spark Sql behaves strangely with tables with a lot of
> partitions
>
>
>
> I guess the question is why does spark have to do partition discovery with
> all partitions when the query only needs to look at one partition? Is there
> a conf flag to turn this off?
>
>
>
> On Wed, Aug 19, 2015 at 9:02 PM, Philip Weaver 
> wrote:
>
> I've had the same problem. It turns out that Spark (specifically parquet)
> is very slow at partition discovery. It got better in 1.5 (not yet
> released), but was still unacceptably slow. Sadly, we ended up reading
> parquet files manually in Python (via C++) and had to abandon Spark SQL
> because of this problem.
>
>
>
> On Wed, Aug 19, 2015 at 7:51 PM, Jerrick Hoang 
> wrote:
>
> Hi all,
>
>
>
> I did a simple experiment with Spark SQL. I created a partitioned parquet
> table with only one partition (date=20140701). A simple `select count(*)
> from table where date=20140701` would run very fast (0.1 seconds). However,
> as I added more partitions the query takes longer and longer. When I added
> about 10,000 partitions, the query took way too long. I feel like querying
> for a single partition should not be affected by having more partitions. Is
> this a known behaviour? What does spark try to do here?
>
>
>
> Thanks,
>
> Jerrick
>
>
>
>
>


Re: Spark Sql behaves strangely with tables with a lot of partitions

2015-08-19 Thread Jerrick Hoang
I guess the question is why does spark have to do partition discovery with
all partitions when the query only needs to look at one partition? Is there
a conf flag to turn this off?

On Wed, Aug 19, 2015 at 9:02 PM, Philip Weaver 
wrote:

> I've had the same problem. It turns out that Spark (specifically parquet)
> is very slow at partition discovery. It got better in 1.5 (not yet
> released), but was still unacceptably slow. Sadly, we ended up reading
> parquet files manually in Python (via C++) and had to abandon Spark SQL
> because of this problem.
>
> On Wed, Aug 19, 2015 at 7:51 PM, Jerrick Hoang 
> wrote:
>
>> Hi all,
>>
>> I did a simple experiment with Spark SQL. I created a partitioned parquet
>> table with only one partition (date=20140701). A simple `select count(*)
>> from table where date=20140701` would run very fast (0.1 seconds). However,
>> as I added more partitions the query takes longer and longer. When I added
>> about 10,000 partitions, the query took way too long. I feel like querying
>> for a single partition should not be affected by having more partitions. Is
>> this a known behaviour? What does spark try to do here?
>>
>> Thanks,
>> Jerrick
>>
>
>


Spark Sql behaves strangely with tables with a lot of partitions

2015-08-19 Thread Jerrick Hoang
Hi all,

I did a simple experiment with Spark SQL. I created a partitioned parquet
table with only one partition (date=20140701). A simple `select count(*)
from table where date=20140701` would run very fast (0.1 seconds). However,
as I added more partitions the query takes longer and longer. When I added
about 10,000 partitions, the query took way too long. I feel like querying
for a single partition should not be affected by having more partitions. Is
this a known behaviour? What does spark try to do here?

Thanks,
Jerrick
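A minimal sketch (paths and sizes are made-up assumptions) that reproduces the experiment
described above, for anyone who wants to measure the effect of the partition count themselves:

```scala
// Sketch only: write one tiny Parquet partition per distinct "date" value, then time a
// single-partition count as the number of partitions grows.
import sqlContext.implicits._
import org.apache.spark.sql.functions._

val base = "hdfs:///tmp/partition_count_experiment"    // illustrative path
sc.parallelize(1 to 5000).toDF("id")
  .withColumn("date", lit(20140701) + col("id"))       // one distinct value per row => 5000 partitions
  .write.partitionBy("date").parquet(base)

val df = sqlContext.read.parquet(base)                 // partition discovery happens for this relation
df.registerTempTable("t")

val start = System.nanoTime()
sqlContext.sql("select count(*) from t where date = 20140702").show()
println(s"single-partition count took ${(System.nanoTime() - start) / 1e9} s")
```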


Refresh table

2015-08-10 Thread Jerrick Hoang
Hi all,

I'm a little confused about how refresh table (SPARK-5833) should work. So
I did the following,

val df1 = sc.makeRDD(1 to 5).map(i => (i, i * 2)).toDF("single", "double")

df1.write.parquet("hdfs:///test_table/key=1")


Then I created an external table by doing,

CREATE EXTERNAL TABLE `tmp_table` (
`single` int,
`double` int)
PARTITIONED BY (
  `key` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  'hdfs:///test_table/'

Then I added the partition to the table by `alter table tmp_table add
partition (key=1) location 'hdfs://..`

Then I added a new partition with different schema by,

val df2 = sc.makeRDD(1 to 5).map(i => (i, i * 3)).toDF("single", "triple")

df2.write.parquet("hdfs:///test_table/key=2")


And added the new partition to the table by `alter table ..`,

But when I did `refresh table tmp_table` and `describe table` it couldn't
pick up the new column `triple`. Can someone explain to me how partition
discovery and schema merging of refresh table should work?

Thanks
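Not an authoritative answer to the question above, but one way to see the merged schema is
to read the table directory itself with Parquet schema merging turned on. The `mergeSchema`
option assumes Spark 1.5+ (1.3/1.4 merge schemas by default); `describe` in the CLI will
still only reflect what the Hive metastore knows.

```scala
// Sketch only: read the table path directly so the `triple` column written under key=2
// shows up even though the Hive table definition above does not know about it.
val merged = sqlContext.read
  .option("mergeSchema", "true")   // Spark 1.5+ option; earlier versions merge automatically
  .parquet("hdfs:///test_table")

merged.printSchema()               // expect: single, double, triple, key
```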


Re: Spark failed while trying to read parquet files

2015-08-07 Thread Jerrick Hoang
Yes! I was being dumb, should have caught that earlier, thank you Cheng Lian

On Fri, Aug 7, 2015 at 4:25 PM, Cheng Lian  wrote:

> It doesn't seem to be Parquet 1.7.0 since the package name isn't under
> "org.apache.parquet" (1.7.0 is the first official Apache release of
> Parquet). The version you were using is probably Parquet 1.6.0rc3 according
> to the line number information:
> https://github.com/apache/parquet-mr/blob/parquet-1.6.0rc3/parquet-hadoop/src/main/java/parquet/format/converter/ParquetMetadataConverter.java#L249
> And you're hitting PARQUET-136, which has been fixed in (the real) Parquet
> 1.7.0 https://issues.apache.org/jira/browse/PARQUET-136
>
> Cheng
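As a quick way to confirm Cheng Lian's diagnosis (i.e. which jar the pre-1.7 `parquet.*`
classes are actually loaded from), a small sketch that can be pasted into the spark-shell:

```scala
// Sketch only: print the location of the jar providing the ParquetMetadataConverter seen in
// the stack trace. Throws ClassNotFoundException if no pre-1.7 Parquet is on the classpath.
val clazz = Class.forName("parquet.format.converter.ParquetMetadataConverter")
println(clazz.getProtectionDomain.getCodeSource.getLocation)
```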
>
>
> On 8/8/15 6:20 AM, Jerrick Hoang wrote:
>
> Hi all,
>
> I have a partitioned parquet table (a very small table with only 2
> partitions). The version of Spark is 1.4.1 and the Parquet version is 1.7.0. I
> applied this patch to Spark [SPARK-7743], so I assume that Spark can read
> parquet files normally; however, I'm getting this when trying to do a
> simple `select count(*) from table`:
>
> ```org.apache.spark.SparkException: Job aborted due to stage failure: Task
> 29 in stage 44.0 failed 15 times, most recent failure: Lost task 29.14 in
> stage 44.0: java.lang.NullPointerException
> at
> parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:249)
> at
> parquet.format.converter.ParquetMetadataConverter.fromParquetMetadata(ParquetMetadataConverter.java:543)
> at
> parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:520)
> at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:426)
> at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:381)
> at
> parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:155)
> at
> parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:138)
> at
> org.apache.spark.sql.sources.SqlNewHadoopRDD$$anon$1.<init>(SqlNewHadoopRDD.scala:153)
> at
> org.apache.spark.sql.sources.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:124)
> at
> org.apache.spark.sql.sources.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:66)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:70)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)```
>
> Has anybody seen this before?
>
> Thanks
>
>
>


Spark failed while trying to read parquet files

2015-08-07 Thread Jerrick Hoang
Hi all,

I have a partitioned parquet table (a very small table with only 2
partitions). The version of Spark is 1.4.1 and the Parquet version is 1.7.0. I
applied this patch to Spark [SPARK-7743], so I assume that Spark can read
parquet files normally; however, I'm getting this when trying to do a
simple `select count(*) from table`:

```org.apache.spark.SparkException: Job aborted due to stage failure: Task
29 in stage 44.0 failed 15 times, most recent failure: Lost task 29.14 in
stage 44.0: java.lang.NullPointerException
at
parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:249)
at
parquet.format.converter.ParquetMetadataConverter.fromParquetMetadata(ParquetMetadataConverter.java:543)
at
parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:520)
at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:426)
at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:381)
at
parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:155)
at
parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:138)
at
org.apache.spark.sql.sources.SqlNewHadoopRDD$$anon$1.<init>(SqlNewHadoopRDD.scala:153)
at
org.apache.spark.sql.sources.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:124)
at
org.apache.spark.sql.sources.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:66)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)```

Has anybody seen this before?

Thanks


Re: Spark is much slower than direct access MySQL

2015-07-26 Thread Jerrick Hoang
How big is the dataset? How complicated is the query?

On Sun, Jul 26, 2015 at 12:47 AM Louis Hust  wrote:

> Hi, all,
>
> I am using a Spark DataFrame to fetch a small table from MySQL,
> and I found it costs much more than accessing MySQL directly using JDBC.
>
> The time cost for Spark is about 2033 ms, while direct access takes about 16 ms.
>
> Code can be found at:
>
>
> https://github.com/louishust/sparkDemo/blob/master/src/main/java/DirectQueryTest.java
>
> So is my configuration for Spark wrong? How can I optimise Spark to
> achieve performance similar to direct access?
>
> Any idea will be appreciated!
>
>
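For reference, a minimal sketch of the DataFrame JDBC read being compared here; the URL,
table name, credentials and driver class are placeholders. Much of a ~2 s gap on a tiny
table is typically fixed per-query planning and job-scheduling overhead rather than the
fetch itself, so a single small lookup is close to a worst case for Spark versus a direct
JDBC call.

```scala
// Sketch only: all connection details are made up.
val df = sqlContext.read.format("jdbc").options(Map(
  "url"     -> "jdbc:mysql://db-host:3306/testdb?user=test&password=test",
  "dbtable" -> "small_table",
  "driver"  -> "com.mysql.jdbc.Driver"
)).load()

df.show()   // triggers the actual Spark job; this is roughly what the 2033 ms above measures
```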


Re: Spark-hive parquet schema evolution

2015-07-21 Thread Jerrick Hoang
Hi Lian,

Sorry, I'm new to Spark so I did not express myself very clearly. I'm
concerned about the situation where, say, I have a Parquet table with some
partitions, and I add a new column A to the Parquet schema and write some data
with the new schema to a new partition in the table. If I'm not mistaken,
if I do a sqlContext.read.parquet(table_path).printSchema() it will print
the correct schema with the new column A. But if I do a 'describe table' from
the SparkSQL CLI I won't see the new column being added. I understand that this
is because Hive doesn't support schema evolution. So what is the best way
to support CLI queries in this situation? Do I need to manually alter the
table every time the underlying schema changes?

Thanks

On Tue, Jul 21, 2015 at 4:37 PM, Cheng Lian  wrote:

>  Hey Jerrick,
>
> What do you mean by "schema evolution with Hive metastore tables"? Hive
> doesn't take schema evolution into account. Could you please give a
> concrete use case? Are you trying to write Parquet data with extra columns
> into an existing metastore Parquet table?
>
> Cheng
>
>
> On 7/21/15 1:04 AM, Jerrick Hoang wrote:
>
> I'm new to Spark, any ideas would be much appreciated! Thanks
>
> On Sat, Jul 18, 2015 at 11:11 AM, Jerrick Hoang 
> wrote:
>
>> Hi all,
>>
>>  I'm aware of the support for schema evolution via DataFrame API. Just
>> wondering what would be the best way to go about dealing with schema
>> evolution with Hive metastore tables. So, say I create a table via SparkSQL
>> CLI, how would I deal with Parquet schema evolution?
>>
>>  Thanks,
>> J
>>
>
>
>
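Not a definitive answer to the scenario described above, but one manual approach is to update
the metastore definition yourself after writing data with the new column. The table and column
names below are placeholders, and whether `ALTER TABLE ... ADD COLUMNS` is passed through to
Hive depends on the Spark build, so treat this strictly as a sketch.

```scala
// Sketch only, from a HiveContext-backed session (e.g. the SparkSQL CLI).
sqlContext.sql("ALTER TABLE some_parquet_table ADD COLUMNS (A int)")   // assumes Hive pass-through DDL

// REFRESH TABLE (SPARK-5833) then invalidates any cached metadata for the table.
sqlContext.sql("REFRESH TABLE some_parquet_table")
sqlContext.sql("DESCRIBE some_parquet_table").show()
```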


Re: Spark-hive parquet schema evolution

2015-07-20 Thread Jerrick Hoang
I'm new to Spark, any ideas would be much appreciated! Thanks

On Sat, Jul 18, 2015 at 11:11 AM, Jerrick Hoang 
wrote:

> Hi all,
>
> I'm aware of the support for schema evolution via DataFrame API. Just
> wondering what would be the best way to go about dealing with schema
> evolution with Hive metastore tables. So, say I create a table via SparkSQL
> CLI, how would I deal with Parquet schema evolution?
>
> Thanks,
> J
>


Spark-hive parquet schema evolution

2015-07-18 Thread Jerrick Hoang
Hi all,

I'm aware of the support for schema evolution via DataFrame API. Just
wondering what would be the best way to go about dealing with schema
evolution with Hive metastore tables. So, say I create a table via SparkSQL
CLI, how would I deal with Parquet schema evolution?

Thanks,
J


Re: Getting not implemented by the TFS FileSystem implementation

2015-07-16 Thread Jerrick Hoang
So, this has to do with the fact that 1.4 has a new way of interacting with the
Hive metastore; still investigating. Would really appreciate it if anybody has
any insights :)

On Tue, Jul 14, 2015 at 4:28 PM, Jerrick Hoang 
wrote:

> Hi all,
>
> I'm upgrading from Spark 1.3 to Spark 1.4, and when trying to run the spark-sql
> CLI it gave a ```java.lang.UnsupportedOperationException: Not implemented
> by the TFS FileSystem implementation``` exception. I did not get this error
> with 1.3 and I don't use any TFS FileSystem. The full stack trace is:
>
> ```Exception in thread "main" java.lang.RuntimeException:
> java.lang.UnsupportedOperationException: Not implemented by the TFS
> FileSystem implementation
> at
> org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:346)
> at
> org.apache.spark.sql.hive.client.ClientWrapper.<init>(ClientWrapper.scala:105)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:408)
> at
> org.apache.spark.sql.hive.client.IsolatedClientLoader.liftedTree1$1(IsolatedClientLoader.scala:170)
> at
> org.apache.spark.sql.hive.client.IsolatedClientLoader.<init>(IsolatedClientLoader.scala:166)
> at
> org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:212)
> at
> org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:175)
> at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:358)
> at org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:205)
> at org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:204)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> at org.apache.spark.sql.SQLContext.<init>(SQLContext.scala:204)
> at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:71)
> at
> org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:53)
> at
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.<init>(SparkSQLCLIDriver.scala:248)
> at
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:136)
> at
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:483)
> at
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:664)
> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.UnsupportedOperationException: Not implemented by the
> TFS FileSystem implementation
> at org.apache.hadoop.fs.FileSystem.getScheme(FileSystem.java:214)
> at org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2365)
> at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2375)
> at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2392)
> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
> at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:167)
> at
> org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:342)
> ... 31 more```
>
> Thanks all
>
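One way to track down where the offending "TFS" FileSystem comes from is to list every Hadoop
FileSystem implementation the ServiceLoader can see, together with the jar it was loaded from.
A rough sketch for the spark-shell (or any JVM launched with the same classpath):

```scala
// Sketch only: enumerate FileSystem implementations registered via
// META-INF/services/org.apache.hadoop.fs.FileSystem and print their source jars.
import java.util.ServiceLoader
import org.apache.hadoop.fs.FileSystem
import scala.collection.JavaConverters._

ServiceLoader.load(classOf[FileSystem]).asScala.foreach { fs =>
  val src = Option(fs.getClass.getProtectionDomain.getCodeSource)
    .map(_.getLocation.toString).getOrElse("<bootstrap>")
  println(s"${fs.getClass.getName} -> $src")
}
```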


Getting not implemented by the TFS FileSystem implementation

2015-07-14 Thread Jerrick Hoang
Hi all,

I'm upgrading from Spark 1.3 to Spark 1.4, and when trying to run the spark-sql
CLI it gave a ```java.lang.UnsupportedOperationException: Not implemented
by the TFS FileSystem implementation``` exception. I did not get this error
with 1.3 and I don't use any TFS FileSystem. The full stack trace is:

```Exception in thread "main" java.lang.RuntimeException:
java.lang.UnsupportedOperationException: Not implemented by the TFS
FileSystem implementation
at
org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:346)
at
org.apache.spark.sql.hive.client.ClientWrapper.<init>(ClientWrapper.scala:105)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:408)
at
org.apache.spark.sql.hive.client.IsolatedClientLoader.liftedTree1$1(IsolatedClientLoader.scala:170)
at
org.apache.spark.sql.hive.client.IsolatedClientLoader.<init>(IsolatedClientLoader.scala:166)
at
org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:212)
at org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:175)
at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:358)
at org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:205)
at org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:204)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at org.apache.spark.sql.SQLContext.<init>(SQLContext.scala:204)
at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:71)
at
org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:53)
at
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.<init>(SparkSQLCLIDriver.scala:248)
at
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:136)
at
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:664)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.UnsupportedOperationException: Not implemented by the
TFS FileSystem implementation
at org.apache.hadoop.fs.FileSystem.getScheme(FileSystem.java:214)
at org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2365)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2375)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2392)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:167)
at
org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:342)
... 31 more```

Thanks all


hive-site.xml spark1.3

2015-07-13 Thread Jerrick Hoang
Hi all,

I have conf/hive-site.xml pointing to my Hive metastore, but the spark-sql
CLI doesn't pick it up (copying the same conf/ files to Spark 1.4 and 1.2
works fine). Just wondering if someone has seen this before.
Thanks
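A quick sanity check (a sketch; assumes a HiveContext-backed spark-shell launched with the
same conf/ directory) that shows which metastore the session actually resolved. If
hive.metastore.uris comes back empty, hive-site.xml was not picked up and Spark typically
falls back to a local Derby metastore.

```scala
// Sketch only: print the metastore URI the session sees and list the databases it can reach.
sqlContext.sql("SET hive.metastore.uris").collect().foreach(println)
sqlContext.sql("SHOW DATABASES").collect().foreach(println)
```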


Re: Basic Spark SQL question

2015-07-13 Thread Jerrick Hoang
Well, for ad hoc queries you can use the CLI.

On Mon, Jul 13, 2015 at 5:34 PM, Ron Gonzalez 
wrote:

> Hi,
>   I have a question for Spark SQL. Is there a way to be able to use Spark
> SQL on YARN without having to submit a job?
>   Bottom line here is I want to be able to reduce the latency of running
> queries as a job. I know that the spark sql default submission is like a
> job, but was wondering if it's possible to run queries like one would with
> a regular db like MySQL or Oracle.
>
> Thanks,
> Ron
>
>
>
>


Re: SparkSQL 'describe table' tries to look at all records

2015-07-12 Thread Jerrick Hoang
Sorry all for not being clear. I'm using Spark 1.4; the table is a Hive
table, and it is partitioned.

On Sun, Jul 12, 2015 at 6:36 PM, Yin Huai  wrote:

> Jerrick,
>
> Let me ask a few clarification questions. What is the version of Spark? Is
> the table a hive table? What is the format of the table? Is the table
> partitioned?
>
> Thanks,
>
> Yin
>
> On Sun, Jul 12, 2015 at 6:01 PM, ayan guha  wrote:
>
>> Describe computes statistics, so it will try to query the table. The one
>> you are looking for is df.printSchema()
>>
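To make the suggestion above concrete, a two-line sketch (the table name is illustrative);
printSchema only touches metadata, so it returns immediately even for a large partitioned table.

```scala
// Sketch only: look the table up by name and print its schema without scanning any data.
val df = sqlContext.table("some_big_partitioned_table")
df.printSchema()
```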
>> On Mon, Jul 13, 2015 at 10:03 AM, Jerrick Hoang 
>> wrote:
>>
>>> Hi all,
>>>
>>> I'm new to Spark and this question may be trivial or may already have been
>>> answered, but when I do a 'describe table' from the SparkSQL CLI it seems to
>>> look at all the records in the table (which takes a really long time for a
>>> big table) instead of just giving me the metadata of the table. I would
>>> appreciate it if someone could give me some pointers, thanks!
>>>
>>
>>
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>
>


SparkSQL 'describe table' tries to look at all records

2015-07-12 Thread Jerrick Hoang
Hi all,

I'm new to Spark and this question may be trivial or may already have been
answered, but when I do a 'describe table' from the SparkSQL CLI it seems to
look at all the records in the table (which takes a really long time for a
big table) instead of just giving me the metadata of the table. I would
appreciate it if someone could give me some pointers, thanks!