Re: Hive on Spark vs Spark on Hive(HiveContext)

2021-07-01 Thread Mich Talebzadeh
Hi Pralabh,

You need to check the latest compatibility matrix for which Spark versions can
successfully work as the Hive execution engine.

This is from my old configuration file, referring to spark-1.3.1 as the execution engine:

set spark.home=/data6/hduser/spark-1.3.1-bin-hadoop2.6;
--set spark.home=/usr/lib/spark-1.6.2-bin-hadoop2.6;
set spark.master=yarn-client;
set hive.execution.engine=spark;


Hive is great as a data warehouse, but its default MapReduce execution engine
is Jurassic Park.

On the other hand, Spark has a performant built-in API for Hive. Alternatively, you
can connect to Hive on a remote cluster through JDBC.
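
For the JDBC route, a minimal PySpark sketch might look like the one below. The
HiveServer2 URL, table name and credentials are placeholders, and the Hive JDBC
driver jar has to be made available to Spark (for example via --jars on spark-submit).

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hive_over_jdbc").getOrCreate()

# Read a remote Hive table over HiveServer2 JDBC (host, table and credentials
# below are hypothetical placeholders).
df = (spark.read.format("jdbc")
      .option("url", "jdbc:hive2://remotehost:10000/default")
      .option("driver", "org.apache.hive.jdbc.HiveDriver")
      .option("dbtable", "default.sales")
      .option("user", "hduser")
      .option("password", "xxxx")
      .load())

df.show(10)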

For the built-in Hive API, in Python you can do

from pyspark.sql import SparkSession

# A SparkSession with Hive support supersedes the old SQLContext/HiveContext
spark = SparkSession.builder.enableHiveSupport().getOrCreate()


And use it like below


fullyQualifiedTableName = "test.randomDataPy"

if spark.sql("SHOW TABLES IN test LIKE 'randomDataPy'").count() == 1:
    # Table already exists: just report how many rows it holds
    rows = spark.sql(f"SELECT COUNT(1) FROM {fullyQualifiedTableName}").collect()[0][0]
    print("number of rows is", rows)
else:
    print(f"\nTable {fullyQualifiedTableName} does not exist, creating table")
    sqltext = f"""
    CREATE TABLE {fullyQualifiedTableName}(
        ID INT
      , CLUSTERED INT
      , SCATTERED INT
      , RANDOMISED INT
      , RANDOM_STRING VARCHAR(50)
      , SMALL_VC VARCHAR(50)
      , PADDING VARCHAR(4000)
    )
    STORED AS PARQUET
    """
    spark.sql(sqltext)
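
As a quick illustrative follow-on (my own sketch; the generated values are
arbitrary placeholders), rows can then be appended to the newly created table
through the same Hive-aware writer:

import random

# Build a handful of rows in the same column order as the DDL above;
# insertInto resolves columns by position, not by name.
cols = ["ID", "CLUSTERED", "SCATTERED", "RANDOMISED",
        "RANDOM_STRING", "SMALL_VC", "PADDING"]
rows = [(i, i % 3, i % 5, random.randint(0, 100), f"row_{i}", str(i), "x" * 10)
        for i in range(10)]

spark.createDataFrame(rows, cols).write.insertInto("test.randomDataPy")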

HTH


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 1 Jul 2021 at 11:50, Pralabh Kumar  wrote:

> Hi Mich,
>
> Thanks for replying; your answer really helps. The comparison was done in
> 2016. I would like to know the latest comparison with Spark 3.0.
>
> Also, what you are suggesting is to migrate the queries to Spark, i.e.
> HiveContext rather than Hive on Spark, which is what Facebook also did.
> Is that understanding correct?
>
> Regards
> Pralabh
>
> On Thu, 1 Jul 2021, 15:44 Mich Talebzadeh, 
> wrote:
>
>> Hi Pralabh,
>>
>> This question has been asked before :)
>>
>> A few years ago (late 2016), I made a presentation on running Hive queries
>> on the Spark execution engine for Hortonworks.
>>
>>
>> https://www.slideshare.net/MichTalebzadeh1/query-engines-for-hive-mr-spark-tez-with-llap-considerations
>>
>> The issue you will face is compatibility between versions of Hive and Spark.
>>
>> My suggestion would be to use Spark as the massively parallel processing
>> engine and Hive as the storage layer. However, you need to test what can
>> and cannot be migrated.
>>
>> HTH
>>
>>
>> Mich
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Thu, 1 Jul 2021 at 10:52, Pralabh Kumar 
>> wrote:
>>
>>> Hi Dev
>>>
>>> I have thousands of legacy Hive queries. As part of a plan to move to
>>> Spark, we are planning to migrate the Hive queries to Spark. There are
>>> two approaches:
>>>
>>>    1. Hive on Spark, which is similar to changing the execution engine
>>>    in Hive queries, as with Tez.
>>>    2. Migrating the Hive queries to HiveContext/Spark SQL, an approach
>>>    used by Facebook and presented at a Spark conference:
>>> https://databricks.com/session/experiences-migrating-hive-workload-to-sparksql#:~:text=Spark%20SQL%20in%20Apache%20Spark,SQL%20with%20minimal%20user%20intervention
>>>
>>> Can you please guide me on which option to go for? I am personally
>>> inclined to go for option 2; it also allows the use of the latest Spark.
>>>
>>> Please help me with this, as there are not many comparisons available
>>> online with Spark 3.0 in perspective.
>>>
>>> Regards
>>> Pralabh Kumar
>>>
>>>
>>>


Re: Hive on Spark vs Spark on Hive(HiveContext)

2021-07-01 Thread Pralabh Kumar
Hi Mich,

Thanks for replying; your answer really helps. The comparison was done in
2016. I would like to know the latest comparison with Spark 3.0.

Also, what you are suggesting is to migrate the queries to Spark, i.e.
HiveContext rather than Hive on Spark, which is what Facebook also did.
Is that understanding correct?

Regards
Pralabh

On Thu, 1 Jul 2021, 15:44 Mich Talebzadeh, 
wrote:

> Hi Pralabh,
>
> This question has been asked before :)
>
> A few years ago (late 2016), I made a presentation on running Hive queries
> on the Spark execution engine for Hortonworks.
>
>
> https://www.slideshare.net/MichTalebzadeh1/query-engines-for-hive-mr-spark-tez-with-llap-considerations
>
> The issue you will face is compatibility between versions of Hive and Spark.
>
> My suggestion would be to use Spark as the massively parallel processing
> engine and Hive as the storage layer. However, you need to test what can
> and cannot be migrated.
>
> HTH
>
>
> Mich
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Thu, 1 Jul 2021 at 10:52, Pralabh Kumar  wrote:
>
>> Hi Dev
>>
>> I have thousands of legacy Hive queries. As part of a plan to move to
>> Spark, we are planning to migrate the Hive queries to Spark. There are
>> two approaches:
>>
>>    1. Hive on Spark, which is similar to changing the execution engine
>>    in Hive queries, as with Tez.
>>    2. Migrating the Hive queries to HiveContext/Spark SQL, an approach
>>    used by Facebook and presented at a Spark conference:
>> https://databricks.com/session/experiences-migrating-hive-workload-to-sparksql#:~:text=Spark%20SQL%20in%20Apache%20Spark,SQL%20with%20minimal%20user%20intervention
>>
>> Can you please guide me on which option to go for? I am personally
>> inclined to go for option 2; it also allows the use of the latest Spark.
>>
>> Please help me with this, as there are not many comparisons available
>> online with Spark 3.0 in perspective.
>>
>> Regards
>> Pralabh Kumar
>>
>>
>>


Re: Hive on Spark vs Spark on Hive(HiveContext)

2021-07-01 Thread Mich Talebzadeh
Hi Pralabh,

This question has been asked before :)

A few years ago (late 2016), I made a presentation on running Hive queries
on the Spark execution engine for Hortonworks.

https://www.slideshare.net/MichTalebzadeh1/query-engines-for-hive-mr-spark-tez-with-llap-considerations

The issue you will face is compatibility between versions of Hive and Spark.

My suggestion would be to use Spark as the massively parallel processing
engine and Hive as the storage layer. However, you need to test what can and
cannot be migrated.
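
As a rough PySpark sketch of that split (Spark doing the heavy lifting, Hive
acting purely as the storage layer), with hypothetical table and column names:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("spark_compute_hive_storage")
         .enableHiveSupport()
         .getOrCreate())

# Read from Hive, do the aggregation in Spark, write the result back to Hive.
sales = spark.table("warehouse.sales")

summary = (sales
           .groupBy("product_id")
           .agg(F.sum("amount").alias("total_amount"),
                F.count(F.lit(1)).alias("num_orders")))

summary.write.mode("overwrite").saveAsTable("warehouse.sales_summary")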

HTH


Mich


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 1 Jul 2021 at 10:52, Pralabh Kumar  wrote:

> Hi Dev
>
> I have thousands of legacy Hive queries. As part of a plan to move to
> Spark, we are planning to migrate the Hive queries to Spark. There are
> two approaches:
>
>    1. Hive on Spark, which is similar to changing the execution engine
>    in Hive queries, as with Tez.
>    2. Migrating the Hive queries to HiveContext/Spark SQL, an approach
>    used by Facebook and presented at a Spark conference:
> https://databricks.com/session/experiences-migrating-hive-workload-to-sparksql#:~:text=Spark%20SQL%20in%20Apache%20Spark,SQL%20with%20minimal%20user%20intervention
>
> Can you please guide me on which option to go for? I am personally
> inclined to go for option 2; it also allows the use of the latest Spark.
>
> Please help me with this, as there are not many comparisons available
> online with Spark 3.0 in perspective.
>
> Regards
> Pralabh Kumar
>
>
>


Hive on Spark vs Spark on Hive(HiveContext)

2021-07-01 Thread Pralabh Kumar
Hi Dev

I have thousands of legacy Hive queries. As part of a plan to move to Spark,
we are planning to migrate the Hive queries to Spark. There are two
approaches:

   1. Hive on Spark, which is similar to changing the execution engine in
   Hive queries, as with Tez.
   2. Migrating the Hive queries to HiveContext/Spark SQL, an approach used
   by Facebook and presented at a Spark conference:
   https://databricks.com/session/experiences-migrating-hive-workload-to-sparksql#:~:text=Spark%20SQL%20in%20Apache%20Spark,SQL%20with%20minimal%20user%20intervention

Can you please guide me on which option to go for? I am personally inclined
to go for option 2; it also allows the use of the latest Spark.

Please help me with this, as there are not many comparisons available online
with Spark 3.0 in perspective.

Regards
Pralabh Kumar


HiveContext on Spark 1.6 Linkage Error:ClassCastException

2017-02-14 Thread Enrico DUrso
Hello guys,
I hope all of you are well.
I am trying to use HiveContext on Spark 1.6. I am developing in Eclipse and I
placed hive-site.xml on the classpath, so that I use the Hive instance running
on my cluster instead of creating a local metastore and a local warehouse.
So far so good: in this scenario SELECT * and INSERT INTO queries work fine, but
the problem arises when trying to drop tables and/or create new ones.
Provided it is not a permission problem, my issue is:
ClassCastException: attempting to cast jar
file://.../com/sun/jersey/jersey-core/1.9/jersey-core-1.9.jar!javax/ws/rs/ext/RunTimeDelegate.class
to jar
file://.../com/sun/jersey/jersey-core/1.9/jersey-core-1.9.jar!javax/ws/rs/ext/RunTimeDelegate.class.

As you can see, it is attempting to cast the same jar to itself, and it throws
the exception. I think this is because the same jar has been loaded twice by
different classloaders: one copy is loaded by
org.apache.spark.sql.hive.client.IsolatedClientLoader and the other by
sun.misc.Launcher$AppClassLoader.

Any suggestion on how to fix this issue? The same happens when building the jar
and running it with spark-submit (YARN RM).

Cheers,

best



CONFIDENTIALITY WARNING.
This message and the information contained in or attached to it are private and 
confidential and intended exclusively for the addressee. everis informs to whom 
it may receive it in error that it contains privileged information and its use, 
copy, reproduction or distribution is prohibited. If you are not an intended 
recipient of this E-mail, please notify the sender, delete it and do not read, 
act upon, print, disclose, copy, retain or redistribute any portion of this 
E-mail.


not able to connect to table using hiveContext

2016-11-01 Thread vinay parekar

Hi there,
   I am trying to get some table data using Spark's hiveContext. I am getting
an exception:

org.apache.hadoop.hive.ql.metadata.HiveException: Unable to fetch table rnow_imports_text. null
   at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:1158)
   at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1.apply(ClientWrapper.scala:302)
   at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1.apply(ClientWrapper.scala:298)
   at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:256)
   at org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:211)
   at org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:248)
   at org.apache.spark.sql.hive.client.ClientWrapper.getTableOption(ClientWrapper.scala:298)
   at org.apache.spark.sql.hive.client.ClientInterface$class.getTable(ClientInterface.scala:123)
   at org.apache.spark.sql.hive.client.ClientWrapper.getTable(ClientWrapper.scala:60)
   at org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:406)
   at org.apache.spark.sql.hive.HiveContext$$anon$1.org$apache$spark$sql$catalyst$analysis$OverrideCatalog$$super$lookupRelation(HiveContext.scala:422)
   at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:203)
   at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:203)
   at scala.Option.getOrElse(Option.scala:120)
   at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$class.lookupRelation(Catalog.scala:203)
   at org.apache.spark.sql.hive.HiveContext$$anon$1.lookupRelation(HiveContext.scala:422)
   at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.getTable(Analyzer.scala:257)
   at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$7.applyOrElse(Analyzer.scala:268)
   at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$7.applyOrElse(Analyzer.scala:264)
   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
   at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:56)
   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$1.apply(LogicalPlan.scala:54)
   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$1.apply(LogicalPlan.scala:54)
   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:249)
   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
   at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
   at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
   at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
   at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
   at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
   at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:279)
   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:54)
   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$1.apply(LogicalPlan.scala:54)
   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$1.apply(LogicalPlan.scala:54)
   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:249)
   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
   at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
   at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
   at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
   at scala.collection.AbstractIterator.to

Re: HiveContext is Serialized?

2016-10-26 Thread Mich Talebzadeh
Thanks Sean.

I believe you are referring to the statement below:

"You can't use the HiveContext or SparkContext in a distribution operation.
It has nothing to do with for loops.

The fact that they're serializable is misleading. It's there, I believe,
because these objects may be inadvertently referenced in the closure of a
function that executes remotely, yet doesn't use the context. The closure
cleaner can't always remove this reference. The task would fail to
serialize even though it doesn't use the context. You will find these
objects serialize but then don't work if used remotely."



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 26 October 2016 at 09:27, Sean Owen <so...@cloudera.com> wrote:

> Yes, but the question here is why the context objects are marked
> serializable when they are not meant to be sent somewhere as bytes. I tried
> to answer that apparent inconsistency below.
>
>
> On Wed, Oct 26, 2016, 10:21 Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
>> Hi,
>>
>> Sorry for asking this rather naïve question.
>>
>> My question is about the notion of serialisation in Spark and where things
>> can or cannot be serialised. Does this generally refer to the concept of
>> serialisation in the context of data storage?
>>
>> In this context, for example with reference to RDD operations, is it the
>> process of translating object state into a format that can be stored in
>> and retrieved from a memory buffer?
>>
>> Thanks
>>
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 26 October 2016 at 09:06, Sean Owen <so...@cloudera.com> wrote:
>>
>> It is the driver that has the info needed to schedule and manage
>> distributed jobs and that is by design.
>>
>> This is narrowly about using the HiveContext or SparkContext directly. Of
>> course SQL operations are distributed.
>>
>>
>> On Wed, Oct 26, 2016, 10:03 Mich Talebzadeh <mich.talebza...@gmail.com>
>> wrote:
>>
>> Hi Sean,
>>
>> Your point:
>>
>> "You can't use the HiveContext or SparkContext in a distribution
>> operation..."
>>
>> Is this because of a design issue?
>>
>> Case in point: if I create a DF from an RDD and register it as a tempTable,
>> does this imply that any SQL calls on that table are localised and not
>> distributed among the executors?
>>
>> Thanks
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 26 October 2016 at 06:43, Ajay Chander <itsche...@gmail.com> wrote:
>>
>> Sean, thank you for making it clear. It was helpful.
>>
>> Regards,
>> Ajay
>>
>>
>> On Wednesday, October 26, 2016, Sean Owen <so...@cloudera.com> wrote:
>>
>> This usage is fine, because you are only using the HiveContext locally on
>> the driver. It's

Re: HiveContext is Serialized?

2016-10-26 Thread Sean Owen
Yes, but the question here is why the context objects are marked
serializable when they are not meant to be sent somewhere as bytes. I tried
to answer that apparent inconsistency below.

On Wed, Oct 26, 2016, 10:21 Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Hi,
>
> Sorry for asking this rather naïve question.
>
> My question is about the notion of serialisation in Spark and where things
> can or cannot be serialised. Does this generally refer to the concept of
> serialisation in the context of data storage?
>
> In this context, for example with reference to RDD operations, is it the
> process of translating object state into a format that can be stored in
> and retrieved from a memory buffer?
>
> Thanks
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 26 October 2016 at 09:06, Sean Owen <so...@cloudera.com> wrote:
>
> It is the driver that has the info needed to schedule and manage
> distributed jobs and that is by design.
>
> This is narrowly about using the HiveContext or SparkContext directly. Of
> course SQL operations are distributed.
>
>
> On Wed, Oct 26, 2016, 10:03 Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
> Hi Sean,
>
> Your point:
>
> "You can't use the HiveContext or SparkContext in a distribution
> operation..."
>
> Is this because of a design issue?
>
> Case in point: if I create a DF from an RDD and register it as a tempTable,
> does this imply that any SQL calls on that table are localised and not
> distributed among the executors?
>
> Thanks
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 26 October 2016 at 06:43, Ajay Chander <itsche...@gmail.com> wrote:
>
> Sean, thank you for making it clear. It was helpful.
>
> Regards,
> Ajay
>
>
> On Wednesday, October 26, 2016, Sean Owen <so...@cloudera.com> wrote:
>
> This usage is fine, because you are only using the HiveContext locally on
> the driver. It's applied in a function that's used on a Scala collection.
>
> You can't use the HiveContext or SparkContext in a distribution operation.
> It has nothing to do with for loops.
>
> The fact that they're serializable is misleading. It's there, I believe,
> because these objects may be inadvertently referenced in the closure of a
> function that executes remotely, yet doesn't use the context. The closure
> cleaner can't always remove this reference. The task would fail to
> serialize even though it doesn't use the context. You will find these
> objects serialize but then don't work if used remotely.
>
> The NPE you see is an unrelated cosmetic problem that was fixed in 2.0.1
> IIRC.
>
> On Wed, Oct 26, 2016 at 4:28 AM Ajay Chander <itsche...@gmail.com> wrote:
>
> Hi Everyone,
>
> I was thinking if I can use hiveContext inside foreach like below,
>
> object Test {
>   def main(args: Array[String]): Unit = {
>
> val conf = new SparkConf()
> val sc = new SparkContext(conf)
> val hiveContext = new HiveContext(sc)
>
> val dataElementsFile = args(0)
> val deDF = 
> hiveContext.read.text(dataElementsFile).toDF("DataElement").coalesce(1).distinct().cache()
>
> def calculate(de: Row) {
>   val dataElement = de.getAs[String]("DataElement").trim
>   val df1 = hiveContext.sql("SELECT cyc_dt, supplier_proc_i, '" + 
> dataElement + "' as data_elm, " + dataElement + " as data_elm_val FROM 
> TEST_DB.TEST_TABLE1 ")
>   df1.write.insertInto("TEST_DB.TEST_TABLE1")
> }
>
> deDF.collect().foreach(calculate)
>   }
> }
>
>
> I looked at 
> https://spark.apache.org/docs/1.6.0/api/scala/index.html#org.apache.spark.sql.hive.HiveContext
>  and I see it is extending SqlContext which extends Logging with Serializable.
>
> Can anyone tell me if this is the right way to use it ? Thanks for your time.
>
> Regards,
>
> Ajay
>
>
>
>


Re: HiveContext is Serialized?

2016-10-26 Thread Mich Talebzadeh
Hi,

Sorry for asking this rather naïve question.

My question is about the notion of serialisation in Spark and where things
can or cannot be serialised. Does this generally refer to the concept of
serialisation in the context of data storage?

In this context, for example with reference to RDD operations, is it the
process of translating object state into a format that can be stored in and
retrieved from a memory buffer?

Thanks




Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 26 October 2016 at 09:06, Sean Owen <so...@cloudera.com> wrote:

> It is the driver that has the info needed to schedule and manage
> distributed jobs and that is by design.
>
> This is narrowly about using the HiveContext or SparkContext directly. Of
> course SQL operations are distributed.
>
>
> On Wed, Oct 26, 2016, 10:03 Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
>> Hi Sean,
>>
>> Your point:
>>
>> "You can't use the HiveContext or SparkContext in a distribution
>> operation..."
>>
>> Is this because of a design issue?
>>
>> Case in point: if I create a DF from an RDD and register it as a tempTable,
>> does this imply that any SQL calls on that table are localised and not
>> distributed among the executors?
>>
>> Thanks
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 26 October 2016 at 06:43, Ajay Chander <itsche...@gmail.com> wrote:
>>
>> Sean, thank you for making it clear. It was helpful.
>>
>> Regards,
>> Ajay
>>
>>
>> On Wednesday, October 26, 2016, Sean Owen <so...@cloudera.com> wrote:
>>
>> This usage is fine, because you are only using the HiveContext locally on
>> the driver. It's applied in a function that's used on a Scala collection.
>>
>> You can't use the HiveContext or SparkContext in a distribution
>> operation. It has nothing to do with for loops.
>>
>> The fact that they're serializable is misleading. It's there, I believe,
>> because these objects may be inadvertently referenced in the closure of a
>> function that executes remotely, yet doesn't use the context. The closure
>> cleaner can't always remove this reference. The task would fail to
>> serialize even though it doesn't use the context. You will find these
>> objects serialize but then don't work if used remotely.
>>
>> The NPE you see is an unrelated cosmetic problem that was fixed in 2.0.1
>> IIRC.
>>
>> On Wed, Oct 26, 2016 at 4:28 AM Ajay Chander <itsche...@gmail.com> wrote:
>>
>> Hi Everyone,
>>
>> I was thinking if I can use hiveContext inside foreach like below,
>>
>> object Test {
>>   def main(args: Array[String]): Unit = {
>>
>> val conf = new SparkConf()
>> val sc = new SparkContext(conf)
>> val hiveContext = new HiveContext(sc)
>>
>> val dataElementsFile = args(0)
>> val deDF = 
>> hiveContext.read.text(dataElementsFile).toDF("DataElement").coalesce(1).distinct().cache()
>>
>> def calculate(de: Row) {
>>   val dataElement = de.getAs[String]("DataElement").trim
>>   val df1 = hiveContext.sql("SELECT cyc_dt, supplier_proc_i, '" + 
>> dataElement + "' as data_elm, " + dataElement + " as data_elm_val FROM 
>> TEST_DB.TEST_TABLE1 ")
>>   df1.write.insertInto("TEST_DB.TEST_TABLE1")
>> }
>>
>> deDF.collect().foreach(calculate)
>>   }
>> }
>>
>>
>> I looked at 
>> https://spark.apache.org/docs/1.6.0/api/scala/index.html#org.apache.spark.sql.hive.HiveContext
>>  and I see it is extending SqlContext which extends Logging with 
>> Serializable.
>>
>> Can anyone tell me if this is the right way to use it ? Thanks for your time.
>>
>> Regards,
>>
>> Ajay
>>
>>
>>


Re: HiveContext is Serialized?

2016-10-26 Thread Sean Owen
It is the driver that has the info needed to schedule and manage
distributed jobs and that is by design.

This is narrowly about using the HiveContext or SparkContext directly. Of
course SQL operations are distributed.

On Wed, Oct 26, 2016, 10:03 Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Hi Sean,
>
> Your point:
>
> "You can't use the HiveContext or SparkContext in a distribution
> operation..."
>
> Is this because of a design issue?
>
> Case in point: if I create a DF from an RDD and register it as a tempTable,
> does this imply that any SQL calls on that table are localised and not
> distributed among the executors?
>
> Thanks
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 26 October 2016 at 06:43, Ajay Chander <itsche...@gmail.com> wrote:
>
> Sean, thank you for making it clear. It was helpful.
>
> Regards,
> Ajay
>
>
> On Wednesday, October 26, 2016, Sean Owen <so...@cloudera.com> wrote:
>
> This usage is fine, because you are only using the HiveContext locally on
> the driver. It's applied in a function that's used on a Scala collection.
>
> You can't use the HiveContext or SparkContext in a distribution operation.
> It has nothing to do with for loops.
>
> The fact that they're serializable is misleading. It's there, I believe,
> because these objects may be inadvertently referenced in the closure of a
> function that executes remotely, yet doesn't use the context. The closure
> cleaner can't always remove this reference. The task would fail to
> serialize even though it doesn't use the context. You will find these
> objects serialize but then don't work if used remotely.
>
> The NPE you see is an unrelated cosmetic problem that was fixed in 2.0.1
> IIRC.
>
> On Wed, Oct 26, 2016 at 4:28 AM Ajay Chander <itsche...@gmail.com> wrote:
>
> Hi Everyone,
>
> I was thinking if I can use hiveContext inside foreach like below,
>
> object Test {
>   def main(args: Array[String]): Unit = {
>
> val conf = new SparkConf()
> val sc = new SparkContext(conf)
> val hiveContext = new HiveContext(sc)
>
> val dataElementsFile = args(0)
> val deDF = 
> hiveContext.read.text(dataElementsFile).toDF("DataElement").coalesce(1).distinct().cache()
>
> def calculate(de: Row) {
>   val dataElement = de.getAs[String]("DataElement").trim
>   val df1 = hiveContext.sql("SELECT cyc_dt, supplier_proc_i, '" + 
> dataElement + "' as data_elm, " + dataElement + " as data_elm_val FROM 
> TEST_DB.TEST_TABLE1 ")
>   df1.write.insertInto("TEST_DB.TEST_TABLE1")
> }
>
> deDF.collect().foreach(calculate)
>   }
> }
>
>
> I looked at 
> https://spark.apache.org/docs/1.6.0/api/scala/index.html#org.apache.spark.sql.hive.HiveContext
>  and I see it is extending SqlContext which extends Logging with Serializable.
>
> Can anyone tell me if this is the right way to use it ? Thanks for your time.
>
> Regards,
>
> Ajay
>
>
>


Re: HiveContext is Serialized?

2016-10-26 Thread ayan guha
In your use case, your deDF need not be a DataFrame. You could use
sc.textFile().collect().
Even better, you can just read from a local file, as your file is very small,
unless you are planning to use YARN cluster mode.
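
For example, something along these lines, written as a PySpark-style sketch of
the idea rather than the original Scala (with Spark 1.6 the calls would go
through hiveContext.sql; the file path and the target table are placeholders):

# The list of data elements only needs to exist on the driver, so a plain
# local read is enough; no DataFrame required.
with open("/path/to/data_elements.txt") as f:
    data_elements = [line.strip() for line in f if line.strip()]

for element in data_elements:
    # Each statement is issued from the driver; Spark still runs it in parallel.
    spark.sql(f"""
        INSERT INTO TABLE TEST_DB.TEST_TABLE2
        SELECT cyc_dt, supplier_proc_i, '{element}' AS data_elm,
               {element} AS data_elm_val
        FROM TEST_DB.TEST_TABLE1
    """)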
On 26 Oct 2016 16:43, "Ajay Chander" <itsche...@gmail.com> wrote:

> Sean, thank you for making it clear. It was helpful.
>
> Regards,
> Ajay
>
> On Wednesday, October 26, 2016, Sean Owen <so...@cloudera.com> wrote:
>
>> This usage is fine, because you are only using the HiveContext locally on
>> the driver. It's applied in a function that's used on a Scala collection.
>>
>> You can't use the HiveContext or SparkContext in a distribution
>> operation. It has nothing to do with for loops.
>>
>> The fact that they're serializable is misleading. It's there, I believe,
>> because these objects may be inadvertently referenced in the closure of a
>> function that executes remotely, yet doesn't use the context. The closure
>> cleaner can't always remove this reference. The task would fail to
>> serialize even though it doesn't use the context. You will find these
>> objects serialize but then don't work if used remotely.
>>
>> The NPE you see is an unrelated cosmetic problem that was fixed in 2.0.1
>> IIRC.
>>
>> On Wed, Oct 26, 2016 at 4:28 AM Ajay Chander <itsche...@gmail.com> wrote:
>>
>>> Hi Everyone,
>>>
>>> I was thinking if I can use hiveContext inside foreach like below,
>>>
>>> object Test {
>>>   def main(args: Array[String]): Unit = {
>>>
>>> val conf = new SparkConf()
>>> val sc = new SparkContext(conf)
>>> val hiveContext = new HiveContext(sc)
>>>
>>> val dataElementsFile = args(0)
>>> val deDF = 
>>> hiveContext.read.text(dataElementsFile).toDF("DataElement").coalesce(1).distinct().cache()
>>>
>>> def calculate(de: Row) {
>>>   val dataElement = de.getAs[String]("DataElement").trim
>>>   val df1 = hiveContext.sql("SELECT cyc_dt, supplier_proc_i, '" + 
>>> dataElement + "' as data_elm, " + dataElement + " as data_elm_val FROM 
>>> TEST_DB.TEST_TABLE1 ")
>>>   df1.write.insertInto("TEST_DB.TEST_TABLE1")
>>> }
>>>
>>> deDF.collect().foreach(calculate)
>>>   }
>>> }
>>>
>>>
>>> I looked at 
>>> https://spark.apache.org/docs/1.6.0/api/scala/index.html#org.apache.spark.sql.hive.HiveContext
>>>  and I see it is extending SqlContext which extends Logging with 
>>> Serializable.
>>>
>>> Can anyone tell me if this is the right way to use it ? Thanks for your 
>>> time.
>>>
>>> Regards,
>>>
>>> Ajay
>>>
>>>


Re: HiveContext is Serialized?

2016-10-26 Thread Mich Talebzadeh
Hi Sean,

Your point:

"You can't use the HiveContext or SparkContext in a distribution
operation..."

Is this because of a design issue?

Case in point: if I create a DF from an RDD and register it as a tempTable,
does this imply that any SQL calls on that table are localised and not
distributed among the executors?

Thanks


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 26 October 2016 at 06:43, Ajay Chander <itsche...@gmail.com> wrote:

> Sean, thank you for making it clear. It was helpful.
>
> Regards,
> Ajay
>
>
> On Wednesday, October 26, 2016, Sean Owen <so...@cloudera.com> wrote:
>
>> This usage is fine, because you are only using the HiveContext locally on
>> the driver. It's applied in a function that's used on a Scala collection.
>>
>> You can't use the HiveContext or SparkContext in a distribution
>> operation. It has nothing to do with for loops.
>>
>> The fact that they're serializable is misleading. It's there, I believe,
>> because these objects may be inadvertently referenced in the closure of a
>> function that executes remotely, yet doesn't use the context. The closure
>> cleaner can't always remove this reference. The task would fail to
>> serialize even though it doesn't use the context. You will find these
>> objects serialize but then don't work if used remotely.
>>
>> The NPE you see is an unrelated cosmetic problem that was fixed in 2.0.1
>> IIRC.
>>
>> On Wed, Oct 26, 2016 at 4:28 AM Ajay Chander <itsche...@gmail.com> wrote:
>>
>>> Hi Everyone,
>>>
>>> I was thinking if I can use hiveContext inside foreach like below,
>>>
>>> object Test {
>>>   def main(args: Array[String]): Unit = {
>>>
>>> val conf = new SparkConf()
>>> val sc = new SparkContext(conf)
>>> val hiveContext = new HiveContext(sc)
>>>
>>> val dataElementsFile = args(0)
>>> val deDF = 
>>> hiveContext.read.text(dataElementsFile).toDF("DataElement").coalesce(1).distinct().cache()
>>>
>>> def calculate(de: Row) {
>>>   val dataElement = de.getAs[String]("DataElement").trim
>>>   val df1 = hiveContext.sql("SELECT cyc_dt, supplier_proc_i, '" + 
>>> dataElement + "' as data_elm, " + dataElement + " as data_elm_val FROM 
>>> TEST_DB.TEST_TABLE1 ")
>>>   df1.write.insertInto("TEST_DB.TEST_TABLE1")
>>> }
>>>
>>> deDF.collect().foreach(calculate)
>>>   }
>>> }
>>>
>>>
>>> I looked at 
>>> https://spark.apache.org/docs/1.6.0/api/scala/index.html#org.apache.spark.sql.hive.HiveContext
>>>  and I see it is extending SqlContext which extends Logging with 
>>> Serializable.
>>>
>>> Can anyone tell me if this is the right way to use it ? Thanks for your 
>>> time.
>>>
>>> Regards,
>>>
>>> Ajay
>>>
>>>


Re: HiveContext is Serialized?

2016-10-25 Thread Ajay Chander
Sean, thank you for making it clear. It was helpful.

Regards,
Ajay

On Wednesday, October 26, 2016, Sean Owen <so...@cloudera.com> wrote:

> This usage is fine, because you are only using the HiveContext locally on
> the driver. It's applied in a function that's used on a Scala collection.
>
> You can't use the HiveContext or SparkContext in a distribution operation.
> It has nothing to do with for loops.
>
> The fact that they're serializable is misleading. It's there, I believe,
> because these objects may be inadvertently referenced in the closure of a
> function that executes remotely, yet doesn't use the context. The closure
> cleaner can't always remove this reference. The task would fail to
> serialize even though it doesn't use the context. You will find these
> objects serialize but then don't work if used remotely.
>
> The NPE you see is an unrelated cosmetic problem that was fixed in 2.0.1
> IIRC.
>
> On Wed, Oct 26, 2016 at 4:28 AM Ajay Chander <itsche...@gmail.com
> <javascript:_e(%7B%7D,'cvml','itsche...@gmail.com');>> wrote:
>
>> Hi Everyone,
>>
>> I was thinking if I can use hiveContext inside foreach like below,
>>
>> object Test {
>>   def main(args: Array[String]): Unit = {
>>
>> val conf = new SparkConf()
>> val sc = new SparkContext(conf)
>> val hiveContext = new HiveContext(sc)
>>
>> val dataElementsFile = args(0)
>> val deDF = 
>> hiveContext.read.text(dataElementsFile).toDF("DataElement").coalesce(1).distinct().cache()
>>
>> def calculate(de: Row) {
>>   val dataElement = de.getAs[String]("DataElement").trim
>>   val df1 = hiveContext.sql("SELECT cyc_dt, supplier_proc_i, '" + 
>> dataElement + "' as data_elm, " + dataElement + " as data_elm_val FROM 
>> TEST_DB.TEST_TABLE1 ")
>>   df1.write.insertInto("TEST_DB.TEST_TABLE1")
>> }
>>
>> deDF.collect().foreach(calculate)
>>   }
>> }
>>
>>
>> I looked at 
>> https://spark.apache.org/docs/1.6.0/api/scala/index.html#org.apache.spark.sql.hive.HiveContext
>>  and I see it is extending SqlContext which extends Logging with 
>> Serializable.
>>
>> Can anyone tell me if this is the right way to use it ? Thanks for your time.
>>
>> Regards,
>>
>> Ajay
>>
>>


Re: HiveContext is Serialized?

2016-10-25 Thread Sunita Arvind
Thanks for the response Sean. I have seen the NPE on similar issues very
consistently and assumed that could be the reason :) Thanks for clarifying.
regards
Sunita

On Tue, Oct 25, 2016 at 10:11 PM, Sean Owen <so...@cloudera.com> wrote:

> This usage is fine, because you are only using the HiveContext locally on
> the driver. It's applied in a function that's used on a Scala collection.
>
> You can't use the HiveContext or SparkContext in a distribution operation.
> It has nothing to do with for loops.
>
> The fact that they're serializable is misleading. It's there, I believe,
> because these objects may be inadvertently referenced in the closure of a
> function that executes remotely, yet doesn't use the context. The closure
> cleaner can't always remove this reference. The task would fail to
> serialize even though it doesn't use the context. You will find these
> objects serialize but then don't work if used remotely.
>
> The NPE you see is an unrelated cosmetic problem that was fixed in 2.0.1
> IIRC.
>
>
> On Wed, Oct 26, 2016 at 4:28 AM Ajay Chander <itsche...@gmail.com> wrote:
>
>> Hi Everyone,
>>
>> I was thinking if I can use hiveContext inside foreach like below,
>>
>> object Test {
>>   def main(args: Array[String]): Unit = {
>>
>> val conf = new SparkConf()
>> val sc = new SparkContext(conf)
>> val hiveContext = new HiveContext(sc)
>>
>> val dataElementsFile = args(0)
>> val deDF = 
>> hiveContext.read.text(dataElementsFile).toDF("DataElement").coalesce(1).distinct().cache()
>>
>> def calculate(de: Row) {
>>   val dataElement = de.getAs[String]("DataElement").trim
>>   val df1 = hiveContext.sql("SELECT cyc_dt, supplier_proc_i, '" + 
>> dataElement + "' as data_elm, " + dataElement + " as data_elm_val FROM 
>> TEST_DB.TEST_TABLE1 ")
>>   df1.write.insertInto("TEST_DB.TEST_TABLE1")
>> }
>>
>> deDF.collect().foreach(calculate)
>>   }
>> }
>>
>>
>> I looked at 
>> https://spark.apache.org/docs/1.6.0/api/scala/index.html#org.apache.spark.sql.hive.HiveContext
>>  and I see it is extending SqlContext which extends Logging with 
>> Serializable.
>>
>> Can anyone tell me if this is the right way to use it ? Thanks for your time.
>>
>> Regards,
>>
>> Ajay
>>
>>


Re: HiveContext is Serialized?

2016-10-25 Thread Sean Owen
This usage is fine, because you are only using the HiveContext locally on
the driver. It's applied in a function that's used on a Scala collection.

You can't use the HiveContext or SparkContext in a distribution operation.
It has nothing to do with for loops.

The fact that they're serializable is misleading. It's there, I believe,
because these objects may be inadvertently referenced in the closure of a
function that executes remotely, yet doesn't use the context. The closure
cleaner can't always remove this reference. The task would fail to
serialize even though it doesn't use the context. You will find these
objects serialize but then don't work if used remotely.

The NPE you see is an unrelated cosmetic problem that was fixed in 2.0.1
IIRC.
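
To make the distinction concrete, here is a small PySpark sketch of the same
point (names adapted from the example above; de_df, the spark session and the
target table are my own assumptions, and with Spark 1.6 the calls would go
through the hiveContext instead):

# Driver-side loop: fine. collect() brings the small DataFrame back to the
# driver, and each spark.sql() call is issued from the driver; the queries
# themselves still execute on the executors.
for row in de_df.collect():
    element = row["DataElement"].strip()
    df1 = spark.sql(
        f"SELECT cyc_dt, supplier_proc_i, '{element}' AS data_elm, "
        f"{element} AS data_elm_val FROM TEST_DB.TEST_TABLE1")
    df1.write.insertInto("TEST_DB.TEST_TABLE2")   # hypothetical target table

# Distributed operation: not fine. The lambda is serialized and shipped to the
# executors, where the session/context cannot be used, so this fails at runtime.
de_df.foreach(lambda row: spark.sql("SELECT 1").count())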

On Wed, Oct 26, 2016 at 4:28 AM Ajay Chander <itsche...@gmail.com> wrote:

> Hi Everyone,
>
> I was thinking if I can use hiveContext inside foreach like below,
>
> object Test {
>   def main(args: Array[String]): Unit = {
>
> val conf = new SparkConf()
> val sc = new SparkContext(conf)
>     val hiveContext = new HiveContext(sc)
>
> val dataElementsFile = args(0)
> val deDF = 
> hiveContext.read.text(dataElementsFile).toDF("DataElement").coalesce(1).distinct().cache()
>
> def calculate(de: Row) {
>   val dataElement = de.getAs[String]("DataElement").trim
>   val df1 = hiveContext.sql("SELECT cyc_dt, supplier_proc_i, '" + 
> dataElement + "' as data_elm, " + dataElement + " as data_elm_val FROM 
> TEST_DB.TEST_TABLE1 ")
>   df1.write.insertInto("TEST_DB.TEST_TABLE1")
> }
>
> deDF.collect().foreach(calculate)
>   }
> }
>
>
> I looked at 
> https://spark.apache.org/docs/1.6.0/api/scala/index.html#org.apache.spark.sql.hive.HiveContext
>  and I see it is extending SqlContext which extends Logging with Serializable.
>
> Can anyone tell me if this is the right way to use it ? Thanks for your time.
>
> Regards,
>
> Ajay
>
>


Re: HiveContext is Serialized?

2016-10-25 Thread Ajay Chander
Sunita, thanks for your time. In my scenario, based on each attribute from
deDF (one column with just 66 rows), I have to query a Hive table and insert
the results into another table.

Thanks,
Ajay

On Wed, Oct 26, 2016 at 12:21 AM, Sunita Arvind <sunitarv...@gmail.com>
wrote:

> Ajay,
>
> AFAIK, generally these contexts cannot be accessed within loops. The SQL
> query itself runs on distributed datasets, so it is already a parallel
> execution; putting it inside a foreach would nest one parallel operation
> inside another, so serialization becomes hard. Not sure I explained it right.
>
> If you can create the DataFrame in main, you can register it as a table
> and run the queries in the main method itself. You don't need to coalesce
> or run the method within foreach.
>
> Regards
> Sunita
>
> On Tuesday, October 25, 2016, Ajay Chander <itsche...@gmail.com> wrote:
>
>>
>> Jeff, thanks for your response. I see the error below in the logs. Do you
>> think it has anything to do with hiveContext? Do I have to serialize it
>> before using it inside foreach?
>>
>> 16/10/19 15:16:23 ERROR scheduler.LiveListenerBus: Listener SQLListener
>> threw an exception
>> java.lang.NullPointerException
>> at org.apache.spark.sql.execution.ui.SQLListener.onTaskEnd(SQLL
>> istener.scala:167)
>> at org.apache.spark.scheduler.SparkListenerBus$class.onPostEven
>> t(SparkListenerBus.scala:42)
>> at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveL
>> istenerBus.scala:31)
>> at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveL
>> istenerBus.scala:31)
>> at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBu
>> s.scala:55)
>> at org.apache.spark.util.AsynchronousListenerBus.postToAll(Asyn
>> chronousListenerBus.scala:37)
>> at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonf
>> un$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(Asynchronous
>> ListenerBus.scala:80)
>> at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonf
>> un$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65)
>> at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonf
>> un$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65)
>> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
>> at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonf
>> un$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:64)
>> at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.sca
>> la:1181)
>> at org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(As
>> ynchronousListenerBus.scalnerBus.scala:63)
>>
>> Thanks,
>> Ajay
>>
>> On Tue, Oct 25, 2016 at 11:45 PM, Jeff Zhang <zjf...@gmail.com> wrote:
>>
>>>
>>> In your sample code, you can use hiveContext in the foreach as it is
>>> scala List foreach operation which runs in driver side. But you cannot use
>>> hiveContext in RDD.foreach
>>>
>>>
>>>
>>> Ajay Chander <itsche...@gmail.com>于2016年10月26日周三 上午11:28写道:
>>>
>>>> Hi Everyone,
>>>>
>>>> I was thinking if I can use hiveContext inside foreach like below,
>>>>
>>>> object Test {
>>>>   def main(args: Array[String]): Unit = {
>>>>
>>>> val conf = new SparkConf()
>>>> val sc = new SparkContext(conf)
>>>> val hiveContext = new HiveContext(sc)
>>>>
>>>> val dataElementsFile = args(0)
>>>> val deDF = 
>>>> hiveContext.read.text(dataElementsFile).toDF("DataElement").coalesce(1).distinct().cache()
>>>>
>>>> def calculate(de: Row) {
>>>>   val dataElement = de.getAs[String]("DataElement").trim
>>>>   val df1 = hiveContext.sql("SELECT cyc_dt, supplier_proc_i, '" + 
>>>> dataElement + "' as data_elm, " + dataElement + " as data_elm_val FROM 
>>>> TEST_DB.TEST_TABLE1 ")
>>>>   df1.write.insertInto("TEST_DB.TEST_TABLE1")
>>>> }
>>>>
>>>> deDF.collect().foreach(calculate)
>>>>   }
>>>> }
>>>>
>>>>
>>>> I looked at 
>>>> https://spark.apache.org/docs/1.6.0/api/scala/index.html#org.apache.spark.sql.hive.HiveContext
>>>>  and I see it is extending SqlContext which extends Logging with 
>>>> Serializable.
>>>>
>>>> Can anyone tell me if this is the right way to use it ? Thanks for your 
>>>> time.
>>>>
>>>> Regards,
>>>>
>>>> Ajay
>>>>
>>>>
>>


Re: HiveContext is Serialized?

2016-10-25 Thread Sunita Arvind
Ajay,

AFAIK, generally these contexts cannot be accessed within loops. The SQL
query itself runs on distributed datasets, so it is already a parallel
execution; putting it inside a foreach would nest one parallel operation
inside another, so serialization becomes hard. Not sure I explained it right.

If you can create the DataFrame in main, you can register it as a table and
run the queries in the main method itself. You don't need to coalesce or run
the method within foreach.
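
A sketch of that alternative in current PySpark terms (createOrReplaceTempView
corresponds to registerTempTable in Spark 1.6; the target table here is
hypothetical): register the small list as a view and express the work as a
single SQL statement in main().

# Register the small driver-side DataFrame as a temp view so the job can be
# expressed as ordinary SQL instead of a collect() loop.
de_df.createOrReplaceTempView("data_elements")

# The per-element value column of the original example would still need an
# explicit unpivot (e.g. the SQL stack() function).
spark.sql("""
    INSERT INTO TABLE TEST_DB.TEST_TABLE2
    SELECT t.cyc_dt, t.supplier_proc_i, d.DataElement AS data_elm
    FROM TEST_DB.TEST_TABLE1 t
    CROSS JOIN data_elements d
""")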

Regards
Sunita

On Tuesday, October 25, 2016, Ajay Chander <itsche...@gmail.com> wrote:

>
> Jeff, thanks for your response. I see the error below in the logs. Do you
> think it has anything to do with hiveContext? Do I have to serialize it
> before using it inside foreach?
>
> 16/10/19 15:16:23 ERROR scheduler.LiveListenerBus: Listener SQLListener
> threw an exception
> java.lang.NullPointerException
> at org.apache.spark.sql.execution.ui.SQLListener.onTaskEnd(
> SQLListener.scala:167)
> at org.apache.spark.scheduler.SparkListenerBus$class.onPostEven
> t(SparkListenerBus.scala:42)
> at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveL
> istenerBus.scala:31)
> at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveL
> istenerBus.scala:31)
> at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBu
> s.scala:55)
> at org.apache.spark.util.AsynchronousListenerBus.postToAll(Asyn
> chronousListenerBus.scala:37)
> at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$
> anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(Asynchro
> nousListenerBus.scala:80)
> at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$
> anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousLis
> tenerBus.scala:65)
> at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$
> anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousLis
> tenerBus.scala:65)
> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
> at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$
> anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:64)
> at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.sca
> la:1181)
> at org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(
> AsynchronousListenerBus.scalnerBus.scala:63)
>
> Thanks,
> Ajay
>
> On Tue, Oct 25, 2016 at 11:45 PM, Jeff Zhang <zjf...@gmail.com
> <javascript:_e(%7B%7D,'cvml','zjf...@gmail.com');>> wrote:
>
>>
>> In your sample code, you can use hiveContext in the foreach as it is
>> scala List foreach operation which runs in driver side. But you cannot use
>> hiveContext in RDD.foreach
>>
>>
>>
>> Ajay Chander <itsche...@gmail.com
>> <javascript:_e(%7B%7D,'cvml','itsche...@gmail.com');>>于2016年10月26日周三
>> 上午11:28写道:
>>
>>> Hi Everyone,
>>>
>>> I was thinking if I can use hiveContext inside foreach like below,
>>>
>>> object Test {
>>>   def main(args: Array[String]): Unit = {
>>>
>>> val conf = new SparkConf()
>>> val sc = new SparkContext(conf)
>>> val hiveContext = new HiveContext(sc)
>>>
>>> val dataElementsFile = args(0)
>>> val deDF = 
>>> hiveContext.read.text(dataElementsFile).toDF("DataElement").coalesce(1).distinct().cache()
>>>
>>> def calculate(de: Row) {
>>>   val dataElement = de.getAs[String]("DataElement").trim
>>>   val df1 = hiveContext.sql("SELECT cyc_dt, supplier_proc_i, '" + 
>>> dataElement + "' as data_elm, " + dataElement + " as data_elm_val FROM 
>>> TEST_DB.TEST_TABLE1 ")
>>>   df1.write.insertInto("TEST_DB.TEST_TABLE1")
>>> }
>>>
>>> deDF.collect().foreach(calculate)
>>>   }
>>> }
>>>
>>>
>>> I looked at 
>>> https://spark.apache.org/docs/1.6.0/api/scala/index.html#org.apache.spark.sql.hive.HiveContext
>>>  and I see it is extending SqlContext which extends Logging with 
>>> Serializable.
>>>
>>> Can anyone tell me if this is the right way to use it ? Thanks for your 
>>> time.
>>>
>>> Regards,
>>>
>>> Ajay
>>>
>>>
>


Re: HiveContext is Serialized?

2016-10-25 Thread Ajay Chander
Jeff, thanks for your response. I see the error below in the logs. Do you
think it has anything to do with hiveContext? Do I have to serialize it before
using it inside foreach?

16/10/19 15:16:23 ERROR scheduler.LiveListenerBus: Listener SQLListener
threw an exception
java.lang.NullPointerException
at org.apache.spark.sql.execution.ui.SQLListener.onTaskEnd(SQLListener.scala:167)
at org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:42)
at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55)
at org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AsynchronousListenerBus.scala:80)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:64)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1181)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:63)

Thanks,
Ajay

On Tue, Oct 25, 2016 at 11:45 PM, Jeff Zhang <zjf...@gmail.com> wrote:

>
> In your sample code, you can use hiveContext in the foreach as it is scala
> List foreach operation which runs in driver side. But you cannot use
> hiveContext in RDD.foreach
>
>
>
> Ajay Chander <itsche...@gmail.com> wrote on Wednesday, 26 October 2016 at 11:28 AM:
>
>> Hi Everyone,
>>
>> I was thinking if I can use hiveContext inside foreach like below,
>>
>> object Test {
>>   def main(args: Array[String]): Unit = {
>>
>> val conf = new SparkConf()
>> val sc = new SparkContext(conf)
>> val hiveContext = new HiveContext(sc)
>>
>> val dataElementsFile = args(0)
>> val deDF = 
>> hiveContext.read.text(dataElementsFile).toDF("DataElement").coalesce(1).distinct().cache()
>>
>> def calculate(de: Row) {
>>   val dataElement = de.getAs[String]("DataElement").trim
>>   val df1 = hiveContext.sql("SELECT cyc_dt, supplier_proc_i, '" + 
>> dataElement + "' as data_elm, " + dataElement + " as data_elm_val FROM 
>> TEST_DB.TEST_TABLE1 ")
>>   df1.write.insertInto("TEST_DB.TEST_TABLE1")
>> }
>>
>> deDF.collect().foreach(calculate)
>>   }
>> }
>>
>>
>> I looked at 
>> https://spark.apache.org/docs/1.6.0/api/scala/index.html#org.apache.spark.sql.hive.HiveContext
>>  and I see it is extending SqlContext which extends Logging with 
>> Serializable.
>>
>> Can anyone tell me if this is the right way to use it ? Thanks for your time.
>>
>> Regards,
>>
>> Ajay
>>
>>


Re: HiveContext is Serialized?

2016-10-25 Thread Jeff Zhang
In your sample code, you can use hiveContext in the foreach as it is scala
List foreach operation which runs in driver side. But you cannot use
hiveContext in RDD.foreach
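
A minimal sketch of that distinction (the SHOW TABLES / DESCRIBE statements are only placeholders):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object ForeachScopeSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ForeachScopeSketch"))
    val hiveContext = new HiveContext(sc)

    // Driver side: collect() returns a local Array, so this foreach runs in the
    // driver JVM and can freely use hiveContext.
    hiveContext.sql("SHOW TABLES").collect().foreach { row =>
      hiveContext.sql(s"DESCRIBE ${row.getString(0)}").show()
    }

    // Executor side: RDD.foreach ships the closure to the executors, where no
    // hiveContext (or SparkContext) exists, so this would fail at runtime:
    // sc.parallelize(1 to 10).foreach { i =>
    //   hiveContext.sql(s"SELECT $i")   // not allowed on executors
    // }
  }
}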



Ajay Chander <itsche...@gmail.com> wrote on Wednesday, 26 October 2016 at 11:28 AM:

> Hi Everyone,
>
> I was thinking if I can use hiveContext inside foreach like below,
>
> object Test {
>   def main(args: Array[String]): Unit = {
>
> val conf = new SparkConf()
> val sc = new SparkContext(conf)
>     val hiveContext = new HiveContext(sc)
>
> val dataElementsFile = args(0)
> val deDF = 
> hiveContext.read.text(dataElementsFile).toDF("DataElement").coalesce(1).distinct().cache()
>
> def calculate(de: Row) {
>   val dataElement = de.getAs[String]("DataElement").trim
>   val df1 = hiveContext.sql("SELECT cyc_dt, supplier_proc_i, '" + 
> dataElement + "' as data_elm, " + dataElement + " as data_elm_val FROM 
> TEST_DB.TEST_TABLE1 ")
>   df1.write.insertInto("TEST_DB.TEST_TABLE1")
> }
>
> deDF.collect().foreach(calculate)
>   }
> }
>
>
> I looked at 
> https://spark.apache.org/docs/1.6.0/api/scala/index.html#org.apache.spark.sql.hive.HiveContext
>  and I see it is extending SqlContext which extends Logging with Serializable.
>
> Can anyone tell me if this is the right way to use it ? Thanks for your time.
>
> Regards,
>
> Ajay
>
>


HiveContext is Serialized?

2016-10-25 Thread Ajay Chander
Hi Everyone,

I was thinking if I can use hiveContext inside foreach like below,

object Test {
  def main(args: Array[String]): Unit = {

val conf = new SparkConf()
val sc = new SparkContext(conf)
val hiveContext = new HiveContext(sc)

val dataElementsFile = args(0)
val deDF = 
hiveContext.read.text(dataElementsFile).toDF("DataElement").coalesce(1).distinct().cache()

def calculate(de: Row) {
  val dataElement = de.getAs[String]("DataElement").trim
  val df1 = hiveContext.sql("SELECT cyc_dt, supplier_proc_i, '" +
dataElement + "' as data_elm, " + dataElement + " as data_elm_val FROM
TEST_DB.TEST_TABLE1 ")
  df1.write.insertInto("TEST_DB.TEST_TABLE1")
}

deDF.collect().foreach(calculate)
  }
}


I looked at 
https://spark.apache.org/docs/1.6.0/api/scala/index.html#org.apache.spark.sql.hive.HiveContext
and I see it is extending SqlContext which extends Logging with
Serializable.

Can anyone tell me if this is the right way to use it ? Thanks for your time.

Regards,

Ajay


Re: Creating HiveContext within Spark streaming

2016-09-08 Thread Mich Talebzadeh
Hi Todd,

Thanks for the hint.

As it happened this works

//Create the sparkconf for streaming as usual

 val sparkConf = new SparkConf().
 setAppName(sparkAppName).
 set("spark.driver.allowMultipleContexts", "true").
 set("spark.hadoop.validateOutputSpecs", "false")
 // change the values accordingly.
 sparkConf.set("sparkDefaultParllelism",
sparkDefaultParallelismValue)
 sparkConf.set("sparkSerializer", sparkSerializerValue)
 sparkConf.set("sparkNetworkTimeOut", sparkNetworkTimeOutValue)
 // If you want to see more details of batches please increase
the value
 // and that will be shown UI.
 sparkConf.set("sparkStreamingUiRetainedBatches",
   sparkStreamingUiRetainedBatchesValue)
 sparkConf.set("sparkWorkerUiRetainedDrivers",
   sparkWorkerUiRetainedDriversValue)
 sparkConf.set("sparkWorkerUiRetainedExecutors",
   sparkWorkerUiRetainedExecutorsValue)
 sparkConf.set("sparkWorkerUiRetainedStages",
   sparkWorkerUiRetainedStagesValue)
 sparkConf.set("sparkUiRetainedJobs", sparkUiRetainedJobsValue)

*sparkConf.set("enableHiveSupport","true")* if (memorySet ==
"T") {
   sparkConf.set("spark.driver.memory", "18432M")
 }

sparkConf.set("spark.streaming.stopGracefullyOnShutdown","true")
 sparkConf.set("spark.streaming.receiver.writeAheadLog.enable",
"true")

sparkConf.set("spark.streaming.driver.writeAheadLog.closeFileAfterWrite",
"true")

sparkConf.set("spark.streaming.receiver.writeAheadLog.closeFileAfterWrite",
"true")

 val batchInterval = 2

// Create the streamingContext

 val streamingContext = new StreamingContext(sparkConf,
Seconds(batchInterval))
// Create SparkContext based on streamingContext

 val sparkContext  = streamingContext.sparkContext

// Create HiveContext based on streamingContext and sparkContext

val HiveContext = new HiveContext(streamingContext.sparkContext)


And that works although sometimes it feels like black art to make it work :)

Regards



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 8 September 2016 at 15:08, Todd Nist <tsind...@gmail.com> wrote:

> Hi Mich,
>
> Perhaps the issue is having multiple SparkContexts in the same JVM (
> https://issues.apache.org/jira/browse/SPARK-2243).
> While it is possible, I don't think it is encouraged.
>
> As you know, the call you're currently invoking to create the
> StreamingContext also creates a SparkContext.
>
> /** * Create a StreamingContext by providing the configuration necessary
> for a new SparkContext.
> * @param conf a org.apache.spark.SparkConf object specifying Spark
> parameters
> * @param batchDuration the time interval at which streaming data will be
> divided into batches
> */
> def this(conf: SparkConf, batchDuration: Duration) = {
> this(StreamingContext.createNewSparkContext(conf), null, batchDuration)
> }
>
>
> Could you rearrange the code slightly to either create the SparkContext
> first and pass that to the creation of the StreamingContext, like below:
>
> val sc = new SparkContext(sparkConf)
> val streamingContext = new StreamingContext(sc, Seconds(batchInterval))
>
> *val HiveContext = new HiveContext(sc)*
>
> Or remove / replace the line in red from your code and just set the val
> sparkContext = streamingContext.sparkContext.
>
> val streamingContext = new StreamingContext(sparkConf,
> Seconds(batchInterval))
> *val sparkContext  = new SparkContext(sparkConf)*
> val HiveContext = new HiveContext(streamingContext.sparkContext)
>
> HTH.
>
> -Todd
>
>
> On Thu, Sep 8, 2016 at 9:11 AM, Mich Talebzadeh <mich.talebza...@gmail.com
> > wrote:
>
>> Ok I managed to sort that one out.
>>
>> This is what I am facing
>>
>>  val sparkConf = new SparkConf().
>>  setAppName(sparkAppName).
>>  set("spark.driver.al

Re: Creating HiveContext within Spark streaming

2016-09-08 Thread Todd Nist
Hi Mich,

Perhaps the issue is having multiple SparkContexts in the same JVM (
https://issues.apache.org/jira/browse/SPARK-2243).
While it is possible, I don't think it is encouraged.

As you know, the call you're currently invoking to create the
StreamingContext also creates a SparkContext.

/** * Create a StreamingContext by providing the configuration necessary
for a new SparkContext.
* @param conf a org.apache.spark.SparkConf object specifying Spark
parameters
* @param batchDuration the time interval at which streaming data will be
divided into batches
*/
def this(conf: SparkConf, batchDuration: Duration) = {
this(StreamingContext.createNewSparkContext(conf), null, batchDuration)
}


Could you rearrange the code slightly to either create the SparkContext
first and pass that to the creation of the StreamingContext, like below:

val sc = new SparkContext(sparkConf)
val streamingContext = new StreamingContext(sc, Seconds(batchInterval))

*val HiveContext = new HiveContext(sc)*

Or remove / replace the line in red from your code and just set the val
sparkContext = streamingContext.sparkContext.

val streamingContext = new StreamingContext(sparkConf,
Seconds(batchInterval))
*val sparkContext  = new SparkContext(sparkConf)*
val HiveContext = new HiveContext(streamingContext.sparkContext)
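
Put together as a complete, self-contained sketch of the first option (app name and batch interval are only placeholders):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWithHiveSketch {
  def main(args: Array[String]): Unit = {
    // Create the single SparkContext first ...
    val sc = new SparkContext(new SparkConf().setAppName("StreamingWithHiveSketch"))

    // ... then build both contexts on top of it, so no second SparkContext is
    // ever created in the same JVM.
    val ssc = new StreamingContext(sc, Seconds(2))
    val hiveContext = new HiveContext(sc)

    // Define DStreams on ssc here (e.g. ssc.socketTextStream(...)) and use
    // hiveContext inside foreachRDD, whose body runs on the driver, then:
    // ssc.start()
    // ssc.awaitTermination()
  }
}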

HTH.

-Todd


On Thu, Sep 8, 2016 at 9:11 AM, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Ok I managed to sort that one out.
>
> This is what I am facing
>
>  val sparkConf = new SparkConf().
>  setAppName(sparkAppName).
>  set("spark.driver.allowMultipleContexts", "true").
>  set("spark.hadoop.validateOutputSpecs", "false")
>  // change the values accordingly.
>  sparkConf.set("sparkDefaultParllelism",
> sparkDefaultParallelismValue)
>  sparkConf.set("sparkSerializer", sparkSerializerValue)
>  sparkConf.set("sparkNetworkTimeOut",
> sparkNetworkTimeOutValue)
>  // If you want to see more details of batches please increase
> the value
>  // and that will be shown UI.
>  sparkConf.set("sparkStreamingUiRetainedBatches",
>sparkStreamingUiRetainedBatchesValue)
>  sparkConf.set("sparkWorkerUiRetainedDrivers",
>sparkWorkerUiRetainedDriversValue)
>  sparkConf.set("sparkWorkerUiRetainedExecutors",
>sparkWorkerUiRetainedExecutorsValue)
>  sparkConf.set("sparkWorkerUiRetainedStages",
>sparkWorkerUiRetainedStagesValue)
>  sparkConf.set("sparkUiRetainedJobs",
> sparkUiRetainedJobsValue)
>  sparkConf.set("enableHiveSupport",enableHiveSupportValue)
>  sparkConf.set("spark.streaming.stopGracefullyOnShutdown","
> true")
>  sparkConf.set("spark.streaming.receiver.writeAheadLog.enable",
> "true")
>  
> sparkConf.set("spark.streaming.driver.writeAheadLog.closeFileAfterWrite",
> "true")
>  
> sparkConf.set("spark.streaming.receiver.writeAheadLog.closeFileAfterWrite",
> "true")
>  var sqltext = ""
>  val batchInterval = 2
>  val streamingContext = new StreamingContext(sparkConf,
> Seconds(batchInterval))
>
> With the above settings,  Spark streaming works fine. *However, after
> adding the first line below (in red)*
>
> *val sparkContext  = new SparkContext(sparkConf)*
> val HiveContext = new HiveContext(streamingContext.sparkContext)
>
> I get the following errors:
>
> 16/09/08 14:02:32 ERROR JobScheduler: Error running job streaming job
> 1473339752000 ms.0
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 1
> in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage
> 0.0 (TID 7, 50.140.197.217): java.io.IOException:
> *org.apache.spark.SparkException: Failed to get broadcast_0_piece0 of broadcast_0*
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1260)
> at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:174)
> at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:65)
> at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:65)
> at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:89)
> at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
> at org.apache.spark.scheduler.ResultTask.

Re: Creating HiveContext within Spark streaming

2016-09-08 Thread Mich Talebzadeh
Ok I managed to sort that one out.

This is what I am facing

 val sparkConf = new SparkConf().
 setAppName(sparkAppName).
 set("spark.driver.allowMultipleContexts", "true").
 set("spark.hadoop.validateOutputSpecs", "false")
 // change the values accordingly.
 sparkConf.set("sparkDefaultParllelism",
sparkDefaultParallelismValue)
 sparkConf.set("sparkSerializer", sparkSerializerValue)
 sparkConf.set("sparkNetworkTimeOut", sparkNetworkTimeOutValue)
 // If you want to see more details of batches please increase
the value
 // and that will be shown UI.
 sparkConf.set("sparkStreamingUiRetainedBatches",
   sparkStreamingUiRetainedBatchesValue)
 sparkConf.set("sparkWorkerUiRetainedDrivers",
   sparkWorkerUiRetainedDriversValue)
 sparkConf.set("sparkWorkerUiRetainedExecutors",
   sparkWorkerUiRetainedExecutorsValue)
 sparkConf.set("sparkWorkerUiRetainedStages",
   sparkWorkerUiRetainedStagesValue)
 sparkConf.set("sparkUiRetainedJobs", sparkUiRetainedJobsValue)
 sparkConf.set("enableHiveSupport",enableHiveSupportValue)

sparkConf.set("spark.streaming.stopGracefullyOnShutdown","true")
 sparkConf.set("spark.streaming.receiver.writeAheadLog.enable",
"true")

sparkConf.set("spark.streaming.driver.writeAheadLog.closeFileAfterWrite",
"true")

sparkConf.set("spark.streaming.receiver.writeAheadLog.closeFileAfterWrite",
"true")
 var sqltext = ""
 val batchInterval = 2
 val streamingContext = new StreamingContext(sparkConf,
Seconds(batchInterval))

With the above settings,  Spark streaming works fine. *However, after
adding the first line below (in red)*

*val sparkContext  = new SparkContext(sparkConf)*
val HiveContext = new HiveContext(streamingContext.sparkContext)

I get the following errors:

16/09/08 14:02:32 ERROR JobScheduler: Error running job streaming job
1473339752000 ms.0
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1
in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage
0.0 (TID 7, 50.140.197.217): java.io.IOException:
*org.apache.spark.SparkException: Failed to get broadcast_0_piece0 of broadcast_0*
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1260)
at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:174)
at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:65)
at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:65)
at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:89)
at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:67)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Failed to get
broadcast_0_piece0 of broadcast_0


Hm any ideas?

Thanks





Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 8 September 2016 at 12:28, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

>
> Hi,
>
> This may not be feasible in Spark streaming.
>
> I am trying to create a HiveContext in Spark streaming within the
> streaming context
>
> // Create a local StreamingContext with two working thread and batch
> interval of 2 seconds.
>
>  val sparkConf = new SparkConf().
>  setAppName(sparkAppName).
>  set("spark.driver.allowMultipleContexts", "true").
>  set("spark.hadoop.validateOutputSpecs", "false")
> .
>
> Now try to create an sc
>
> val sc = new 

Creating HiveContext within Spark streaming

2016-09-08 Thread Mich Talebzadeh
Hi,

This may not be feasible in Spark streaming.

I am trying to create a HiveContext in Spark streaming within the streaming
context

// Create a local StreamingContext with two working thread and batch
interval of 2 seconds.

 val sparkConf = new SparkConf().
 setAppName(sparkAppName).
 set("spark.driver.allowMultipleContexts", "true").
 set("spark.hadoop.validateOutputSpecs", "false")
.

Now try to create an sc

val sc = new SparkContext(sparkConf)
val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

This is accepted but it creates two spark jobs


[image: Inline images 1]

And basically it goes to a waiting state

Any ideas how one  can create a HiveContext within Spark streaming?

Thanks






Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Re: Table registered using registerTempTable not found in HiveContext

2016-08-11 Thread Mich Talebzadeh
this is Spark 2

you create temp table from df using HiveContext

val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
scala> s.registerTempTable("tmp")
scala> HiveContext.sql("select count(1) from tmp")
res18: org.apache.spark.sql.DataFrame = [count(1): bigint]
scala> HiveContext.sql("select count(1) from tmp").show
[Stage 5:>(0 + 1) /
100]

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 11 August 2016 at 17:27, Richard M <richard.moorh...@gmail.com> wrote:

> How are you calling registerTempTable from hiveContext? It appears to be a
> private method.
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/Table-registered-using-registerTempTable-not-found-
> in-HiveContext-tp26555p27514.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Table registered using registerTempTable not found in HiveContext

2016-08-11 Thread Richard M
How are you calling registerTempTable from hiveContext? It appears to be a
private method.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Table-registered-using-registerTempTable-not-found-in-HiveContext-tp26555p27514.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: SPARKSQL with HiveContext My job fails

2016-08-04 Thread Mich Talebzadeh
Well the error states


Exception in thread thread_name: java.lang.OutOfMemoryError: GC Overhead
limit exceeded

Cause: The detail message "GC overhead limit exceeded" indicates that the
garbage collector is running all the time and the Java program is making very
slow progress. After a garbage collection, if the Java process is spending
more than approximately 98% of its time doing garbage collection and if it
is recovering less than 2% of the heap and has been doing so for the last 5
(compile time constant) consecutive garbage collections, then a
java.lang.OutOfMemoryError is thrown. This exception is typically thrown
because the amount of live data barely fits into the Java heap having
little free space for new allocations.
Action: Increase the heap size. The java.lang.OutOfMemoryError exception
for *GC Overhead limit exceeded* can be turned off with the command line
flag -XX:-UseGCOverheadLimit.
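
If it really is heap pressure, the usual first step is to raise the memory settings when the job is submitted; a minimal sketch (the 8g figure and the GC flag are purely illustrative, not a recommendation for this particular job):

import org.apache.spark.{SparkConf, SparkContext}

// Executor heap can be set in code before the SparkContext is created.
// In yarn-client mode the driver JVM is already running at this point, so
// the driver heap has to go on spark-submit instead (--driver-memory 8g).
val conf = new SparkConf()
  .setAppName("HeapTuningSketch")
  .set("spark.executor.memory", "8g")
  .set("spark.executor.extraJavaOptions", "-XX:+UseG1GC")
val sc = new SparkContext(conf)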

We still don't know what the code is doing; you have not provided that
info. Are you running Spark on YARN? Have you checked the YARN logs?


HTH




Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 4 August 2016 at 10:49, Vasu Devan  wrote:

> Hi Team,
>
> My Spark job fails with below error :
>
> Could you please advise me on what the problem with my job is.
>
> Below is my error stack:
>
> 16/08/04 05:11:06 ERROR ActorSystemImpl: Uncaught fatal error from thread
> [sparkDriver-akka.actor.default-dispatcher-14] shutting down ActorSystem
> [sparkDriver]
> java.lang.OutOfMemoryError: GC overhead limit exceeded
> at sun.reflect.ByteVectorImpl.trim(ByteVectorImpl.java:70)
> at
> sun.reflect.MethodAccessorGenerator.generate(MethodAccessorGenerator.java:388)
> at
> sun.reflect.MethodAccessorGenerator.generateMethod(MethodAccessorGenerator.java:77)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:46)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
> at
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
> at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
> at
> akka.serialization.JavaSerializer$$anonfun$1.apply(Serializer.scala:136)
> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
> at
> akka.serialization.JavaSerializer.fromBinary(Serializer.scala:136)
> at
> akka.serialization.Serialization$$anonfun$deserialize$1.apply(Serialization.scala:104)
> at scala.util.Try$.apply(Try.scala:161)
> 16/08/04 05:11:06 INFO RemoteActorRefProvider$RemotingTerminator: Shutting
> down remote daemon.
> 16/08/04 05:11:07 INFO 

SPARKSQL with HiveContext My job fails

2016-08-04 Thread Vasu Devan
Hi Team,

My Spark job fails with below error :

Could you please advise me on what the problem with my job is.

Below is my error stack:

16/08/04 05:11:06 ERROR ActorSystemImpl: Uncaught fatal error from thread
[sparkDriver-akka.actor.default-dispatcher-14] shutting down ActorSystem
[sparkDriver]
java.lang.OutOfMemoryError: GC overhead limit exceeded
at sun.reflect.ByteVectorImpl.trim(ByteVectorImpl.java:70)
at
sun.reflect.MethodAccessorGenerator.generate(MethodAccessorGenerator.java:388)
at
sun.reflect.MethodAccessorGenerator.generateMethod(MethodAccessorGenerator.java:77)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:46)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at
java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
at
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at
akka.serialization.JavaSerializer$$anonfun$1.apply(Serializer.scala:136)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at
akka.serialization.JavaSerializer.fromBinary(Serializer.scala:136)
at
akka.serialization.Serialization$$anonfun$deserialize$1.apply(Serialization.scala:104)
at scala.util.Try$.apply(Try.scala:161)
16/08/04 05:11:06 INFO RemoteActorRefProvider$RemotingTerminator: Shutting
down remote daemon.
16/08/04 05:11:07 INFO RemoteActorRefProvider$RemotingTerminator: Remote
daemon shut down; proceeding with flushing remote transports.
16/08/04 05:11:07 INFO TaskSetManager: Finished task 18540.0 in stage 148.0
(TID 153058) in 190291 ms on lhrrhegapq005.enterprisenet.org (18536/32768)
16/08/04 05:11:07 INFO TaskSetManager: Finished task 18529.0 in stage 148.0
(TID 153044) in 190300 ms on lhrrhegapq008.enterprisenet.org (18537/32768)
16/08/04 05:11:07 INFO TaskSetManager: Finished task 18530.0 in stage 148.0
(TID 153049) in 190297 ms on lhrrhegapq005.enterprisenet.org (18538/32768)
16/08/04 05:11:07 INFO TaskSetManager: Finished task 18541.0 in stage 148.0
(TID 153062) in 190291 ms on lhrrhegapq006.enterprisenet.org (18539/32768)
16/08/04 05:11:09 INFO TaskSetManager: Finished task 18537.0 in stage 148.0
(TID 153057) in 191648 ms on lhrrhegapq003.enterprisenet.org (18540/32768)
16/08/04 05:11:10 INFO TaskSetManager: Finished task 18557.0 in stage 148.0
(TID 153073) in 193193 ms on lhrrhegapq003.enterprisenet.org (18541/32768)
16/08/04 05:11:10 INFO TaskSetManager: Finished task 18528.0 in stage 148.0
(TID 153045) in 193206 ms on lhrrhegapq007.enterprisenet.org (18542/32768)
16/08/04 05:11:10 INFO TaskSetManager: Finished task 18555.0 in stage 148.0
(TID 153072) in 193195 ms on lhrrhegapq002.enterprisenet.org (18543/32768)
16/08/04 05:11:10 ERROR YarnClientSchedulerBackend: Yarn application has
already exited with state FINISHED!
16/08/04 05:11:13 WARN QueuedThreadPool: 9 threads could not be stopped
16/08/04 05:11:13 INFO SparkUI: Stopped Spark web UI at
http://10.90.50.64:4043
16/08/04 05:11:15 INFO DAGScheduler: Stopping DAGScheduler
16/08/04 05:11:16 INFO DAGScheduler: Job 94 failed: save at
ndx_scala_util.scala:1264, took 232.788303 s
16/08/04 05:11:16 ERROR InsertIntoHadoopFsRelation: Aborting job.
org.apache.spark.SparkException: Job cancelled because SparkContext was
shut down
at

HiveContext , difficulties in accessing tables in hive schema's/database's other than default database.

2016-07-19 Thread satyajit vegesna
Hi All,

I have been trying to access tables from schemas other than default to pull
data into a dataframe.

I was successful in doing it using the default schema in the Hive database.
But when I try any other schema/database in Hive, I get the error below.
(I have also not seen any examples related to accessing tables in a
schema/database other than default.)

16/07/19 18:16:06 INFO hive.metastore: Connected to metastore.
16/07/19 18:16:08 INFO storage.MemoryStore: Block broadcast_0 stored as
values in memory (estimated size 472.3 KB, free 472.3 KB)
16/07/19 18:16:08 INFO storage.MemoryStore: Block broadcast_0_piece0 stored
as bytes in memory (estimated size 39.6 KB, free 511.9 KB)
16/07/19 18:16:08 INFO storage.BlockManagerInfo: Added broadcast_0_piece0
in memory on localhost:41434 (size: 39.6 KB, free: 2.4 GB)
16/07/19 18:16:08 INFO spark.SparkContext: Created broadcast 0 from show at
sparkHive.scala:70
Exception in thread "main" java.lang.NoSuchMethodError:
org.apache.hadoop.hive.ql.exec.Utilities.copyTableJobPropertiesToConf(Lorg/apache/hadoop/hive/ql/plan/TableDesc;Lorg/apache/hadoop/mapred/JobConf;)V
at
org.apache.spark.sql.hive.HadoopTableReader$.initializeLocalJobConfFunc(TableReader.scala:324)
at
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$12.apply(TableReader.scala:276)
at
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$12.apply(TableReader.scala:276)
at
org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176)
at
org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176)
at scala.Option.map(Option.scala:145)
at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:176)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:195)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:190)
at
org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:165)
at
org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174)
at
org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1538)
at
org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1538)
at
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
at org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2125)
at org.apache.spark.sql.DataFrame.org
$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1537)
at org.apache.spark.sql.DataFrame.org
$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1544)
at
org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1414)
at
org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1413)
at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2138)
at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1413)
at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1495)
at 

Re: HiveContext

2016-07-01 Thread Mich Talebzadeh
hi,

In general, if your ORC table is not bucketed it is not going to do much.

The idea is that using predicate pushdown you will only get the data from
the partition concerned and avoid expensive table scans!

ORC provides what is known as a storage index at file, stripe and row-group
levels (default 10K rows). That is just statistics for min, avg and max for
each column.

Now going back to practicality, you can do a simple test. Log in to Hive
and run your query with EXPLAIN EXTENDED select ... and see what you see.

Then try it from Spark. As far as I am aware, Spark will not rely on
anything Hive-wise except the metadata; it will use its DAG and in-memory
capability to do the query.

just try it and see.
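
For example, something along these lines (test_db.sales and the age column are hypothetical):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("OrcPushdownCheck"))
val hiveContext = new HiveContext(sc)
hiveContext.setConf("spark.sql.orc.filterPushdown", "true")

// Compare this plan with what EXPLAIN EXTENDED shows in the hive CLI and
// check whether the age predicate is pushed down to the scan.
val df = hiveContext.sql("SELECT * FROM test_db.sales WHERE age < 15")
df.explain(true)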

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 1 July 2016 at 11:00, manish jaiswal <manishsr...@gmail.com> wrote:

> Hi,
>
> Using sparkHiveContext, we read all rows where age was between 0 and 100,
> even though we requested only rows where age was less than 15. Such full
> table scanning is an expensive operation.
>
> ORC avoids this type of overhead by using predicate push-down with three
> levels of built-in indexes within each file: file level, stripe level, and
> row level:
>
>-
>
>File and stripe level statistics are in the file footer, making it
>easy to determine if the rest of the file needs to be read.
>-
>
>Row level indexes include column statistics for each row group and
>position, for seeking to the start of the row group.
>
> ORC utilizes these indexes to move the filter operation to the data
> loading phase, by reading only data that potentially includes required rows.
>
>
> My doubt is: when we give some query to hiveContext on an ORC table using
> Spark with
>
> sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
>
> how will it perform?
>
> 1. Will it fetch only those records from the ORC file according to the query, or
>
> 2. will it take the ORC file into Spark and then perform the Spark job using
> predicate push-down
>
> and give you the records?
>
> (I am aware that hiveContext gives Spark only the metadata and the location of the data.)
>
>
> Thanks
>
> Manish
>
>


HiveContext

2016-07-01 Thread manish jaiswal
Hi,

Using sparkHiveContext, we read all rows where age was between 0 and 100,
even though we requested only rows where age was less than 15. Such full
table scanning is an expensive operation.

ORC avoids this type of overhead by using predicate push-down with three
levels of built-in indexes within each file: file level, stripe level, and
row level:

   -

   File and stripe level statistics are in the file footer, making it easy
   to determine if the rest of the file needs to be read.
   -

   Row level indexes include column statistics for each row group and
   position, for seeking to the start of the row group.

ORC utilizes these indexes to move the filter operation to the data loading
phase, by reading only data that potentially includes required rows.


My doubt is: when we give some query to hiveContext on an ORC table using
Spark with

sqlContext.setConf("spark.sql.orc.filterPushdown", "true")

how will it perform?

1. Will it fetch only those records from the ORC file according to the query, or

2. will it take the ORC file into Spark and then perform the Spark job using
predicate push-down

and give you the records?

(I am aware that hiveContext gives Spark only the metadata and the location of the data.)


Thanks

Manish


HiveContext

2016-06-30 Thread manish jaiswal
-- Forwarded message --
From: "manish jaiswal" <manishsr...@gmail.com>
Date: Jun 30, 2016 17:35
Subject: HiveContext
To: <user@spark.apache.org>, <user-subscr...@spark.apache.org>, <
user-h...@spark.apache.org>
Cc:

Hi,


I am new to Spark. I found that using HiveContext we can connect to Hive and
run HiveQL queries. I ran it and it worked.

My doubt is about when we are using hiveContext and run a Hive query like
(select distinct column from table).

How will it perform? Will it take all the data stored in HDFS into the Spark
engine (memory) and perform (select distinct column from table), or will it
hand the query to Hive and get the result back from Hive?



Thanks


Re: hivecontext error

2016-06-14 Thread Ted Yu
Which release of Spark are you using ?

Can you show the full error trace ?

Thanks

On Tue, Jun 14, 2016 at 6:33 PM, Tejaswini Buche <
tejaswini.buche0...@gmail.com> wrote:

> I am trying to use hivecontext in spark. The following statements are
> running fine :
>
> from pyspark.sql import HiveContext
> sqlContext = HiveContext(sc)
>
> But, when i run the below statement,
>
> sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
>
> I get the following error :
>
> Java Package object not callable
>
> what could be the problem?
> thnx
>


hivecontext error

2016-06-14 Thread Tejaswini Buche
I am trying to use hivecontext in spark. The following statements are
running fine :

from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)

But, when i run the below statement,

sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")

I get the following error :

Java Package object not callable

what could be the problem?
thnx


Re: HiveContext: Unable to load AWS credentials from any provider in the chain

2016-06-10 Thread Daniel Haviv
I'm using EC2 instances 

Thank you.
Daniel

> On 9 Jun 2016, at 16:49, Gourav Sengupta  wrote:
> 
> Hi,
> 
> are you using EC2 instances or local cluster behind firewall.
> 
> 
> Regards,
> Gourav Sengupta
> 
>> On Wed, Jun 8, 2016 at 4:34 PM, Daniel Haviv 
>>  wrote:
>> Hi,
>> I'm trying to create a table on s3a but I keep hitting the following error:
>> Exception in thread "main" org.apache.hadoop.hive.ql.metadata.HiveException: 
>> MetaException(message:com.cloudera.com.amazonaws.AmazonClientException: 
>> Unable to load AWS credentials from any provider in the chain)
>>  
>> I tried setting the s3a keys using the configuration object but I might be 
>> hitting SPARK-11364 :
>> conf.set("fs.s3a.access.key", accessKey)
>> conf.set("fs.s3a.secret.key", secretKey)
>> conf.set("spark.hadoop.fs.s3a.access.key",accessKey)
>> conf.set("spark.hadoop.fs.s3a.secret.key",secretKey)
>> val sc = new SparkContext(conf)
>>  
>> I tried setting these propeties in hdfs-site.xml but i'm still getting this 
>> error.
>> Finally I tried to set the AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY 
>> environment variables but with no luck.
>>  
>> Any ideas on how to resolve this issue ?
>>  
>> Thank you.
>> Daniel
>> 
>> Thank you.
>> Daniel
> 


Re: HiveContext: Unable to load AWS credentials from any provider in the chain

2016-06-09 Thread Gourav Sengupta
Hi,

are you using EC2 instances or local cluster behind firewall.


Regards,
Gourav Sengupta

On Wed, Jun 8, 2016 at 4:34 PM, Daniel Haviv <
daniel.ha...@veracity-group.com> wrote:

> Hi,
>
> I'm trying to create a table on s3a but I keep hitting the following error:
>
> Exception in thread "main"
> org.apache.hadoop.hive.ql.metadata.HiveException: 
> MetaException(*message:com.cloudera.com.amazonaws.AmazonClientException:
> Unable to load AWS credentials from any provider in the chain*)
>
>
>
> I tried setting the s3a keys using the configuration object but I might be
> hitting SPARK-11364  :
>
> conf.set("fs.s3a.access.key", accessKey)
> conf.set("fs.s3a.secret.key", secretKey)
> conf.set("spark.hadoop.fs.s3a.access.key",accessKey)
> conf.set("spark.hadoop.fs.s3a.secret.key",secretKey)
>
> val sc = new SparkContext(conf)
>
>
>
> I tried setting these propeties in hdfs-site.xml but i'm still getting
> this error.
>
> Finally I tried to set the AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY
> environment variables but with no luck.
>
>
>
> Any ideas on how to resolve this issue ?
>
>
>
> Thank you.
>
> Daniel
>
> Thank you.
> Daniel
>


Re: HiveContext: Unable to load AWS credentials from any provider in the chain

2016-06-09 Thread Steve Loughran

On 9 Jun 2016, at 06:17, Daniel Haviv 
> wrote:

Hi,
I've set these properties both in core-site.xml and hdfs-site.xml with no luck.

Thank you.
Daniel


That's not good.

I'm afraid I don't know what version of s3a is in the Cloudera release; I can see
that the Amazon stuff has been shaded, but I don't know about the Hadoop side
and its auth.

One thing: can you try using s3n rather than s3a. I do think s3a is now better 
(and will be *really* good soon), but as s3n has been around for a long time, 
it's the baseline for functionality.

And I've just created some homework to do better logging of what's going on in
the s3a driver, though that bit of startup code in Spark might interfere.
https://issues.apache.org/jira/browse/HADOOP-13252


There's not much else i can do I'm afraid, not without patching your hadoop 
source and rebuilding things

-Steve






Re: HiveContext: Unable to load AWS credentials from any provider in the chain

2016-06-08 Thread Daniel Haviv
Hi,
I've set these properties both in core-site.xml and hdfs-site.xml with no luck.

Thank you.
Daniel

> On 9 Jun 2016, at 01:11, Steve Loughran  wrote:
> 
> 
>> On 8 Jun 2016, at 16:34, Daniel Haviv  
>> wrote:
>> 
>> Hi,
>> I'm trying to create a table on s3a but I keep hitting the following error:
>> Exception in thread "main" org.apache.hadoop.hive.ql.metadata.HiveException: 
>> MetaException(message:com.cloudera.com.amazonaws.AmazonClientException: 
>> Unable to load AWS credentials from any provider in the chain)
>>  
>> I tried setting the s3a keys using the configuration object but I might be 
>> hitting SPARK-11364 :
>> conf.set("fs.s3a.access.key", accessKey)
>> conf.set("fs.s3a.secret.key", secretKey)
>> conf.set("spark.hadoop.fs.s3a.access.key",accessKey)
>> conf.set("spark.hadoop.fs.s3a.secret.key",secretKey)
>> val sc = new SparkContext(conf)
>>  
>> I tried setting these propeties in hdfs-site.xml but i'm still getting this 
>> error.
> 
> 
> 
> try core-site.xml rather than hdfs-site.xml; the latter only gets loaded when 
> an HdfsConfiguration() instance is created; it may be a bit too late.
> 
>> Finally I tried to set the AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY 
>> environment variables but with no luck.
> 
> Those env vars aren't picked up directly by S3a (well, that was fixed over 
> the weekend https://issues.apache.org/jira/browse/HADOOP-12807  ); There's 
> some fixup in spark ( see 
> SparkHadoopUtil.appendS3AndSparkHadoopConfigurations() ); I don't know if 
> that is a factor; 
> 
>> Any ideas on how to resolve this issue ?
>>  
>> Thank you.
>> Daniel
>> 
>> Thank you.
>> Daniel
> 


Re: HiveContext: Unable to load AWS credentials from any provider in the chain

2016-06-08 Thread Steve Loughran

On 8 Jun 2016, at 16:34, Daniel Haviv 
> wrote:

Hi,
I'm trying to create a table on s3a but I keep hitting the following error:
Exception in thread "main" org.apache.hadoop.hive.ql.metadata.HiveException: 
MetaException(message:com.cloudera.com.amazonaws.AmazonClientException: Unable 
to load AWS credentials from any provider in the chain)



I tried setting the s3a keys using the configuration object but I might be 
hitting SPARK-11364 :

conf.set("fs.s3a.access.key", accessKey)
conf.set("fs.s3a.secret.key", secretKey)
conf.set("spark.hadoop.fs.s3a.access.key",accessKey)
conf.set("spark.hadoop.fs.s3a.secret.key",secretKey)

val sc = new SparkContext(conf)



I tried setting these propeties in hdfs-site.xml but i'm still getting this 
error.



try core-site.xml rather than hdfs-site.xml; the latter only gets loaded when 
an HdfsConfiguration() instance is created; it may be a bit too late.

Finally I tried to set the AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY environment 
variables but with no luck.




Those env vars aren't picked up directly by S3a (well, that was fixed over the 
weekend https://issues.apache.org/jira/browse/HADOOP-12807  ); There's some 
fixup in spark ( see SparkHadoopUtil.appendS3AndSparkHadoopConfigurations() ); 
I don't know if that is a factor;
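
One more thing that may be worth a try (my assumption, not something mentioned above): set the keys on the SparkContext's Hadoop configuration once the context is up, since the s3a filesystem picks that configuration up when it is first created.

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("S3aCredsSketch"))
// Placeholders: pull the keys from wherever you keep them securely.
val accessKey = sys.env.getOrElse("AWS_ACCESS_KEY_ID", "")
val secretKey = sys.env.getOrElse("AWS_SECRET_ACCESS_KEY", "")
sc.hadoopConfiguration.set("fs.s3a.access.key", accessKey)
sc.hadoopConfiguration.set("fs.s3a.secret.key", secretKey)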

Any ideas on how to resolve this issue ?



Thank you.
Daniel

Thank you.
Daniel



Re: When queried through hiveContext, does hive executes these queries using its execution engine (default is map-reduce), or spark just reads the data and performs those queries itself?

2016-06-08 Thread lalit sharma
To add to what Vikash said above, a bit more on the internals:
1. There are 2 components which work together to achieve Hive + Spark
integration:
   a. HiveContext, which extends SQLContext and adds Hive-specific logic,
e.g. loading the jars needed to talk to the underlying metastore db and
loading the configs in hive-site.xml.
   b. HiveThriftServer2, which uses the native HiveServer2 and adds logic for
creating sessions and handling operations.
2. Once the thrift server is up, authentication and session management are all
delegated to Hive classes. Once parsing of the query is done, a logical plan
is created and passed on to create a DataFrame.

So no MapReduce; Spark intelligently uses the needed pieces from Hive and uses
its own execution engine.

--Regards,
Lalit

On Wed, Jun 8, 2016 at 9:59 PM, Vikash Pareek <vikash.par...@infoobjects.com
> wrote:

> Himanshu,
>
> Spark doesn't use the Hive execution engine (MapReduce) to execute the query.
> Spark only reads the metadata from the Hive metastore db and executes the
> query within the Spark execution engine. This metadata is used by Spark's own
> SQL execution engine (this includes components such as Catalyst and Tungsten,
> which optimize queries) to execute the query and generate results faster than
> Hive (MapReduce).
>
> Using HiveContext means connecting to the Hive metastore db. Thus, HiveContext
> can access Hive metadata, and that metadata includes the location of the data,
> serializations and de-serializations, compression codecs, columns, datatypes,
> etc. Spark therefore has enough information about the Hive tables and their
> data to understand the target data and execute the query over its own
> execution engine.
>
> Overall, Spark replaced the MapReduce model completely with its in-memory
> (RDD) computation engine.
>
> - Vikash Pareek
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/When-queried-through-hiveContext-does-hive-executes-these-queries-using-its-execution-engine-default-tp27114p27117.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: When queried through hiveContext, does hive executes these queries using its execution engine (default is map-reduce), or spark just reads the data and performs those queries itself?

2016-06-08 Thread Vikash Pareek
Himanshu,

Spark doesn't use the Hive execution engine (MapReduce) to execute the query.
Spark only reads the metadata from the Hive metastore db and executes the
query within the Spark execution engine. This metadata is used by Spark's own
SQL execution engine (this includes components such as Catalyst and Tungsten,
which optimize queries) to execute the query and generate results faster than
Hive (MapReduce).

Using HiveContext means connecting to the Hive metastore db. Thus, HiveContext
can access Hive metadata, and that metadata includes the location of the data,
serializations and de-serializations, compression codecs, columns, datatypes,
etc. Spark therefore has enough information about the Hive tables and their
data to understand the target data and execute the query over its own
execution engine.

Overall, Spark replaced the MapReduce model completely with its in-memory
(RDD) computation engine.

- Vikash Pareek
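
A quick way to see this for yourself (the table and column names below are placeholders): the physical plan printed by explain() shows Spark operators, not a MapReduce job.

import org.apache.spark.sql.hive.HiveContext

// sc is an existing SparkContext
val hiveContext = new HiveContext(sc)
hiveContext.sql("SELECT DISTINCT some_column FROM some_db.some_table").explain(true)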



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/When-queried-through-hiveContext-does-hive-executes-these-queries-using-its-execution-engine-default-tp27114p27117.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



HiveContext: Unable to load AWS credentials from any provider in the chain

2016-06-08 Thread Daniel Haviv
Hi,
I'm trying to create a table on s3a but I keep hitting the following error:
Exception in thread "main" org.apache.hadoop.hive.ql.metadata.HiveException: 
MetaException(message:com.cloudera.com.amazonaws.AmazonClientException: Unable 
to load AWS credentials from any provider in the chain)
 
I tried setting the s3a keys using the configuration object but I might be 
hitting SPARK-11364 :
conf.set("fs.s3a.access.key", accessKey)
conf.set("fs.s3a.secret.key", secretKey)
conf.set("spark.hadoop.fs.s3a.access.key",accessKey)
conf.set("spark.hadoop.fs.s3a.secret.key",secretKey)
val sc = new SparkContext(conf)
 
I tried setting these propeties in hdfs-site.xml but i'm still getting this 
error.
Finally I tried to set the AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY environment 
variables but with no luck.
 
Any ideas on how to resolve this issue ?
 
Thank you.
Daniel

Thank you.
Daniel

When queried through hiveContext, does hive executes these queries using its execution engine (default is map-reduce), or spark just reads the data and performs those queries itself?

2016-06-08 Thread Himanshu Mehra
So what happens underneath when we query a Hive table using hiveContext?

1. Does Spark talk to the metastore to get the data location on HDFS and read
the data from there to perform those queries?
2. Or does Spark pass those queries to Hive, and Hive executes them on the
table and returns the results to Spark? In this case, might Hive be using
map-reduce to execute the queries?

Please clarify this confusion. I have looked into the code and it seems like
Spark is just fetching the data from HDFS. Please convince me otherwise.

Thanks

Best



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/When-queried-through-hiveContext-does-hive-executes-these-queries-using-its-execution-engine-default-tp27114.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: hivecontext and date format

2016-06-01 Thread Mich Talebzadeh
Try this

SELECT
TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP(paymentdate,'dd/MM/yyyy'),'yyyy-MM-dd')) AS paymentdate
FROM

HTH
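
Run from Scala with a HiveContext it would look like this (the table name is a placeholder):

// given an existing SparkContext sc
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val df = hiveContext.sql(
  """SELECT TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP(paymentdate,'dd/MM/yyyy'),'yyyy-MM-dd')) AS paymentdate
    |FROM some_db.payments""".stripMargin)
df.printSchema()   // inspect the resulting type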

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 1 June 2016 at 12:16, pseudo oduesp <pseudo20...@gmail.com> wrote:

> Hi ,
> can I ask how we can convert a string like dd/mm/yyyy to a date type in
> hivecontext?
>
> I tried with unix_timestamp and with format date but I get null.
> thank you.
>


hivecontext and date format

2016-06-01 Thread pseudo oduesp
Hi ,
can I ask how we can convert a string like dd/mm/yyyy to a date type in
hivecontext?

I tried with unix_timestamp and with format date but I get null.
thank you.


Re: HiveContext standalone => without a Hive metastore

2016-05-30 Thread Michael Segel
Going from memory… Derby is/was Cloudscape which IBM acquired from Informix who 
bought the company way back when.  (Since IBM released it under Apache 
licensing, Sun Microsystems took it and created JavaDB…) 

I believe there is a network server function, so you can bring it up either in 
standalone (embedded) mode or in network mode, which allows simultaneous network 
connections (multi-user).

If not you can always go MySQL.

HTH

> On May 26, 2016, at 1:36 PM, Mich Talebzadeh <mich.talebza...@gmail.com> 
> wrote:
> 
> Well make sure that you set up a reasonable RDBMS as metastore. Ours is 
> Oracle but you can get away with others. Check the supported list in
> 
> hduser@rhes564:: :/usr/lib/hive/scripts/metastore/upgrade> ltr
> total 40
> drwxr-xr-x 2 hduser hadoop 4096 Feb 21 23:48 postgres
> drwxr-xr-x 2 hduser hadoop 4096 Feb 21 23:48 mysql
> drwxr-xr-x 2 hduser hadoop 4096 Feb 21 23:48 mssql
> drwxr-xr-x 2 hduser hadoop 4096 Feb 21 23:48 derby
> drwxr-xr-x 3 hduser hadoop 4096 May 20 18:44 oracle
> 
> you have few good ones in the list.  In general the base tables (without 
> transactional support) are around 55  (Hive 2) and don't take much space 
> (depending on the volume of tables). I attached a E-R diagram.
> 
> HTH
> 
> 
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>  
> 
> On 26 May 2016 at 19:09, Gerard Maas <gerard.m...@gmail.com 
> <mailto:gerard.m...@gmail.com>> wrote:
> Thanks a lot for the advice!. 
> 
> I found out why the standalone hiveContext would not work: it was trying to 
> deploy a Derby db and the user had no rights to create the dir where the db 
> is stored:
> 
> Caused by: java.sql.SQLException: Failed to create database 'metastore_db', 
> see the next exception for details.
> 
>at 
> org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
> 
>at 
> org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown
>  Source)
> 
>... 129 more
> 
> Caused by: java.sql.SQLException: Directory 
> /usr/share/spark-notebook/metastore_db cannot be created.
> 
> 
> 
> Now, the new issue is that we can't start more than 1 context at the same 
> time. I think we will need to setup a proper metastore.
> 
> 
> 
> -kind regards, Gerard.
> 
> 
> 
> 
> 
> On Thu, May 26, 2016 at 3:06 PM, Mich Talebzadeh <mich.talebza...@gmail.com 
> <mailto:mich.talebza...@gmail.com>> wrote:
> To use HiveContext witch is basically an sql api within Spark without proper 
> hive set up does not make sense. It is a super set of Spark SQLContext
> 
> In addition simple things like registerTempTable may not work.
> 
> HTH
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>  
> 
> On 26 May 2016 at 13:01, Silvio Fiorito <silvio.fior...@granturing.com 
> <mailto:silvio.fior...@granturing.com>> wrote:
> Hi Gerard,
> 
>  
> 
> I’ve never had an issue using the HiveContext without a hive-site.xml 
> configured. However, one issue you may have is if multiple users are starting 
> the HiveContext from the same path, they’ll all be trying to store the 
> default Derby metastore in the same location. Also, if you want them to be 
> able to persist permanent table metadata for SparkSQL then you’ll want to set 
> up a true metastore.
> 
>  
> 
> The other thing it could be is Hive dependency collisions from the classpath, 
> but that shouldn’t be an issue since you said it’s standalone (not a Hadoop 
> distro right?).
> 
>  
> 
> Thanks,
> 
> Silvio
> 
>  
> 
> From: Gerard Maas <gerard.m...@gmail.com <mailto:gerard.m...@gmail.com>>
> Date: Thursday, May 26, 2016 at 5:28 AM
> To: spark users <user@spark.apache.org <mailto:user@spark.apache.org>>
> Subject: HiveContext standalone => without a Hive metastore
> 
>  
> 
> Hi,
> 
>  
> 
> I'm helping some folks setting up an analytics cluster with  Spark.
> 
> They want to use the HiveContext to enable the Window functions on 
> DataFrames(*) but they don't have any Hive installation, nor they need one at 
the moment (if not necessary for this feature)

Re: HiveContext standalone => without a Hive metastore

2016-05-30 Thread Gerard Maas
Michael,  Mitch, Silvio,

Thanks!

The own directory is indeed the issue. We are running the Spark Notebook, which
uses the same dir per server (i.e. for all notebooks), so this issue
prevents us from running 2 notebooks using HiveContext.
I'll look into a proper Hive installation, and I'm glad to know that this
dependency is gone in 2.0.
Look forward to 2.1 :-) ;-)

-kr, Gerard.


On Thu, May 26, 2016 at 10:55 PM, Michael Armbrust <mich...@databricks.com>
wrote:

> You can also just make sure that each user is using their own directory.
> A rough example can be found in TestHive.
>
> Note: in Spark 2.0 there should be no need to use HiveContext unless you
> need to talk to a metastore.
>
> On Thu, May 26, 2016 at 1:36 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Well make sure than you set up a reasonable RDBMS as metastore. Ours is
>> Oracle but you can get away with others. Check the supported list in
>>
>> hduser@rhes564:: :/usr/lib/hive/scripts/metastore/upgrade> ltr
>> total 40
>> drwxr-xr-x 2 hduser hadoop 4096 Feb 21 23:48 postgres
>> drwxr-xr-x 2 hduser hadoop 4096 Feb 21 23:48 mysql
>> drwxr-xr-x 2 hduser hadoop 4096 Feb 21 23:48 mssql
>> drwxr-xr-x 2 hduser hadoop 4096 Feb 21 23:48 derby
>> drwxr-xr-x 3 hduser hadoop 4096 May 20 18:44 oracle
>>
>> you have few good ones in the list.  In general the base tables (without
>> transactional support) are around 55  (Hive 2) and don't take much space
>> (depending on the volume of tables). I attached a E-R diagram.
>>
>> HTH
>>
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 26 May 2016 at 19:09, Gerard Maas <gerard.m...@gmail.com> wrote:
>>
>>> Thanks a lot for the advice!.
>>>
>>> I found out why the standalone hiveContext would not work:  it was
>>> trying to deploy a derby db and the user had no rights to create the dir
>>> where there db is stored:
>>>
>>> Caused by: java.sql.SQLException: Failed to create database
>>> 'metastore_db', see the next exception for details.
>>>
>>>at
>>> org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown
>>> Source)
>>>
>>>at
>>> org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown
>>> Source)
>>>
>>>... 129 more
>>>
>>> Caused by: java.sql.SQLException: Directory
>>> /usr/share/spark-notebook/metastore_db cannot be created.
>>>
>>>
>>> Now, the new issue is that we can't start more than 1 context at the
>>> same time. I think we will need to setup a proper metastore.
>>>
>>>
>>> -kind regards, Gerard.
>>>
>>>
>>>
>>>
>>> On Thu, May 26, 2016 at 3:06 PM, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> To use HiveContext witch is basically an sql api within Spark without
>>>> proper hive set up does not make sense. It is a super set of Spark
>>>> SQLContext
>>>>
>>>> In addition simple things like registerTempTable may not work.
>>>>
>>>> HTH
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>>
>>>> LinkedIn * 
>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>>
>>>>
>>>> On 26 May 2016 at 13:01, Silvio Fiorito <silvio.fior...@granturing.com>
>>>> wrote:
>>>>
>>>>> Hi Gerard,
>>>>>
>>>>>
>>>>>
>>>>> I’ve never had an issue using the HiveContext without a hive-site.xml
>>>>> configured. However, one issue you may have is if multiple users are
>>>>> starting the HiveContext from the same path, they’ll all be trying to 
>>>>> store
>>>>> the default Derby metastore in the same location. Also, if you want them 
>>>>> to
>>>>> be able to persist permanent table metadata for SparkSQL then you’ll want
>>>>> to set up a true metastore.

Re: HiveContext standalone => without a Hive metastore

2016-05-26 Thread Michael Armbrust
You can also just make sure that each user is using their own directory.  A
rough example can be found in TestHive.

Note: in Spark 2.0 there should be no need to use HiveContext unless you
need to talk to a metastore.
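
A rough per-user sketch of that idea (the property names are the standard Hive/Derby ones; the paths are invented, and setting them through setConf before the first query is an assumption modelled loosely on what TestHive does):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val user = sys.props("user.name")
val sc = new SparkContext(new SparkConf().setAppName("per-user-metastore"))
val hiveContext = new HiveContext(sc)

// Give each user their own embedded-Derby metastore and warehouse location,
// so concurrent notebooks do not fight over a shared metastore_db directory.
hiveContext.setConf("javax.jdo.option.ConnectionURL",
  s"jdbc:derby:;databaseName=/tmp/$user/metastore_db;create=true")
hiveContext.setConf("hive.metastore.warehouse.dir", s"/tmp/$user/warehouse")

hiveContext.sql("SHOW TABLES").show()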

On Thu, May 26, 2016 at 1:36 PM, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Well make sure than you set up a reasonable RDBMS as metastore. Ours is
> Oracle but you can get away with others. Check the supported list in
>
> hduser@rhes564:: :/usr/lib/hive/scripts/metastore/upgrade> ltr
> total 40
> drwxr-xr-x 2 hduser hadoop 4096 Feb 21 23:48 postgres
> drwxr-xr-x 2 hduser hadoop 4096 Feb 21 23:48 mysql
> drwxr-xr-x 2 hduser hadoop 4096 Feb 21 23:48 mssql
> drwxr-xr-x 2 hduser hadoop 4096 Feb 21 23:48 derby
> drwxr-xr-x 3 hduser hadoop 4096 May 20 18:44 oracle
>
> you have few good ones in the list.  In general the base tables (without
> transactional support) are around 55  (Hive 2) and don't take much space
> (depending on the volume of tables). I attached a E-R diagram.
>
> HTH
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 26 May 2016 at 19:09, Gerard Maas <gerard.m...@gmail.com> wrote:
>
>> Thanks a lot for the advice!.
>>
>> I found out why the standalone hiveContext would not work:  it was trying
>> to deploy a derby db and the user had no rights to create the dir where
>> there db is stored:
>>
>> Caused by: java.sql.SQLException: Failed to create database
>> 'metastore_db', see the next exception for details.
>>
>>at
>> org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown
>> Source)
>>
>>at
>> org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown
>> Source)
>>
>>... 129 more
>>
>> Caused by: java.sql.SQLException: Directory
>> /usr/share/spark-notebook/metastore_db cannot be created.
>>
>>
>> Now, the new issue is that we can't start more than 1 context at the same
>> time. I think we will need to setup a proper metastore.
>>
>>
>> -kind regards, Gerard.
>>
>>
>>
>>
>> On Thu, May 26, 2016 at 3:06 PM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> To use HiveContext witch is basically an sql api within Spark without
>>> proper hive set up does not make sense. It is a super set of Spark
>>> SQLContext
>>>
>>> In addition simple things like registerTempTable may not work.
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 26 May 2016 at 13:01, Silvio Fiorito <silvio.fior...@granturing.com>
>>> wrote:
>>>
>>>> Hi Gerard,
>>>>
>>>>
>>>>
>>>> I’ve never had an issue using the HiveContext without a hive-site.xml
>>>> configured. However, one issue you may have is if multiple users are
>>>> starting the HiveContext from the same path, they’ll all be trying to store
>>>> the default Derby metastore in the same location. Also, if you want them to
>>>> be able to persist permanent table metadata for SparkSQL then you’ll want
>>>> to set up a true metastore.
>>>>
>>>>
>>>>
>>>> The other thing it could be is Hive dependency collisions from the
>>>> classpath, but that shouldn’t be an issue since you said it’s standalone
>>>> (not a Hadoop distro right?).
>>>>
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Silvio
>>>>
>>>>
>>>>
>>>> *From: *Gerard Maas <gerard.m...@gmail.com>
>>>> *Date: *Thursday, May 26, 2016 at 5:28 AM
>>>> *To: *spark users <user@spark.apache.org>
>>>> *Subject: *HiveContext standalone => without a Hive metastore
>>>>
>>>>
>>>>
>>>> Hi,
>>>>
>>>>
>>>>
>>>> I'm helping some folks setting up an analytics cluster with Spark.

Re: Problem instantiation of HiveContext

2016-05-26 Thread Ian
The exception indicates that Spark cannot invoke the method it's trying to
call, which is probably caused by a missing library. Do you have a Hive
configuration (hive-site.xml) or similar in your $SPARK_HOME/conf folder?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Problem-instantiation-of-HiveContext-tp26999p27035.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: HiveContext standalone => without a Hive metastore

2016-05-26 Thread Gerard Maas
Thanks a lot for the advice!.

I found out why the standalone hiveContext would not work:  it was trying
to deploy a derby db and the user had no rights to create the dir where
there db is stored:

Caused by: java.sql.SQLException: Failed to create database 'metastore_db',
see the next exception for details.

   at
org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown
Source)

   at
org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown
Source)

   ... 129 more

Caused by: java.sql.SQLException: Directory
/usr/share/spark-notebook/metastore_db cannot be created.


Now, the new issue is that we can't start more than 1 context at the same
time. I think we will need to setup a proper metastore.


-kind regards, Gerard.




On Thu, May 26, 2016 at 3:06 PM, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> To use HiveContext witch is basically an sql api within Spark without
> proper hive set up does not make sense. It is a super set of Spark
> SQLContext
>
> In addition simple things like registerTempTable may not work.
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 26 May 2016 at 13:01, Silvio Fiorito <silvio.fior...@granturing.com>
> wrote:
>
>> Hi Gerard,
>>
>>
>>
>> I’ve never had an issue using the HiveContext without a hive-site.xml
>> configured. However, one issue you may have is if multiple users are
>> starting the HiveContext from the same path, they’ll all be trying to store
>> the default Derby metastore in the same location. Also, if you want them to
>> be able to persist permanent table metadata for SparkSQL then you’ll want
>> to set up a true metastore.
>>
>>
>>
>> The other thing it could be is Hive dependency collisions from the
>> classpath, but that shouldn’t be an issue since you said it’s standalone
>> (not a Hadoop distro right?).
>>
>>
>>
>> Thanks,
>>
>> Silvio
>>
>>
>>
>> *From: *Gerard Maas <gerard.m...@gmail.com>
>> *Date: *Thursday, May 26, 2016 at 5:28 AM
>> *To: *spark users <user@spark.apache.org>
>> *Subject: *HiveContext standalone => without a Hive metastore
>>
>>
>>
>> Hi,
>>
>>
>>
>> I'm helping some folks setting up an analytics cluster with  Spark.
>>
>> They want to use the HiveContext to enable the Window functions on
>> DataFrames(*) but they don't have any Hive installation, nor they need one
>> at the moment (if not necessary for this feature)
>>
>>
>>
>> When we try to create a Hive context, we get the following error:
>>
>>
>>
>> > val sqlContext = new org.apache.spark.sql.hive.HiveContext(sparkContext)
>>
>> java.lang.RuntimeException: java.lang.RuntimeException: Unable to
>> instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
>>
>>at
>> org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
>>
>>
>>
>> Is my HiveContext failing b/c it wants to connect to an unconfigured
>>  Hive Metastore?
>>
>>
>>
>> Is there  a way to instantiate a HiveContext for the sake of Window
>> support without an underlying Hive deployment?
>>
>>
>>
>> The docs are explicit in saying that that is should be the case: [1]
>>
>>
>>
>> "To use a HiveContext, you do not need to have an existing Hive setup,
>> and all of the data sources available to aSQLContext are still
>> available. HiveContext is only packaged separately to avoid including
>> all of Hive’s dependencies in the default Spark build."
>>
>>
>>
>> So what is the right way to address this issue? How to instantiate a
>> HiveContext with spark running on a HDFS cluster without Hive deployed?
>>
>>
>>
>>
>>
>> Thanks a lot!
>>
>>
>>
>> -Gerard.
>>
>>
>>
>> (*) The need for a HiveContext to use Window functions is pretty obscure.
>> The only documentation of this seems to be a runtime exception: 
>> "org.apache.spark.sql.AnalysisException:
>> Could not resolve window function 'max'. Note that, using window functions
>> currently requires a HiveContext;"
>>
>>
>>
>> [1]
>> http://spark.apache.org/docs/latest/sql-programming-guide.html#getting-started
>>
>
>


Re: HiveContext standalone => without a Hive metastore

2016-05-26 Thread Mich Talebzadeh
Using HiveContext, which is basically an SQL API within Spark, without a
proper Hive setup does not make sense. It is a superset of Spark's
SQLContext.

In addition, simple things like registerTempTable may not work.

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 26 May 2016 at 13:01, Silvio Fiorito <silvio.fior...@granturing.com>
wrote:

> Hi Gerard,
>
>
>
> I’ve never had an issue using the HiveContext without a hive-site.xml
> configured. However, one issue you may have is if multiple users are
> starting the HiveContext from the same path, they’ll all be trying to store
> the default Derby metastore in the same location. Also, if you want them to
> be able to persist permanent table metadata for SparkSQL then you’ll want
> to set up a true metastore.
>
>
>
> The other thing it could be is Hive dependency collisions from the
> classpath, but that shouldn’t be an issue since you said it’s standalone
> (not a Hadoop distro right?).
>
>
>
> Thanks,
>
> Silvio
>
>
>
> *From: *Gerard Maas <gerard.m...@gmail.com>
> *Date: *Thursday, May 26, 2016 at 5:28 AM
> *To: *spark users <user@spark.apache.org>
> *Subject: *HiveContext standalone => without a Hive metastore
>
>
>
> Hi,
>
>
>
> I'm helping some folks setting up an analytics cluster with  Spark.
>
> They want to use the HiveContext to enable the Window functions on
> DataFrames(*) but they don't have any Hive installation, nor they need one
> at the moment (if not necessary for this feature)
>
>
>
> When we try to create a Hive context, we get the following error:
>
>
>
> > val sqlContext = new org.apache.spark.sql.hive.HiveContext(sparkContext)
>
> java.lang.RuntimeException: java.lang.RuntimeException: Unable to
> instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
>
>at
> org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
>
>
>
> Is my HiveContext failing b/c it wants to connect to an unconfigured  Hive
> Metastore?
>
>
>
> Is there  a way to instantiate a HiveContext for the sake of Window
> support without an underlying Hive deployment?
>
>
>
> The docs are explicit in saying that that is should be the case: [1]
>
>
>
> "To use a HiveContext, you do not need to have an existing Hive setup,
> and all of the data sources available to aSQLContext are still available.
> HiveContext is only packaged separately to avoid including all of Hive’s
> dependencies in the default Spark build."
>
>
>
> So what is the right way to address this issue? How to instantiate a
> HiveContext with spark running on a HDFS cluster without Hive deployed?
>
>
>
>
>
> Thanks a lot!
>
>
>
> -Gerard.
>
>
>
> (*) The need for a HiveContext to use Window functions is pretty obscure.
> The only documentation of this seems to be a runtime exception: 
> "org.apache.spark.sql.AnalysisException:
> Could not resolve window function 'max'. Note that, using window functions
> currently requires a HiveContext;"
>
>
>
> [1]
> http://spark.apache.org/docs/latest/sql-programming-guide.html#getting-started
>


Re: HiveContext standalone => without a Hive metastore

2016-05-26 Thread Silvio Fiorito
Hi Gerard,

I’ve never had an issue using the HiveContext without a hive-site.xml 
configured. However, one issue you may have is if multiple users are starting 
the HiveContext from the same path, they’ll all be trying to store the default 
Derby metastore in the same location. Also, if you want them to be able to 
persist permanent table metadata for SparkSQL then you’ll want to set up a true 
metastore.

The other thing it could be is Hive dependency collisions from the classpath, 
but that shouldn’t be an issue since you said it’s standalone (not a Hadoop 
distro right?).

Thanks,
Silvio

From: Gerard Maas <gerard.m...@gmail.com>
Date: Thursday, May 26, 2016 at 5:28 AM
To: spark users <user@spark.apache.org>
Subject: HiveContext standalone => without a Hive metastore

Hi,

I'm helping some folks setting up an analytics cluster with  Spark.
They want to use the HiveContext to enable the Window functions on 
DataFrames(*) but they don't have any Hive installation, nor they need one at 
the moment (if not necessary for this feature)

When we try to create a Hive context, we get the following error:

> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sparkContext)
java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate 
org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
   at 
org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)

Is my HiveContext failing b/c it wants to connect to an unconfigured  Hive 
Metastore?

Is there  a way to instantiate a HiveContext for the sake of Window support 
without an underlying Hive deployment?

The docs are explicit in saying that that is should be the case: [1]

"To use a HiveContext, you do not need to have an existing Hive setup, and all 
of the data sources available to aSQLContext are still available. HiveContext 
is only packaged separately to avoid including all of Hive’s dependencies in 
the default Spark build."

So what is the right way to address this issue? How to instantiate a 
HiveContext with spark running on a HDFS cluster without Hive deployed?


Thanks a lot!

-Gerard.

(*) The need for a HiveContext to use Window functions is pretty obscure. The 
only documentation of this seems to be a runtime exception: 
"org.apache.spark.sql.AnalysisException: Could not resolve window function 
'max'. Note that, using window functions currently requires a HiveContext;"

[1] 
http://spark.apache.org/docs/latest/sql-programming-guide.html#getting-started


Re: HiveContext standalone => without a Hive metastore

2016-05-26 Thread Mich Talebzadeh
Hi Gerard,

I am not sure the so-called independence will work. I gather you want to use
HiveContext for your SQL queries, and SQLContext only provides a subset of
HiveContext's functionality.

try this

  val sc = new SparkContext(conf)
  // Create sqlContext based on HiveContext
  val sqlContext = new HiveContext(sc)


However, it will take 3 minutes to set up Hive; all you need is to add a
softlink from $SPARK_HOME/conf to hive-site.xml

hive-site.xml -> /usr/lib/hive/conf/hive-site.xml

The fact that it is not working shows that the statement in the doc may not be
valid.

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 26 May 2016 at 10:28, Gerard Maas <gerard.m...@gmail.com> wrote:

> Hi,
>
> I'm helping some folks setting up an analytics cluster with  Spark.
> They want to use the HiveContext to enable the Window functions on
> DataFrames(*) but they don't have any Hive installation, nor they need one
> at the moment (if not necessary for this feature)
>
> When we try to create a Hive context, we get the following error:
>
> > val sqlContext = new org.apache.spark.sql.hive.HiveContext(sparkContext)
>
> java.lang.RuntimeException: java.lang.RuntimeException: Unable to
> instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
>
>at
> org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
>
> Is my HiveContext failing b/c it wants to connect to an unconfigured  Hive
> Metastore?
>
> Is there  a way to instantiate a HiveContext for the sake of Window
> support without an underlying Hive deployment?
>
> The docs are explicit in saying that that is should be the case: [1]
>
> "To use a HiveContext, you do not need to have an existing Hive setup,
> and all of the data sources available to aSQLContext are still available.
> HiveContext is only packaged separately to avoid including all of Hive’s
> dependencies in the default Spark build."
>
> So what is the right way to address this issue? How to instantiate a
> HiveContext with spark running on a HDFS cluster without Hive deployed?
>
>
> Thanks a lot!
>
> -Gerard.
>
> (*) The need for a HiveContext to use Window functions is pretty obscure.
> The only documentation of this seems to be a runtime exception: "
> org.apache.spark.sql.AnalysisException: Could not resolve window function
> 'max'. Note that, using window functions currently requires a HiveContext;"
>
>
> [1]
> http://spark.apache.org/docs/latest/sql-programming-guide.html#getting-started
>


HiveContext standalone => without a Hive metastore

2016-05-26 Thread Gerard Maas
Hi,

I'm helping some folks setting up an analytics cluster with  Spark.
They want to use the HiveContext to enable the Window functions on
DataFrames(*) but they don't have any Hive installation, nor they need one
at the moment (if not necessary for this feature)

When we try to create a Hive context, we get the following error:

> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sparkContext)

java.lang.RuntimeException: java.lang.RuntimeException: Unable to
instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient

   at
org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)

Is my HiveContext failing b/c it wants to connect to an unconfigured  Hive
Metastore?

Is there  a way to instantiate a HiveContext for the sake of Window support
without an underlying Hive deployment?

The docs are explicit in saying that that is should be the case: [1]

"To use a HiveContext, you do not need to have an existing Hive setup, and
all of the data sources available to aSQLContext are still available.
HiveContext is only packaged separately to avoid including all of Hive’s
dependencies in the default Spark build."

So what is the right way to address this issue? How to instantiate a
HiveContext with spark running on a HDFS cluster without Hive deployed?


Thanks a lot!

-Gerard.

(*) The need for a HiveContext to use Window functions is pretty obscure.
The only documentation of this seems to be a runtime exception: "
org.apache.spark.sql.AnalysisException: Could not resolve window function
'max'. Note that, using window functions currently requires a HiveContext;"


[1]
http://spark.apache.org/docs/latest/sql-programming-guide.html#getting-started
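
For completeness, a minimal sketch of the window-function case that motivates the HiveContext here (Spark 1.x DataFrame API; the sample data and names are invented for illustration):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val sc = new SparkContext(new SparkConf().setAppName("window-demo").setMaster("local[2]"))
val hiveContext = new HiveContext(sc)
import hiveContext.implicits._

val df = Seq(("a", 10), ("a", 20), ("b", 5)).toDF("key", "value")
val w = Window.partitionBy($"key").orderBy($"value".desc)

// max over a window is exactly the kind of call that triggers the
// AnalysisException on a plain SQLContext in Spark 1.x.
df.withColumn("max_value", max($"value").over(w)).show()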


Re: SQLContext and HiveContext parse a query string differently ?

2016-05-13 Thread Hao Ren
Basically, I want to run the following query:

select 'a\'b', cast(null as Array)

However, neither HiveContext nor SQLContext can execute it without an
exception.

I have tried

sql("select 'a\'b', cast(null as Array)")

and

df.selectExpr("'a\'b'", "cast(null as Array)")

Neither of them works.

From the exceptions, I find the query is parsed differently.



On Fri, May 13, 2016 at 8:00 AM, Yong Zhang <java8...@hotmail.com> wrote:

> Not sure what do you mean? You want to have one exactly query running fine
> in both sqlContext and HiveContext? The query parser are different, why do
> you want to have this feature? Do I understand your question correctly?
>
> Yong
>
> --
> Date: Thu, 12 May 2016 13:09:34 +0200
> Subject: SQLContext and HiveContext parse a query string differently ?
> From: inv...@gmail.com
> To: user@spark.apache.org
>
>
> HI,
>
> I just want to figure out why the two contexts behavior differently even
> on a simple query.
> In a netshell, I have a query in which there is a String containing single
> quote and casting to Array/Map.
> I have tried all the combination of diff type of sql context and query
> call api (sql, df.select, df.selectExpr).
> I can't find one rules all.
>
> Here is the code for reproducing the problem.
>
> -
>
> import org.apache.spark.sql.SQLContext
> import org.apache.spark.sql.hive.HiveContext
> import org.apache.spark.{SparkConf, SparkContext}
>
> object Test extends App {
>
>   val sc  = new SparkContext("local[2]", "test", new SparkConf)
>   val hiveContext = new HiveContext(sc)
>   val sqlContext  = new SQLContext(sc)
>
>   val context = hiveContext
>   //  val context = sqlContext
>
>   import context.implicits._
>
>   val df = Seq((Seq(1, 2), 2)).toDF("a", "b")
>   df.registerTempTable("tbl")
>   df.printSchema()
>
>   // case 1
>   context.sql("select cast(a as array) from tbl").show()
>   // HiveContext => org.apache.spark.sql.AnalysisException: cannot recognize 
> input near 'array' '<' 'string' in primitive type specification; line 1 pos 17
>   // SQLContext => OK
>
>   // case 2
>   context.sql("select 'a\\'b'").show()
>   // HiveContext => OK
>   // SQLContext => failure: ``union'' expected but ErrorToken(unclosed string 
> literal) found
>
>   // case 3
>   df.selectExpr("cast(a as array)").show() // OK with HiveContext and 
> SQLContext
>
>   // case 4
>   df.selectExpr("'a\\'b'").show() // HiveContext, SQLContext => failure: end 
> of input expected
> }
>
> -
>
> Any clarification / workaround is high appreciated.
>
> --
> Hao Ren
>
> Data Engineer @ leboncoin
>
> Paris, France
>



-- 
Hao Ren

Data Engineer @ leboncoin

Paris, France


Re:Re:Re: Re:Re: Will the HiveContext cause memory leak ?

2016-05-12 Thread kramer2...@126.com
Sorry, the bug link in the previous mail was wrong.


Here is the real link:


http://apache-spark-developers-list.1001551.n3.nabble.com/Re-SQL-Memory-leak-with-spark-streaming-and-spark-sql-in-spark-1-5-1-td14603.html











At 2016-05-13 09:49:05, "李明伟" <kramer2...@126.com> wrote:

It seems we hit the same issue.


There was a bug on 1.5.1 about memory leak. But I am using 1.6.1


Here is the link about the bug in 1.5.1 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark






At 2016-05-12 23:10:43, "Simon Schiff [via Apache Spark User List]" 
<ml-node+s1001560n2694...@n3.nabble.com> wrote:
I read with Spark-Streaming from a Port. The incoming data consists of key and 
value pairs. Then I call forEachRDD on each window. There I create a Dataset 
from the window and do some SQL Querys on it. On the result i only do show, to 
see the content. It works well, but the memory usage increases. When it reaches 
the maximum nothing works anymore. When I use more memory. The Program runs 
some time longer, but the problem persists. Because I run a Programm which 
writes to the Port, I can control perfectly how much Data Spark has to Process. 
When I write every one ms one key and value Pair the Problem is the same as 
when i write only every second a key and value pair to the port.

When I dont create a Dataset in the foreachRDD and only count the Elements in 
the RDD, then everything works fine. I also use groupBy agg functions in the 
querys.


If you reply to this email, your message will be added to the discussion below:
http://apache-spark-user-list.1001560.n3.nabble.com/Will-the-HiveContext-cause-memory-leak-tp26921p26940.html
To unsubscribe from Will the HiveContext cause memory leak ?, click here.
NAML




 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Will-the-HiveContext-cause-memory-leak-tp26921p26947.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Will the HiveContext cause memory leak ?

2016-05-12 Thread Ted Yu
The link below doesn't refer to specific bug. 

Can you send the correct link ?

Thanks 

> On May 12, 2016, at 6:50 PM, "kramer2...@126.com" <kramer2...@126.com> wrote:
> 
> It seems we hit the same issue.
> 
> There was a bug on 1.5.1 about memory leak. But I am using 1.6.1
> 
> Here is the link about the bug in 1.5.1 
> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
> 
> 
> 
> 
> 
> At 2016-05-12 23:10:43, "Simon Schiff [via Apache Spark User List]" <[hidden 
> email]> wrote:
> I read with Spark-Streaming from a Port. The incoming data consists of key 
> and value pairs. Then I call forEachRDD on each window. There I create a 
> Dataset from the window and do some SQL Querys on it. On the result i only do 
> show, to see the content. It works well, but the memory usage increases. When 
> it reaches the maximum nothing works anymore. When I use more memory. The 
> Program runs some time longer, but the problem persists. Because I run a 
> Programm which writes to the Port, I can control perfectly how much Data 
> Spark has to Process. When I write every one ms one key and value Pair the 
> Problem is the same as when i write only every second a key and value pair to 
> the port. 
> 
> When I dont create a Dataset in the foreachRDD and only count the Elements in 
> the RDD, then everything works fine. I also use groupBy agg functions in the 
> querys. 
> 
> If you reply to this email, your message will be added to the discussion 
> below:
> http://apache-spark-user-list.1001560.n3.nabble.com/Will-the-HiveContext-cause-memory-leak-tp26921p26940.html
> To unsubscribe from Will the HiveContext cause memory leak ?, click here.
> NAML
> 
> 
>  
> 
> 
> View this message in context: Re:Re: Re:Re: Will the HiveContext cause memory 
> leak ?
> Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re:Re: Re:Re: Will the HiveContext cause memory leak ?

2016-05-12 Thread kramer2...@126.com
It seems we hit the same issue.


There was a bug in 1.5.1 about a memory leak, but I am using 1.6.1.


Here is the link about the bug in 1.5.1 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark






At 2016-05-12 23:10:43, "Simon Schiff [via Apache Spark User List]" 
<ml-node+s1001560n2694...@n3.nabble.com> wrote:
I read with Spark-Streaming from a Port. The incoming data consists of key and 
value pairs. Then I call forEachRDD on each window. There I create a Dataset 
from the window and do some SQL Querys on it. On the result i only do show, to 
see the content. It works well, but the memory usage increases. When it reaches 
the maximum nothing works anymore. When I use more memory. The Program runs 
some time longer, but the problem persists. Because I run a Programm which 
writes to the Port, I can control perfectly how much Data Spark has to Process. 
When I write every one ms one key and value Pair the Problem is the same as 
when i write only every second a key and value pair to the port.

When I dont create a Dataset in the foreachRDD and only count the Elements in 
the RDD, then everything works fine. I also use groupBy agg functions in the 
querys.


If you reply to this email, your message will be added to the discussion below:
http://apache-spark-user-list.1001560.n3.nabble.com/Will-the-HiveContext-cause-memory-leak-tp26921p26940.html
To unsubscribe from Will the HiveContext cause memory leak ?, click here.
NAML



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Will-the-HiveContext-cause-memory-leak-tp26921p26946.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

RE: SQLContext and HiveContext parse a query string differently ?

2016-05-12 Thread Yong Zhang
Not sure what you mean. You want to have exactly one query running fine in
both SQLContext and HiveContext? The query parsers are different; why do you
want to have this feature? Do I understand your question correctly?
Yong

Date: Thu, 12 May 2016 13:09:34 +0200
Subject: SQLContext and HiveContext parse a query string differently ?
From: inv...@gmail.com
To: user@spark.apache.org

HI,
I just want to figure out why the two contexts behavior differently even on a
simple query. In a netshell, I have a query in which there is a String
containing single quote and casting to Array/Map. I have tried all the
combination of diff type of sql context and query call api (sql, df.select,
df.selectExpr). I can't find one rules all.

Here is the code for reproducing the problem.
-
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.{SparkConf, SparkContext}

object Test extends App {

  val sc  = new SparkContext("local[2]", "test", new SparkConf)
  val hiveContext = new HiveContext(sc)
  val sqlContext  = new SQLContext(sc)

  val context = hiveContext
  //  val context = sqlContext

  import context.implicits._

  val df = Seq((Seq(1, 2), 2)).toDF("a", "b")
  df.registerTempTable("tbl")
  df.printSchema()

  // case 1
  context.sql("select cast(a as array) from tbl").show()
  // HiveContext => org.apache.spark.sql.AnalysisException: cannot recognize 
input near 'array' '<' 'string' in primitive type specification; line 1 pos 17
  // SQLContext => OK

  // case 2
  context.sql("select 'a\\'b'").show()
  // HiveContext => OK
  // SQLContext => failure: ``union'' expected but ErrorToken(unclosed string 
literal) found

  // case 3
  df.selectExpr("cast(a as array)").show() // OK with HiveContext and 
SQLContext

  // case 4
  df.selectExpr("'a\\'b'").show() // HiveContext, SQLContext => failure: end of 
input expected
}
-
Any clarification / workaround is high appreciated.
-- 
Hao Ren
Data Engineer @ leboncoin
Paris, France
  

Re: SQLContext and HiveContext parse a query string differently ?

2016-05-12 Thread Mich Talebzadeh
)
at
org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot
recognize input near 'array' '<' 'string' in primitive type specification;
line 1 pos 17


Let me investigate it further


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 12 May 2016 at 12:09, Hao Ren <inv...@gmail.com> wrote:

> HI,
>
> I just want to figure out why the two contexts behavior differently even
> on a simple query.
> In a netshell, I have a query in which there is a String containing single
> quote and casting to Array/Map.
> I have tried all the combination of diff type of sql context and query
> call api (sql, df.select, df.selectExpr).
> I can't find one rules all.
>
> Here is the code for reproducing the problem.
>
> -
>
> import org.apache.spark.sql.SQLContext
> import org.apache.spark.sql.hive.HiveContext
> import org.apache.spark.{SparkConf, SparkContext}
>
> object Test extends App {
>
>   val sc  = new SparkContext("local[2]", "test", new SparkConf)
>   val hiveContext = new HiveContext(sc)
>   val sqlContext  = new SQLContext(sc)
>
>   val context = hiveContext
>   //  val context = sqlContext
>
>   import context.implicits._
>
>   val df = Seq((Seq(1, 2), 2)).toDF("a", "b")
>   df.registerTempTable("tbl")
>   df.printSchema()
>
>   // case 1
>   context.sql("select cast(a as array) from tbl").show()
>   // HiveContext => org.apache.spark.sql.AnalysisException: cannot recognize 
> input near 'array' '<' 'string' in primitive type specification; line 1 pos 17
>   // SQLContext => OK
>
>   // case 2
>   context.sql("select 'a\\'b'").show()
>   // HiveContext => OK
>   // SQLContext => failure: ``union'' expected but ErrorToken(unclosed string 
> literal) found
>
>   // case 3
>   df.selectExpr("cast(a as array)").show() // OK with HiveContext and 
> SQLContext
>
>   // case 4
>   df.selectExpr("'a\\'b'").show() // HiveContext, SQLContext => failure: end 
> of input expected
> }
>
> -
>
> Any clarification / workaround is high appreciated.
>
> --
> Hao Ren
>
> Data Engineer @ leboncoin
>
> Paris, France
>


SQLContext and HiveContext parse a query string differently ?

2016-05-12 Thread Hao Ren
HI,

I just want to figure out why the two contexts behave differently even on
a simple query.
In a nutshell, I have a query in which there is a String containing a single
quote and a cast to Array/Map.
I have tried all the combinations of different types of SQL context and query
call API (sql, df.select, df.selectExpr).
I can't find one that rules all.

Here is the code for reproducing the problem.
-

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.{SparkConf, SparkContext}

object Test extends App {

  val sc  = new SparkContext("local[2]", "test", new SparkConf)
  val hiveContext = new HiveContext(sc)
  val sqlContext  = new SQLContext(sc)

  val context = hiveContext
  //  val context = sqlContext

  import context.implicits._

  val df = Seq((Seq(1, 2), 2)).toDF("a", "b")
  df.registerTempTable("tbl")
  df.printSchema()

  // case 1
  context.sql("select cast(a as array) from tbl").show()
  // HiveContext => org.apache.spark.sql.AnalysisException: cannot
recognize input near 'array' '<' 'string' in primitive type
specification; line 1 pos 17
  // SQLContext => OK

  // case 2
  context.sql("select 'a\\'b'").show()
  // HiveContext => OK
  // SQLContext => failure: ``union'' expected but ErrorToken(unclosed
string literal) found

  // case 3
  df.selectExpr("cast(a as array)").show() // OK with
HiveContext and SQLContext

  // case 4
  df.selectExpr("'a\\'b'").show() // HiveContext, SQLContext =>
failure: end of input expected
}

-

Any clarification / workaround is highly appreciated.
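
One workaround that sidesteps both SQL parsers is to build the cast and the literal with the DataFrame API instead of a query string. A hedged sketch against the same df as in the snippet above (whether it covers every case in the question is untested):

import org.apache.spark.sql.functions.{col, lit}
import org.apache.spark.sql.types.{ArrayType, StringType}

// Cast the array<int> column programmatically and add the quoted literal,
// so neither the HiveQL nor the Spark SQL parser has to handle the string.
df.select(col("a").cast(ArrayType(StringType)), lit("a'b")).show()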

-- 
Hao Ren

Data Engineer @ leboncoin

Paris, France


Re:Re: Will the HiveContext cause memory leak ?

2016-05-11 Thread kramer2...@126.com
Hi Simon


Can you describe your problem in more detail?
I suspect that my problem is because of the window function (or maybe the groupBy
agg functions).
If yours is the same, maybe we should report a bug.






At 2016-05-11 23:46:49, "Simon Schiff [via Apache Spark User List]" 
<ml-node+s1001560n26930...@n3.nabble.com> wrote:
I have the same Problem with Spark-2.0.0 Snapshot with Streaming. There I use 
Datasets instead of Dataframes. I hope you or someone will find a solution.


If you reply to this email, your message will be added to the discussion below:
http://apache-spark-user-list.1001560.n3.nabble.com/Will-the-HiveContext-cause-memory-leak-tp26921p26930.html
To unsubscribe from Will the HiveContext cause memory leak ?, click here.
NAML



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Will-the-HiveContext-cause-memory-leak-tp26921p26934.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Will the HiveContext cause memory leak ?

2016-05-11 Thread kramer2...@126.com
Sorry, I have to make a correction again. It may still be a memory leak, because
at last the memory usage goes up again...

Eventually, the streaming program crashed.





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Will-the-HiveContext-cause-memory-leak-tp26921p26933.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Will the HiveContext cause memory leak ?

2016-05-11 Thread kramer2...@126.com
After 8 hours the memory usage became stable. Using the top command I can see
it sits at 75%, which means 12 GB of memory.


But it still does not make sense, because my workload is very small.


I use this Spark job to calculate on one CSV file every 20 seconds. The size of
the CSV file is 1.3 MB.


So Spark is using almost 10,000 times more memory than my workload. Does that
mean I need to prepare 1 TB of RAM if the workload is 100 MB?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Will-the-HiveContext-cause-memory-leak-tp26921p26927.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re:Re: Will the HiveContext cause memory leak ?

2016-05-10 Thread 李明伟
Hi Ted


Spark version: spark-1.6.0-bin-hadoop2.6
I tried increasing the executor memory; I still have the same problem.
I can use jmap to capture something, but the output is too difficult to
understand.










在 2016-05-11 11:50:14,"Ted Yu" <yuzhih...@gmail.com> 写道:

Which Spark release are you using ?


I assume executor crashed due to OOME.


Did you have a chance to capture jmap on the executor before it crashed ?


Have you tried giving more memory to the executor ?


Thanks


On Tue, May 10, 2016 at 8:25 PM, kramer2...@126.com<kramer2...@126.com> wrote:
I submit my code to a spark stand alone cluster. Find the memory usage
executor process keeps growing. Which cause the program to crash.

I modified the code and submit several times. Find below 4 line may causing
the issue

dataframe =
dataframe.groupBy(['router','interface']).agg(func.sum('bits').alias('bits'))
windowSpec =
Window.partitionBy(dataframe['router']).orderBy(dataframe['bits'].desc())
rank = func.dense_rank().over(windowSpec)
ret =
dataframe.select(dataframe['router'],dataframe['interface'],dataframe['bits'],
rank.alias('rank')).filter("rank<=2")

It looks a little complicated but it is just some Window function on
dataframe. I use the HiveContext because SQLContext do not support window
function yet. Without the 4 line, my code can run all night. Adding them
will cause the memory leak. Program will crash in a few hours.

I will provided the whole code (50 lines)here.  ForAsk01.py
<http://apache-spark-user-list.1001560.n3.nabble.com/file/n26921/ForAsk01.py>
Please advice me if it is a bug..

Also here is the submit command

nohup ./bin/spark-submit  \
--master spark://ES01:7077 \
--executor-memory 4G \
--num-executors 1 \
--total-executor-cores 1 \
--conf "spark.storage.memoryFraction=0.2"  \
./ForAsk.py 1>a.log 2>b.log &





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Will-the-HiveContext-cause-memory-leak-tp26921.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org





Re: Will the HiveContext cause memory leak ?

2016-05-10 Thread Ted Yu
Which Spark release are you using ?

I assume executor crashed due to OOME.

Did you have a chance to capture jmap on the executor before it crashed ?

Have you tried giving more memory to the executor ?

Thanks

On Tue, May 10, 2016 at 8:25 PM, kramer2...@126.com <kramer2...@126.com>
wrote:

> I submit my code to a spark stand alone cluster. Find the memory usage
> executor process keeps growing. Which cause the program to crash.
>
> I modified the code and submit several times. Find below 4 line may causing
> the issue
>
> dataframe =
>
> dataframe.groupBy(['router','interface']).agg(func.sum('bits').alias('bits'))
> windowSpec =
> Window.partitionBy(dataframe['router']).orderBy(dataframe['bits'].desc())
> rank = func.dense_rank().over(windowSpec)
> ret =
>
> dataframe.select(dataframe['router'],dataframe['interface'],dataframe['bits'],
> rank.alias('rank')).filter("rank<=2")
>
> It looks a little complicated but it is just some Window function on
> dataframe. I use the HiveContext because SQLContext do not support window
> function yet. Without the 4 line, my code can run all night. Adding them
> will cause the memory leak. Program will crash in a few hours.
>
> I will provided the whole code (50 lines)here.  ForAsk01.py
> <
> http://apache-spark-user-list.1001560.n3.nabble.com/file/n26921/ForAsk01.py
> >
> Please advice me if it is a bug..
>
> Also here is the submit command
>
> nohup ./bin/spark-submit  \
> --master spark://ES01:7077 \
> --executor-memory 4G \
> --num-executors 1 \
> --total-executor-cores 1 \
> --conf "spark.storage.memoryFraction=0.2"  \
> ./ForAsk.py 1>a.log 2>b.log &
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Will-the-HiveContext-cause-memory-leak-tp26921.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Will the HiveContext cause memory leak ?

2016-05-10 Thread kramer2...@126.com
I submit my code to a Spark standalone cluster and find that the memory usage
of the executor process keeps growing, which causes the program to crash.

I modified the code and submitted it several times, and found that the 4 lines
below may be causing the issue:

dataframe = dataframe.groupBy(['router', 'interface']) \
                     .agg(func.sum('bits').alias('bits'))
windowSpec = Window.partitionBy(dataframe['router']).orderBy(dataframe['bits'].desc())
rank = func.dense_rank().over(windowSpec)
ret = dataframe.select(dataframe['router'], dataframe['interface'], dataframe['bits'],
                       rank.alias('rank')).filter("rank <= 2")

It looks a little complicated but it is just some window functions on a
dataframe. I use the HiveContext because SQLContext does not support window
functions yet. Without the 4 lines, my code can run all night. Adding them
causes the memory leak, and the program crashes in a few hours.

I have provided the whole code (50 lines) here:  ForAsk01.py
<http://apache-spark-user-list.1001560.n3.nabble.com/file/n26921/ForAsk01.py>
Please advise me if it is a bug.

Also here is the submit command 

nohup ./bin/spark-submit  \  
--master spark://ES01:7077 \
--executor-memory 4G \
--num-executors 1 \
--total-executor-cores 1 \
--conf "spark.storage.memoryFraction=0.2"  \
./ForAsk.py 1>a.log 2>b.log &





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Will-the-HiveContext-cause-memory-leak-tp26921.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How to stop hivecontext

2016-04-15 Thread Ted Yu
You can call the stop() method.
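
Presumably this means stopping the underlying SparkContext, since the SQL/Hive context itself has no separate lifecycle in Spark 1.x (a hedged reading, not confirmed in this thread):

// sqlContext is the HiveContext created below; stopping its SparkContext
// releases the resources associated with it.
sqlContext.sparkContext.stop()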

> On Apr 15, 2016, at 5:21 AM, ram kumar <ramkumarro...@gmail.com> wrote:
> 
> Hi,
> I started hivecontext as,
> 
> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc);
> 
> I want to stop this sql context
> 
> Thanks

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



How to stop hivecontext

2016-04-15 Thread ram kumar
Hi,
I started hivecontext as,

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc);

I want to stop this sql context

Thanks


HiveContext in spark

2016-04-12 Thread Selvam Raman
I am not able to use the insert, update and delete commands in HiveContext.

I am using Spark version 1.6.1 and Hive 1.1.0.

Please find the error below.



​scala> hc.sql("delete from  trans_detail where counter=1");
16/04/12 14:58:45 INFO ParseDriver: Parsing command: delete from
 trans_detail where counter=1
16/04/12 14:58:45 INFO ParseDriver: Parse Completed
16/04/12 14:58:45 INFO ParseDriver: Parsing command: delete from
 trans_detail where counter=1
16/04/12 14:58:45 INFO ParseDriver: Parse Completed
16/04/12 14:58:45 INFO BlockManagerInfo: Removed broadcast_2_piece0 on
localhost:60409 in memory (size: 46.9 KB, free: 536.7 MB)
16/04/12 14:58:46 INFO ContextCleaner: Cleaned accumulator 3
16/04/12 14:58:46 INFO BlockManagerInfo: Removed broadcast_4_piece0 on
localhost:60409 in memory (size: 3.6 KB, free: 536.7 MB)
org.apache.spark.sql.AnalysisException:
Unsupported language features in query: delete from  trans_detail where
counter=1
TOK_DELETE_FROM 1, 0,11, 13
  TOK_TABNAME 1, 5,5, 13
trans_detail 1, 5,5, 13
  TOK_WHERE 1, 7,11, 39
= 1, 9,11, 39
  TOK_TABLE_OR_COL 1, 9,9, 32
counter 1, 9,9, 32
  1 1, 11,11, 40

scala.NotImplementedError: No parse rules for TOK_DELETE_FROM:
 TOK_DELETE_FROM 1, 0,11, 13
  TOK_TABNAME 1, 5,5, 13
trans_detail 1, 5,5, 13
  TOK_WHERE 1, 7,11, 39
= 1, 9,11, 39
  TOK_TABLE_OR_COL 1, 9,9, 32
counter 1, 9,9, 32
  1 1, 11,11, 40

org.apache.spark.sql.hive.HiveQl$.nodeToPlan(HiveQl.scala:1217)
​



-- 
Selvam Raman
"லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"
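
The parse error above ("No parse rules for TOK_DELETE_FROM") means HiveContext in Spark 1.6 simply does not support Hive's ACID DELETE/UPDATE statements. A common workaround is sketched below, reusing the table and column names from the error; writing to a new table is an assumption made to avoid reading and overwriting the same table:

// Rewrite the data without the unwanted rows instead of deleting in place.
val remaining = hc.sql("SELECT * FROM trans_detail WHERE counter <> 1")
remaining.write.mode("overwrite").saveAsTable("trans_detail_cleaned")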


Re: HiveContext unable to recognize the delimiter of Hive table in textfile partitioned by date

2016-04-11 Thread Shiva Achari
Hi All,

In the above scenario, if the field delimiter is Hive's default then Spark
is able to parse the data as expected; hence I believe this is a bug.

​Regards,
Shiva Achari​


On Tue, Apr 5, 2016 at 8:15 PM, Shiva Achari  wrote:

> Hi,
>
> I have created a hive external table stored as textfile partitioned by
> event_date Date.
>
> How do we have to specify a specific format of csv while reading in spark
> from Hive table ?
>
> The environment is
>
>  1. 1.Spark 1.5.0 - cdh5.5.1 Using Scala version 2.10.4(Java
> HotSpot(TM) 64 - Bit Server VM, Java 1.7.0_67)
>  2. Hive 1.1, CDH 5.5.1
>
> scala script
>
> sqlContext.setConf("hive.exec.dynamic.partition", "true")
> sqlContext.setConf("hive.exec.dynamic.partition.mode",
> "nonstrict")
>
> val distData = sc.parallelize(Array((1, 1, 1), (2, 2, 2), (3, 3,
> 3))).toDF
> val distData_1 = distData.withColumn("event_date", current_date())
> distData_1: org.apache.spark.sql.DataFrame = [_1: int, _2: int,
> _3: int, event_date: date]
>
> scala > distData_1.show
> + ---+---+---+--+
> |_1 |_2 |_3 | event_date |
> | 1 | 1 | 1 | 2016-03-25 |
> | 2 | 2 | 2 | 2016-03-25 |
> | 3 | 3 | 3 | 2016-03-25 |
>
>
> distData_1.write.mode("append").partitionBy("event_date").saveAsTable("part_table")
>
>
> scala > sqlContext.sql("select * from part_table").show
> | a| b| c| event_date |
> |1,1,1 | null | null | 2016-03-25 |
> |2,2,2 | null | null | 2016-03-25 |
> |3,3,3 | null | null | 2016-03-25 |
>
>
>
> Hive table
>
> create external table part_table (a String, b int, c bigint)
> partitioned by (event_date Date)
> row format delimited fields terminated by ','
> stored as textfile  LOCATION "/user/hdfs/hive/part_table";
>
> select * from part_table shows
> |part_table.a | part_table.b | part_table.c |
> part_table.event_date |
> |1 |1 |1
>  |2016-03-25
> |2 |2 |2
>  |2016-03-25
> |3 |3 |3
>  |2016-03-25
>
>
> Looking at the hdfs
>
>
> The path has 2 part files
> /user/hdfs/hive/part_table/event_date=2016-03-25
> part-0
> part-1
>
>   part-0 content
> 1,1,1
>   part-1 content
> 2,2,2
> 3,3,3
>
>
> P.S. if we store the table as orc it writes and reads the data as
> expected.
>
>


Spark demands HiveContext but I use only SqlContext

2016-04-11 Thread AlexModestov
Hello!
I work with SQLContext; I create a query against MS SQL Server and get data...
Spark says to me that I have to install Hive...
I have started to use Spark 1.6.1 (before that I used Spark 1.5 and I had never
heard about this requirement)...


Py4JJavaError: An error occurred while calling
None.org.apache.spark.sql.hive.HiveContext.
: java.lang.RuntimeException: java.lang.RuntimeException: Unable to
instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
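
If no Hive functionality is needed, one hedged way around this is to build a plain SQLContext explicitly instead of relying on the shell's default context, and read from SQL Server over JDBC. A Scala sketch of the idea (connection details are placeholders, and the SQL Server JDBC driver is assumed to be on the classpath):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)   // plain SQLContext, no Hive metastore involved
val df = sqlContext.read.format("jdbc")
  .option("url", "jdbc:sqlserver://dbhost:1433;databaseName=mydb") // placeholder
  .option("dbtable", "dbo.mytable")                                // placeholder
  .option("user", "...")
  .option("password", "...")
  .load()
df.show()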



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-demands-HiveContext-but-I-use-only-SqlContext-tp26738.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



HiveContext unable to recognize the delimiter of Hive table in textfile partitioned by date

2016-04-05 Thread Shiva Achari
Hi,

I have created a hive external table stored as textfile partitioned by
event_date Date.

How do we have to specify a specific format of csv while reading in spark
from Hive table ?

The environment is

 1. Spark 1.5.0 - cdh5.5.1, using Scala version 2.10.4 (Java
HotSpot(TM) 64-Bit Server VM, Java 1.7.0_67)
 2. Hive 1.1, CDH 5.5.1

scala script

sqlContext.setConf("hive.exec.dynamic.partition", "true")
sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")

val distData = sc.parallelize(Array((1, 1, 1), (2, 2, 2), (3, 3,
3))).toDF
val distData_1 = distData.withColumn("event_date", current_date())
distData_1: org.apache.spark.sql.DataFrame = [_1: int, _2: int, _3:
int, event_date: date]

scala > distData_1.show
+ ---+---+---+--+
|_1 |_2 |_3 | event_date |
| 1 | 1 | 1 | 2016-03-25 |
| 2 | 2 | 2 | 2016-03-25 |
| 3 | 3 | 3 | 2016-03-25 |


distData_1.write.mode("append").partitionBy("event_date").saveAsTable("part_table")


scala > sqlContext.sql("select * from part_table").show
| a| b| c| event_date |
|1,1,1 | null | null | 2016-03-25 |
|2,2,2 | null | null | 2016-03-25 |
|3,3,3 | null | null | 2016-03-25 |



Hive table

create external table part_table (a String, b int, c bigint)
partitioned by (event_date Date)
row format delimited fields terminated by ','
stored as textfile  LOCATION "/user/hdfs/hive/part_table";

select * from part_table shows
|part_table.a | part_table.b | part_table.c | part_table.event_date |
|1            | 1            | 1            | 2016-03-25            |
|2            | 2            | 2            | 2016-03-25            |
|3            | 3            | 3            | 2016-03-25            |


Looking at the hdfs


The path has 2 part files
/user/hdfs/hive/part_table/event_date=2016-03-25
part-0
part-1

  part-0 content
1,1,1
  part-1 content
2,2,2
3,3,3


P.S. if we store the table as orc it writes and reads the data as expected.


Spark SQL(Hive query through HiveContext) always creating 31 partitions

2016-04-04 Thread nitinkak001
I am running Hive queries using HiveContext from my Spark code. No matter
which query I run and how much data there is, it always generates 31
partitions. Does anybody know the reason? Is there a predefined/configurable
setting for it? I essentially need more partitions.

I am using this code snippet to execute the Hive query:

var pairedRDD = hqlContext.sql(hql).rdd.map(...)
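
The 31 partitions most likely mirror the number of input splits of the underlying table, so two hedged ways to get more (reusing the names from the snippet above) are to raise the shuffle parallelism or to repartition explicitly:

// Raise parallelism for shuffles (joins/aggregations) done by the SQL engine:
hqlContext.setConf("spark.sql.shuffle.partitions", "200")

// or repartition the resulting RDD explicitly before the map(...) step:
val moreParts = hqlContext.sql(hql).rdd.repartition(200)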

Thanks,
Nitin



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Hive-query-through-HiveContext-always-creating-31-partitions-tp26671.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



FW: How to get the singleton instance of SQLContext/HiveContext: val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)‏

2016-03-04 Thread Jelez Raditchkov

 
From: je...@hotmail.com
To: yuzhih...@gmail.com
Subject: RE: How to get the singleton instance of SQLContext/HiveContext: val 
sqlContext = SQLContext.getOrCreate(rdd.sparkContext)‏
Date: Fri, 4 Mar 2016 14:09:20 -0800




The code below is from the sources, is this what you ask?
 
class HiveContext private[hive](
    sc: SparkContext,
    cacheManager: CacheManager,
    listener: SQLListener,
    @transient private val execHive: HiveClientImpl,
    @transient private val metaHive: HiveClient,
    isRootContext: Boolean)
  extends SQLContext(sc, cacheManager, listener, isRootContext) with Logging {

 
J
 
Date: Fri, 4 Mar 2016 13:53:38 -0800
Subject: Re: How to get the singleton instance of SQLContext/HiveContext: val 
sqlContext = SQLContext.getOrCreate(rdd.sparkContext)‏
From: yuzhih...@gmail.com
To: je...@hotmail.com
CC: user@spark.apache.org

bq. However the method does not seem inherited to HiveContext.

Can you clarify the above observation? HiveContext extends SQLContext.

On Fri, Mar 4, 2016 at 1:23 PM, jelez <je...@hotmail.com> wrote:
What is the best approach to use getOrCreate for streaming job with

HiveContext.

It seems for SQLContext the recommended approach is to use getOrCreate:

https://spark.apache.org/docs/latest/streaming-programming-guide.html#dataframe-and-sql-operations

val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)

However the method does not seem inherited to HiveContext.

I currently create my own singleton class and use it like this:

val sqlContext =

SQLHiveContextSingleton.getInstance(linesRdd.sparkContext)



However, I am not sure if this is reliable. What would be the best approach?

Any examples?







--

View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-get-the-singleton-instance-of-SQLContext-HiveContext-val-sqlContext-SQLContext-getOrCreate-rd-tp26399.html

Sent from the Apache Spark User List mailing list archive at Nabble.com.



-

To unsubscribe, e-mail: user-unsubscr...@spark.apache.org

For additional commands, e-mail: user-h...@spark.apache.org





  

Re: How to get the singleton instance of SQLContext/HiveContext: val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)‏

2016-03-04 Thread Ted Yu
bq. However the method does not seem inherited to HiveContext.

Can you clarify the above observation ?
HiveContext extends SQLContext .

On Fri, Mar 4, 2016 at 1:23 PM, jelez <je...@hotmail.com> wrote:

> What is the best approach to use getOrCreate for streaming job with
> HiveContext.
> It seems for SQLContext the recommended approach is to use getOrCreate:
>
> https://spark.apache.org/docs/latest/streaming-programming-guide.html#dataframe-and-sql-operations
> val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
> However the method does not seem inherited to HiveContext.
> I currently create my own singleton class and use it like this:
> val sqlContext =
> SQLHiveContextSingleton.getInstance(linesRdd.sparkContext)
>
> However, i am not sure if this is reliable. What would be the best
> approach?
> Any examples?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-get-the-singleton-instance-of-SQLContext-HiveContext-val-sqlContext-SQLContext-getOrCreate-rd-tp26399.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


How to get the singleton instance of SQLContext/HiveContext: val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)‏

2016-03-04 Thread jelez
What is the best approach to use getOrCreate for streaming job with
HiveContext.
It seems for SQLContext the recommended approach is to use getOrCreate:
https://spark.apache.org/docs/latest/streaming-programming-guide.html#dataframe-and-sql-operations
val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
However the method does not seem inherited to HiveContext.
I currently create my own singleton class and use it like this:
val sqlContext =
SQLHiveContextSingleton.getInstance(linesRdd.sparkContext)

However, I am not sure if this is reliable. What would be the best approach?
Any examples?
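
A minimal sketch of the lazily initialised singleton described above. The
object name SQLHiveContextSingleton is taken from the question; it is not a
Spark API, just the pattern the streaming guide shows for SQLContext, applied
to HiveContext.

import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext

object SQLHiveContextSingleton {

  @transient private var instance: HiveContext = _

  // Create the HiveContext once per driver and hand the same instance back
  // on every subsequent call (for example from foreachRDD in a streaming job).
  def getInstance(sc: SparkContext): HiveContext = synchronized {
    if (instance == null) {
      instance = new HiveContext(sc)
    }
    instance
  }
}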



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-get-the-singleton-instance-of-SQLContext-HiveContext-val-sqlContext-SQLContext-getOrCreate-rd-tp26399.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




How to get the singleton instance of SQLContext/HiveContext: val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)

2016-03-04 Thread Jelez Raditchkov
What is the best approach to use getOrCreate for a streaming job with
HiveContext? It seems for SQLContext the recommended approach is to use
getOrCreate:
https://spark.apache.org/docs/latest/streaming-programming-guide.html#dataframe-and-sql-operations

val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)

However the method does not seem inherited to HiveContext. I currently create
my own singleton class and use it like this:

val sqlContext = SQLHiveContextSingleton.getInstance(linesRdd.sparkContext)

However, I am not sure if this is reliable. What would be the best approach?
Any examples?
  

Re: SPARK SQL HiveContext Error

2016-03-01 Thread Gourav Sengupta
 Let us start by counting the number of occurrences of "Berkeley" 
>>> across all Wikipedia articles
>>> val count = succinctRDD.count("the")
>>>
>>> // Now suppose we want to find all offsets in the collection at which 
>>> "Berkeley" occurs; and
>>>     // create an RDD containing all resulting offsets
>>> val offsetsRDD = succinctRDD.search("and")
>>>
>>> // Let us look at the first ten results in the above RDD
>>> val offsets = offsetsRDD.take(10)
>>>
>>> // Finally, let us extract 20 bytes before and after one of the 
>>> occurrences of "Berkeley"
>>> val offset = offsets(0)
>>> val data = succinctRDD.extract(offset - 20, 40)
>>>
>>> println(data)
>>> println(">>>")
>>>
>>>
>>> // Create a schema
>>> val citySchema = StructType(Seq(
>>>   StructField("Name", StringType, false),
>>>   StructField("Length", IntegerType, true),
>>>   StructField("Area", DoubleType, false),
>>>   StructField("Airport", BooleanType, true)))
>>>
>>> // Create an RDD of Rows with some data
>>> val cityRDD = sc.parallelize(Seq(
>>>   Row("San Francisco", 12, 44.52, true),
>>>   Row("Palo Alto", 12, 22.33, false),
>>>   Row("Munich", 8, 3.14, true)))
>>>
>>>
>>> val hiveContext = new HiveContext(sc)
>>>
>>> //val sqlContext = new org.apache.spark.sql.SQLContext(sc)
>>>
>>>   }
>>> }
>>>
>>>
>>> -
>>>
>>>
>>>
>>> Regards,
>>> Gourav Sengupta
>>>
>>
>>
>


Re: SPARK SQL HiveContext Error

2016-03-01 Thread Gourav Sengupta
Hi,

FIRST ATTEMPT:
Used build.sbt in IntelliJ; it gave me nightmares with several incompatibility
and library issues, even though the sbt version was compatible with the Scala
version.

SECOND ATTEMPT:
Created a new project with no entries in the build.sbt file and imported all
the files in $SPARK_HOME/lib/*jar into the project. This started causing the
issues I reported earlier.

FINAL ATTEMPT:
Removed from the imported dependencies all the jars whose names contain the
word "derby", and this resolved the issue.

Please note that the following additional jars were included in the library
folder, beyond the ones usually supplied with the Spark distribution:
1. ojdbc7.jar
2. spark-csv***jar file


Regards,
Gourav Sengupta
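
For reference, a hedged build.sbt sketch of the usual alternative to
hand-importing $SPARK_HOME/lib/*.jar into the IDE: declaring Spark as a
"provided" dependency compiles the project against one consistent set of Spark
jars, which is one way to avoid ending up with duplicate (sealed) Derby classes
on the classpath. The versions below are assumptions matching this thread
(Spark 1.6.x, Scala 2.10).

scalaVersion := "2.10.6"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.6.0" % "provided",
  "org.apache.spark" %% "spark-sql"  % "1.6.0" % "provided",
  "org.apache.spark" %% "spark-hive" % "1.6.0" % "provided"
)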

On Tue, Mar 1, 2016 at 5:19 PM, Gourav Sengupta <gourav.sengu...@gmail.com>
wrote:

> Hi,
>
> I am getting the error  "*java.lang.SecurityException: sealing violation:
> can't seal package org.apache.derby.impl.services.locks: already loaded"*
>   after running the following code in SCALA.
>
> I do not have any other instances of sparkContext running from my system.
>
> I will be grateful for if anyone could kindly help me out.
>
>
> Environment:
> SCALA: 1.6
> OS: MAC OS X
>
> 
>
> import org.apache.spark.SparkContext
> import org.apache.spark.SparkConf
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.hive.HiveContext
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.SQLContext
>
> // Import SuccinctRDD
> import edu.berkeley.cs.succinct._
>
> object test1 {
>   def main(args: Array[String]) {
> //the below line returns nothing
> println(SparkContext.jarOfClass(this.getClass).toString())
> val logFile = "/tmp/README.md" // Should be some file on your system
>
> val conf = new 
> SparkConf().setAppName("IdeaProjects").setMaster("local[*]")
> val sc = new SparkContext(conf)
> val logData = sc.textFile(logFile, 2).cache()
> val numAs = logData.filter(line => line.contains("a")).count()
> val numBs = logData.filter(line => line.contains("b")).count()
> println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
>
>
> // Create a Spark RDD as a collection of articles; ctx is the SparkContext
> val articlesRDD = sc.textFile("/tmp/README.md").map(_.getBytes)
>
> // Compress the Spark RDD into a Succinct Spark RDD, and persist it in 
> memory
> // Note that this is a time consuming step (usually at 8GB/hour/core) 
> since data needs to be compressed.
> // We are actively working on making this step faster.
> val succinctRDD = articlesRDD.succinct.persist()
>
>
> // SuccinctRDD supports a set of powerful primitives directly on 
> compressed RDD
> // Let us start by counting the number of occurrences of "Berkeley" 
> across all Wikipedia articles
> val count = succinctRDD.count("the")
>
> // Now suppose we want to find all offsets in the collection at which 
> ìBerkeleyî occurs; and
> // create an RDD containing all resulting offsets
> val offsetsRDD = succinctRDD.search("and")
>
> // Let us look at the first ten results in the above RDD
> val offsets = offsetsRDD.take(10)
>
> // Finally, let us extract 20 bytes before and after one of the 
> occurrences of "Berkeley"
> val offset = offsets(0)
> val data = succinctRDD.extract(offset - 20, 40)
>
> println(data)
> println(">>>")
>
>
> // Create a schema
> val citySchema = StructType(Seq(
>   StructField("Name", StringType, false),
>   StructField("Length", IntegerType, true),
>   StructField("Area", DoubleType, false),
>   StructField("Airport", BooleanType, true)))
>
> // Create an RDD of Rows with some data
> val cityRDD = sc.parallelize(Seq(
>   Row("San Francisco", 12, 44.52, true),
>   Row("Palo Alto", 12, 22.33, false),
>   Row("Munich", 8, 3.14, true)))
>
>
> val hiveContext = new HiveContext(sc)
>
> //val sqlContext = new org.apache.spark.sql.SQLContext(sc)
>
>   }
> }
>
>
> -
>
>
>
> Regards,
> Gourav Sengupta
>


SPARK SQL HiveContext Error

2016-03-01 Thread Gourav Sengupta
Hi,

I am getting the error "java.lang.SecurityException: sealing violation:
can't seal package org.apache.derby.impl.services.locks: already loaded"
after running the following code in Scala.

I do not have any other instances of SparkContext running on my system.

I will be grateful if anyone could kindly help me out.


Environment:
SCALA: 1.6
OS: MAC OS X



import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.Row
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.types._
import org.apache.spark.sql.SQLContext

// Import SuccinctRDD
import edu.berkeley.cs.succinct._

object test1 {
  def main(args: Array[String]) {
//the below line returns nothing
println(SparkContext.jarOfClass(this.getClass).toString())
val logFile = "/tmp/README.md" // Should be some file on your system

val conf = new SparkConf().setAppName("IdeaProjects").setMaster("local[*]")
val sc = new SparkContext(conf)
val logData = sc.textFile(logFile, 2).cache()
val numAs = logData.filter(line => line.contains("a")).count()
val numBs = logData.filter(line => line.contains("b")).count()
println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))


// Create a Spark RDD as a collection of articles; ctx is the SparkContext
val articlesRDD = sc.textFile("/tmp/README.md").map(_.getBytes)

// Compress the Spark RDD into a Succinct Spark RDD, and persist it in memory
// Note that this is a time consuming step (usually at 8GB/hour/core) since data needs to be compressed.
// We are actively working on making this step faster.
val succinctRDD = articlesRDD.succinct.persist()


// SuccinctRDD supports a set of powerful primitives directly on compressed RDD
// Let us start by counting the number of occurrences of "Berkeley" across all Wikipedia articles
val count = succinctRDD.count("the")

// Now suppose we want to find all offsets in the collection at which "Berkeley" occurs; and
// create an RDD containing all resulting offsets
val offsetsRDD = succinctRDD.search("and")

// Let us look at the first ten results in the above RDD
val offsets = offsetsRDD.take(10)

// Finally, let us extract 20 bytes before and after one of the occurrences of "Berkeley"
val offset = offsets(0)
val data = succinctRDD.extract(offset - 20, 40)

println(data)
println(">>>")


// Create a schema
val citySchema = StructType(Seq(
  StructField("Name", StringType, false),
  StructField("Length", IntegerType, true),
  StructField("Area", DoubleType, false),
  StructField("Airport", BooleanType, true)))

// Create an RDD of Rows with some data
val cityRDD = sc.parallelize(Seq(
  Row("San Francisco", 12, 44.52, true),
  Row("Palo Alto", 12, 22.33, false),
  Row("Munich", 8, 3.14, true)))


val hiveContext = new HiveContext(sc)

//val sqlContext = new org.apache.spark.sql.SQLContext(sc)

  }
}


-



Regards,
Gourav Sengupta


Re: Creating HiveContext in Spark-Shell fails

2016-02-15 Thread Gavin Yue
This sqlContext is an instance of HiveContext; do not be confused by the name.
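
A short sketch of the point above, assuming a Hive-enabled spark-shell build:
the pre-built sqlContext can simply be narrowed and reused, while constructing
a second HiveContext in the same shell is what runs into the embedded Derby
metastore already opened by the first one (the XSDB6 error quoted below).

import org.apache.spark.sql.hive.HiveContext

// Reuse the context the shell already created instead of new HiveContext(sc).
val hc = sqlContext.asInstanceOf[HiveContext]
hc.sql("SHOW TABLES").show()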



> On Feb 16, 2016, at 12:51, Prabhu Joseph <prabhujose.ga...@gmail.com> wrote:
> 
> Hi All,
> 
> On creating HiveContext in spark-shell, fails with 
> 
> Caused by: ERROR XSDB6: Another instance of Derby may have already booted the 
> database /SPARK/metastore_db.
> 
> Spark-Shell already has created metastore_db for SqlContext. 
> 
> Spark context available as sc.
> SQL context available as sqlContext.
> 
> But without HiveContext, i am able to query the data using SqlContext . 
> 
> scala>  var df = 
> sqlContext.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").load("/SPARK/abc")
> df: org.apache.spark.sql.DataFrame = [Prabhu: string, Joseph: string]
> 
> So is there any real need for HiveContext inside Spark Shell. Is everything 
> that can be done with HiveContext, achievable with SqlContext inside Spark 
> Shell.
> 
> 
> 
> Thanks,
> Prabhu Joseph
> 
> 
> 
> 




Re: Creating HiveContext in Spark-Shell fails

2016-02-15 Thread Prabhu Joseph
Thanks Mark, that answers my question.

On Tue, Feb 16, 2016 at 10:55 AM, Mark Hamstra <m...@clearstorydata.com>
wrote:

> Welcome to
>
>     __
>
>  / __/__  ___ _/ /__
>
> _\ \/ _ \/ _ `/ __/  '_/
>
>/___/ .__/\_,_/_/ /_/\_\   version 2.0.0-SNAPSHOT
>
>   /_/
>
>
>
> Using Scala version 2.11.7 (Java HotSpot(TM) 64-Bit Server VM, Java
> 1.8.0_72)
>
> Type in expressions to have them evaluated.
>
> Type :help for more information.
>
>
> scala> sqlContext.isInstanceOf[org.apache.spark.sql.hive.HiveContext]
>
> res0: Boolean = true
>
>
>
> On Mon, Feb 15, 2016 at 8:51 PM, Prabhu Joseph <prabhujose.ga...@gmail.com
> > wrote:
>
>> Hi All,
>>
>> On creating HiveContext in spark-shell, fails with
>>
>> Caused by: ERROR XSDB6: Another instance of Derby may have already booted
>> the database /SPARK/metastore_db.
>>
>> Spark-Shell already has created metastore_db for SqlContext.
>>
>> Spark context available as sc.
>> SQL context available as sqlContext.
>>
>> But without HiveContext, i am able to query the data using SqlContext .
>>
>> scala>  var df =
>> sqlContext.read.format("com.databricks.spark.csv").option("header",
>> "true").option("inferSchema", "true").load("/SPARK/abc")
>> df: org.apache.spark.sql.DataFrame = [Prabhu: string, Joseph: string]
>>
>> So is there any real need for HiveContext inside Spark Shell. Is
>> everything that can be done with HiveContext, achievable with SqlContext
>> inside Spark Shell.
>>
>>
>>
>> Thanks,
>> Prabhu Joseph
>>
>>
>>
>>
>>
>


Re: Creating HiveContext in Spark-Shell fails

2016-02-15 Thread Mark Hamstra
Welcome to

    __

 / __/__  ___ _/ /__

_\ \/ _ \/ _ `/ __/  '_/

   /___/ .__/\_,_/_/ /_/\_\   version 2.0.0-SNAPSHOT

  /_/



Using Scala version 2.11.7 (Java HotSpot(TM) 64-Bit Server VM, Java
1.8.0_72)

Type in expressions to have them evaluated.

Type :help for more information.


scala> sqlContext.isInstanceOf[org.apache.spark.sql.hive.HiveContext]

res0: Boolean = true



On Mon, Feb 15, 2016 at 8:51 PM, Prabhu Joseph <prabhujose.ga...@gmail.com>
wrote:

> Hi All,
>
> On creating HiveContext in spark-shell, fails with
>
> Caused by: ERROR XSDB6: Another instance of Derby may have already booted
> the database /SPARK/metastore_db.
>
> Spark-Shell already has created metastore_db for SqlContext.
>
> Spark context available as sc.
> SQL context available as sqlContext.
>
> But without HiveContext, i am able to query the data using SqlContext .
>
> scala>  var df =
> sqlContext.read.format("com.databricks.spark.csv").option("header",
> "true").option("inferSchema", "true").load("/SPARK/abc")
> df: org.apache.spark.sql.DataFrame = [Prabhu: string, Joseph: string]
>
> So is there any real need for HiveContext inside Spark Shell. Is
> everything that can be done with HiveContext, achievable with SqlContext
> inside Spark Shell.
>
>
>
> Thanks,
> Prabhu Joseph
>
>
>
>
>


Creating HiveContext in Spark-Shell fails

2016-02-15 Thread Prabhu Joseph
Hi All,

On creating HiveContext in spark-shell, fails with

Caused by: ERROR XSDB6: Another instance of Derby may have already booted
the database /SPARK/metastore_db.

Spark-Shell already has created metastore_db for SqlContext.

Spark context available as sc.
SQL context available as sqlContext.

But without HiveContext, I am able to query the data using SQLContext.

scala>  var df =
sqlContext.read.format("com.databricks.spark.csv").option("header",
"true").option("inferSchema", "true").load("/SPARK/abc")
df: org.apache.spark.sql.DataFrame = [Prabhu: string, Joseph: string]

So is there any real need for HiveContext inside spark-shell? Is everything
that can be done with HiveContext achievable with SQLContext inside
spark-shell?



Thanks,
Prabhu Joseph


Re: [External] Re: Spark 1.6.0 HiveContext NPE

2016-02-08 Thread Shipper, Jay [USA]
I looked back into this today.  I made some changes last week to the 
application to allow for not only compatibility with Spark 1.5.2, but also 
backwards compatibility with Spark 1.4.1 (the version our current deployment 
uses).  The changes mostly involved changing dependencies from compile to 
provided scope, while also removing some conflicting dependencies with what’s 
bundled in the Spark assembled JAR, particularly Scala and SLF4J libraries.  
Now, the application works fine with Spark 1.6.0; the NPE is not occurring, no 
patch necessary.  So unfortunately, I won’t be able to help determine the root 
cause, as I cannot replicate this issue.

Thanks for your help.

From: Ted Yu <yuzhih...@gmail.com<mailto:yuzhih...@gmail.com>>
Date: Friday, February 5, 2016 at 5:40 PM
To: Jay Shipper <shipper_...@bah.com<mailto:shipper_...@bah.com>>
Cc: "user@spark.apache.org<mailto:user@spark.apache.org>" 
<user@spark.apache.org<mailto:user@spark.apache.org>>
Subject: [External] Re: Spark 1.6.0 HiveContext NPE

Was there any other exception(s) in the client log ?

Just want to find the cause for this NPE.

Thanks

On Wed, Feb 3, 2016 at 8:33 AM, Shipper, Jay [USA] 
<shipper_...@bah.com<mailto:shipper_...@bah.com>> wrote:
I’m upgrading an application from Spark 1.4.1 to Spark 1.6.0, and I’m getting a 
NullPointerException from HiveContext.  It’s happening while it tries to load 
some tables via JDBC from an external database (not Hive), using 
context.read().jdbc():

—
java.lang.NullPointerException
at org.apache.spark.sql.hive.client.ClientWrapper.conf(ClientWrapper.scala:205)
at 
org.apache.spark.sql.hive.HiveContext.hiveconf$lzycompute(HiveContext.scala:552)
at org.apache.spark.sql.hive.HiveContext.hiveconf(HiveContext.scala:551)
at 
org.apache.spark.sql.hive.HiveContext$$anonfun$configure$1.apply(HiveContext.scala:538)
at 
org.apache.spark.sql.hive.HiveContext$$anonfun$configure$1.apply(HiveContext.scala:537)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:537)
at 
org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:250)
at org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:237)
at org.apache.spark.sql.hive.HiveContext$$anon$2.<init>(HiveContext.scala:457)
at 
org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:457)
at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:456)
at org.apache.spark.sql.hive.HiveContext$$anon$3.<init>(HiveContext.scala:473)
at 
org.apache.spark.sql.hive.HiveContext.analyzer$lzycompute(HiveContext.scala:473)
at org.apache.spark.sql.hive.HiveContext.analyzer(HiveContext.scala:472)
at 
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34)
at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:133)
at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
at org.apache.spark.sql.SQLContext.baseRelationToDataFrame(SQLContext.scala:442)
at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:223)
at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:146)
—

Even though the application is not using Hive, HiveContext is used instead of 
SQLContext, for the additional functionality it provides.  There’s no 
hive-site.xml for the application, but this did not cause an issue for Spark 
1.4.1.

Does anyone have an idea about what’s changed from 1.4.1 to 1.6.0 that could 
explain this NPE?  The only obvious change I’ve noticed for HiveContext is that 
the default warehouse location is different (1.4.1 - current directory, 1.6.0 - 
/user/hive/warehouse), but I verified that this NPE happens even when 
/user/hive/warehouse exists and is readable/writeable for the application.  In 
terms of changes to the application to work with Spark 1.6.0, the only one that 
might be relevant to this issue is the upgrade in the Hadoop dependencies to 
match what Spark 1.6.0 uses (2.6.0-cdh5.7.0-SNAPSHOT).

Thanks,
Jay



Re: Spark 1.6.0 HiveContext NPE

2016-02-05 Thread Ted Yu
Was there any other exception(s) in the client log ?

Just want to find the cause for this NPE.

Thanks

On Wed, Feb 3, 2016 at 8:33 AM, Shipper, Jay [USA] <shipper_...@bah.com>
wrote:

> I’m upgrading an application from Spark 1.4.1 to Spark 1.6.0, and I’m
> getting a NullPointerException from HiveContext.  It’s happening while it
> tries to load some tables via JDBC from an external database (not Hive),
> using context.read().jdbc():
>
> —
> java.lang.NullPointerException
> at
> org.apache.spark.sql.hive.client.ClientWrapper.conf(ClientWrapper.scala:205)
> at
> org.apache.spark.sql.hive.HiveContext.hiveconf$lzycompute(HiveContext.scala:552)
> at org.apache.spark.sql.hive.HiveContext.hiveconf(HiveContext.scala:551)
> at
> org.apache.spark.sql.hive.HiveContext$$anonfun$configure$1.apply(HiveContext.scala:538)
> at
> org.apache.spark.sql.hive.HiveContext$$anonfun$configure$1.apply(HiveContext.scala:537)
> at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:537)
> at
> org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:250)
> at
> org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:237)
> at
> org.apache.spark.sql.hive.HiveContext$$anon$2.(HiveContext.scala:457)
> at
> org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:457)
> at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:456)
> at
> org.apache.spark.sql.hive.HiveContext$$anon$3.(HiveContext.scala:473)
> at
> org.apache.spark.sql.hive.HiveContext.analyzer$lzycompute(HiveContext.scala:473)
> at org.apache.spark.sql.hive.HiveContext.analyzer(HiveContext.scala:472)
> at
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34)
> at org.apache.spark.sql.DataFrame.(DataFrame.scala:133)
> at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
> at
> org.apache.spark.sql.SQLContext.baseRelationToDataFrame(SQLContext.scala:442)
> at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:223)
> at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:146)
> —
>
> Even though the application is not using Hive, HiveContext is used instead
> of SQLContext, for the additional functionality it provides.  There’s no
> hive-site.xml for the application, but this did not cause an issue for
> Spark 1.4.1.
>
> Does anyone have an idea about what’s changed from 1.4.1 to 1.6.0 that
> could explain this NPE?  The only obvious change I’ve noticed for
> HiveContext is that the default warehouse location is different (1.4.1 -
> current directory, 1.6.0 - /user/hive/warehouse), but I verified that this
> NPE happens even when /user/hive/warehouse exists and is readable/writeable
> for the application.  In terms of changes to the application to work with
> Spark 1.6.0, the only one that might be relevant to this issue is the
> upgrade in the Hadoop dependencies to match what Spark 1.6.0 uses
> (2.6.0-cdh5.7.0-SNAPSHOT).
>
> Thanks,
> Jay
>


Re: [External] Re: Spark 1.6.0 HiveContext NPE

2016-02-04 Thread Ted Yu
Jay:
It would be nice if you can patch Spark with below PR and give it a try.

Thanks

On Wed, Feb 3, 2016 at 6:03 PM, Ted Yu <yuzhih...@gmail.com> wrote:

> Created a pull request:
> https://github.com/apache/spark/pull/11066
>
> FYI
>
> On Wed, Feb 3, 2016 at 1:27 PM, Shipper, Jay [USA] <shipper_...@bah.com>
> wrote:
>
>> It was just renamed recently: https://github.com/apache/spark/pull/10981
>>
>> As SessionState is entirely managed by Spark’s code, it still seems like
>> this is a bug with Spark 1.6.0, and not with how our application is using
>> HiveContext.  But I’d feel more confident filing a bug if someone else
>> could confirm they’re having this issue with Spark 1.6.0.  Ideally, we
>> should also have some simple proof of concept that can be posted with the
>> bug.
>>
>> From: Ted Yu <yuzhih...@gmail.com>
>> Date: Wednesday, February 3, 2016 at 3:57 PM
>> To: Jay Shipper <shipper_...@bah.com>
>> Cc: "user@spark.apache.org" <user@spark.apache.org>
>> Subject: Re: [External] Re: Spark 1.6.0 HiveContext NPE
>>
>> In ClientWrapper.scala, the SessionState.get().getConf call might have
>> been executed ahead of SessionState.start(state) at line 194.
>>
>> This was the JIRA:
>>
>> [SPARK-10810] [SPARK-10902] [SQL] Improve session management in SQL
>>
>> In master branch, there is no more ClientWrapper.scala
>>
>> FYI
>>
>> On Wed, Feb 3, 2016 at 11:15 AM, Shipper, Jay [USA] <shipper_...@bah.com>
>> wrote:
>>
>>> One quick update on this: The NPE is not happening with Spark 1.5.2, so
>>> this problem seems specific to Spark 1.6.0.
>>>
>>> From: Jay Shipper <shipper_...@bah.com>
>>> Date: Wednesday, February 3, 2016 at 12:06 PM
>>> To: "user@spark.apache.org" <user@spark.apache.org>
>>> Subject: Re: [External] Re: Spark 1.6.0 HiveContext NPE
>>>
>>> Right, I could already tell that from the stack trace and looking at
>>> Spark’s code.  What I’m trying to determine is why that’s coming back as
>>> null now, just from upgrading Spark to 1.6.0.
>>>
>>> From: Ted Yu <yuzhih...@gmail.com>
>>> Date: Wednesday, February 3, 2016 at 12:04 PM
>>> To: Jay Shipper <shipper_...@bah.com>
>>> Cc: "user@spark.apache.org" <user@spark.apache.org>
>>> Subject: [External] Re: Spark 1.6.0 HiveContext NPE
>>>
>>> Looks like the NPE came from this line:
>>>   def conf: HiveConf = SessionState.get().getConf
>>>
>>> Meaning SessionState.get() returned null.
>>>
>>> On Wed, Feb 3, 2016 at 8:33 AM, Shipper, Jay [USA] <shipper_...@bah.com>
>>> wrote:
>>>
>>>> I’m upgrading an application from Spark 1.4.1 to Spark 1.6.0, and I’m
>>>> getting a NullPointerException from HiveContext.  It’s happening while it
>>>> tries to load some tables via JDBC from an external database (not Hive),
>>>> using context.read().jdbc():
>>>>
>>>> —
>>>> java.lang.NullPointerException
>>>> at
>>>> org.apache.spark.sql.hive.client.ClientWrapper.conf(ClientWrapper.scala:205)
>>>> at
>>>> org.apache.spark.sql.hive.HiveContext.hiveconf$lzycompute(HiveContext.scala:552)
>>>> at org.apache.spark.sql.hive.HiveContext.hiveconf(HiveContext.scala:551)
>>>> at
>>>> org.apache.spark.sql.hive.HiveContext$$anonfun$configure$1.apply(HiveContext.scala:538)
>>>> at
>>>> org.apache.spark.sql.hive.HiveContext$$anonfun$configure$1.apply(HiveContext.scala:537)
>>>> at
>>>> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>>>> at
>>>> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>>>> at scala.collection.immutable.List.foreach(List.scala:318)
>>>> at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>>>> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>>>> at
>>>> org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:537)
>>>> at
>>>> org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:250)
>>>> at
>>>> org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:237)
>>>> at
>>>> org.apache.spark.sql.hive.HiveContext$$anon$2.(HiveContext.scala:457)
>>>> at
>>>> org.apache.spark.sql.hive.HiveContext.catalog$lzy

Re: [External] Re: Spark 1.6.0 HiveContext NPE

2016-02-03 Thread Shipper, Jay [USA]
It was just renamed recently: https://github.com/apache/spark/pull/10981

As SessionState is entirely managed by Spark’s code, it still seems like this 
is a bug with Spark 1.6.0, and not with how our application is using 
HiveContext.  But I’d feel more confident filing a bug if someone else could 
confirm they’re having this issue with Spark 1.6.0.  Ideally, we should also 
have some simple proof of concept that can be posted with the bug.

From: Ted Yu <yuzhih...@gmail.com<mailto:yuzhih...@gmail.com>>
Date: Wednesday, February 3, 2016 at 3:57 PM
To: Jay Shipper <shipper_...@bah.com<mailto:shipper_...@bah.com>>
Cc: "user@spark.apache.org<mailto:user@spark.apache.org>" 
<user@spark.apache.org<mailto:user@spark.apache.org>>
Subject: Re: [External] Re: Spark 1.6.0 HiveContext NPE

In ClientWrapper.scala, the SessionState.get().getConf call might have been 
executed ahead of SessionState.start(state) at line 194.

This was the JIRA:

[SPARK-10810] [SPARK-10902] [SQL] Improve session management in SQL

In master branch, there is no more ClientWrapper.scala

FYI

On Wed, Feb 3, 2016 at 11:15 AM, Shipper, Jay [USA] 
<shipper_...@bah.com<mailto:shipper_...@bah.com>> wrote:
One quick update on this: The NPE is not happening with Spark 1.5.2, so this 
problem seems specific to Spark 1.6.0.

From: Jay Shipper <shipper_...@bah.com<mailto:shipper_...@bah.com>>
Date: Wednesday, February 3, 2016 at 12:06 PM
To: "user@spark.apache.org<mailto:user@spark.apache.org>" 
<user@spark.apache.org<mailto:user@spark.apache.org>>
Subject: Re: [External] Re: Spark 1.6.0 HiveContext NPE

Right, I could already tell that from the stack trace and looking at Spark’s 
code.  What I’m trying to determine is why that’s coming back as null now, just 
from upgrading Spark to 1.6.0.

From: Ted Yu <yuzhih...@gmail.com<mailto:yuzhih...@gmail.com>>
Date: Wednesday, February 3, 2016 at 12:04 PM
To: Jay Shipper <shipper_...@bah.com<mailto:shipper_...@bah.com>>
Cc: "user@spark.apache.org<mailto:user@spark.apache.org>" 
<user@spark.apache.org<mailto:user@spark.apache.org>>
Subject: [External] Re: Spark 1.6.0 HiveContext NPE

Looks like the NPE came from this line:
  def conf: HiveConf = SessionState.get().getConf

Meaning SessionState.get() returned null.

On Wed, Feb 3, 2016 at 8:33 AM, Shipper, Jay [USA] 
<shipper_...@bah.com<mailto:shipper_...@bah.com>> wrote:
I’m upgrading an application from Spark 1.4.1 to Spark 1.6.0, and I’m getting a 
NullPointerException from HiveContext.  It’s happening while it tries to load 
some tables via JDBC from an external database (not Hive), using 
context.read().jdbc():

—
java.lang.NullPointerException
at org.apache.spark.sql.hive.client.ClientWrapper.conf(ClientWrapper.scala:205)
at 
org.apache.spark.sql.hive.HiveContext.hiveconf$lzycompute(HiveContext.scala:552)
at org.apache.spark.sql.hive.HiveContext.hiveconf(HiveContext.scala:551)
at 
org.apache.spark.sql.hive.HiveContext$$anonfun$configure$1.apply(HiveContext.scala:538)
at 
org.apache.spark.sql.hive.HiveContext$$anonfun$configure$1.apply(HiveContext.scala:537)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:537)
at 
org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:250)
at org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:237)
at org.apache.spark.sql.hive.HiveContext$$anon$2.(HiveContext.scala:457)
at 
org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:457)
at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:456)
at org.apache.spark.sql.hive.HiveContext$$anon$3.(HiveContext.scala:473)
at 
org.apache.spark.sql.hive.HiveContext.analyzer$lzycompute(HiveContext.scala:473)
at org.apache.spark.sql.hive.HiveContext.analyzer(HiveContext.scala:472)
at 
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34)
at org.apache.spark.sql.DataFrame.(DataFrame.scala:133)
at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
at org.apache.spark.sql.SQLContext.baseRelationToDataFrame(SQLContext.scala:442)
at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:223)
at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:146)
—

Even though the application is not using Hive, HiveContext is used instead of 
SQLContext, for the additional functionality it provides.  There’s no 
hive-site.xml for the application, but this did not cause an issue for Spark 
1.4.1.

Does anyone have an i

Re: [External] Re: Spark 1.6.0 HiveContext NPE

2016-02-03 Thread Ted Yu
In ClientWrapper.scala, the SessionState.get().getConf call might have been
executed ahead of SessionState.start(state) at line 194.

This was the JIRA:

[SPARK-10810] [SPARK-10902] [SQL] Improve session management in SQL

In master branch, there is no more ClientWrapper.scala

FYI

On Wed, Feb 3, 2016 at 11:15 AM, Shipper, Jay [USA] <shipper_...@bah.com>
wrote:

> One quick update on this: The NPE is not happening with Spark 1.5.2, so
> this problem seems specific to Spark 1.6.0.
>
> From: Jay Shipper <shipper_...@bah.com>
> Date: Wednesday, February 3, 2016 at 12:06 PM
> To: "user@spark.apache.org" <user@spark.apache.org>
> Subject: Re: [External] Re: Spark 1.6.0 HiveContext NPE
>
> Right, I could already tell that from the stack trace and looking at
> Spark’s code.  What I’m trying to determine is why that’s coming back as
> null now, just from upgrading Spark to 1.6.0.
>
> From: Ted Yu <yuzhih...@gmail.com>
> Date: Wednesday, February 3, 2016 at 12:04 PM
> To: Jay Shipper <shipper_...@bah.com>
> Cc: "user@spark.apache.org" <user@spark.apache.org>
> Subject: [External] Re: Spark 1.6.0 HiveContext NPE
>
> Looks like the NPE came from this line:
>   def conf: HiveConf = SessionState.get().getConf
>
> Meaning SessionState.get() returned null.
>
> On Wed, Feb 3, 2016 at 8:33 AM, Shipper, Jay [USA] <shipper_...@bah.com>
> wrote:
>
>> I’m upgrading an application from Spark 1.4.1 to Spark 1.6.0, and I’m
>> getting a NullPointerException from HiveContext.  It’s happening while it
>> tries to load some tables via JDBC from an external database (not Hive),
>> using context.read().jdbc():
>>
>> —
>> java.lang.NullPointerException
>> at
>> org.apache.spark.sql.hive.client.ClientWrapper.conf(ClientWrapper.scala:205)
>> at
>> org.apache.spark.sql.hive.HiveContext.hiveconf$lzycompute(HiveContext.scala:552)
>> at org.apache.spark.sql.hive.HiveContext.hiveconf(HiveContext.scala:551)
>> at
>> org.apache.spark.sql.hive.HiveContext$$anonfun$configure$1.apply(HiveContext.scala:538)
>> at
>> org.apache.spark.sql.hive.HiveContext$$anonfun$configure$1.apply(HiveContext.scala:537)
>> at
>> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>> at
>> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>> at scala.collection.immutable.List.foreach(List.scala:318)
>> at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>> at org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:537)
>> at
>> org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:250)
>> at
>> org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:237)
>> at
>> org.apache.spark.sql.hive.HiveContext$$anon$2.(HiveContext.scala:457)
>> at
>> org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:457)
>> at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:456)
>> at
>> org.apache.spark.sql.hive.HiveContext$$anon$3.(HiveContext.scala:473)
>> at
>> org.apache.spark.sql.hive.HiveContext.analyzer$lzycompute(HiveContext.scala:473)
>> at org.apache.spark.sql.hive.HiveContext.analyzer(HiveContext.scala:472)
>> at
>> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34)
>> at org.apache.spark.sql.DataFrame.(DataFrame.scala:133)
>> at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
>> at
>> org.apache.spark.sql.SQLContext.baseRelationToDataFrame(SQLContext.scala:442)
>> at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:223)
>> at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:146)
>> —
>>
>> Even though the application is not using Hive, HiveContext is used
>> instead of SQLContext, for the additional functionality it provides.
>> There’s no hive-site.xml for the application, but this did not cause an
>> issue for Spark 1.4.1.
>>
>> Does anyone have an idea about what’s changed from 1.4.1 to 1.6.0 that
>> could explain this NPE?  The only obvious change I’ve noticed for
>> HiveContext is that the default warehouse location is different (1.4.1 -
>> current directory, 1.6.0 - /user/hive/warehouse), but I verified that this
>> NPE happens even when /user/hive/warehouse exists and is readable/writeable
>> for the application.  In terms of changes to the application to work with
>> Spark 1.6.0, the only one that might be relevant to this issue is the
>> upgrade in the Hadoop dependencies to match what Spark 1.6.0 uses
>> (2.6.0-cdh5.7.0-SNAPSHOT).
>>
>> Thanks,
>> Jay
>>
>
>


Re: [External] Re: Spark 1.6.0 HiveContext NPE

2016-02-03 Thread Ted Yu
Created a pull request:
https://github.com/apache/spark/pull/11066

FYI

On Wed, Feb 3, 2016 at 1:27 PM, Shipper, Jay [USA] <shipper_...@bah.com>
wrote:

> It was just renamed recently: https://github.com/apache/spark/pull/10981
>
> As SessionState is entirely managed by Spark’s code, it still seems like
> this is a bug with Spark 1.6.0, and not with how our application is using
> HiveContext.  But I’d feel more confident filing a bug if someone else
> could confirm they’re having this issue with Spark 1.6.0.  Ideally, we
> should also have some simple proof of concept that can be posted with the
> bug.
>
> From: Ted Yu <yuzhih...@gmail.com>
> Date: Wednesday, February 3, 2016 at 3:57 PM
> To: Jay Shipper <shipper_...@bah.com>
> Cc: "user@spark.apache.org" <user@spark.apache.org>
> Subject: Re: [External] Re: Spark 1.6.0 HiveContext NPE
>
> In ClientWrapper.scala, the SessionState.get().getConf call might have
> been executed ahead of SessionState.start(state) at line 194.
>
> This was the JIRA:
>
> [SPARK-10810] [SPARK-10902] [SQL] Improve session management in SQL
>
> In master branch, there is no more ClientWrapper.scala
>
> FYI
>
> On Wed, Feb 3, 2016 at 11:15 AM, Shipper, Jay [USA] <shipper_...@bah.com>
> wrote:
>
>> One quick update on this: The NPE is not happening with Spark 1.5.2, so
>> this problem seems specific to Spark 1.6.0.
>>
>> From: Jay Shipper <shipper_...@bah.com>
>> Date: Wednesday, February 3, 2016 at 12:06 PM
>> To: "user@spark.apache.org" <user@spark.apache.org>
>> Subject: Re: [External] Re: Spark 1.6.0 HiveContext NPE
>>
>> Right, I could already tell that from the stack trace and looking at
>> Spark’s code.  What I’m trying to determine is why that’s coming back as
>> null now, just from upgrading Spark to 1.6.0.
>>
>> From: Ted Yu <yuzhih...@gmail.com>
>> Date: Wednesday, February 3, 2016 at 12:04 PM
>> To: Jay Shipper <shipper_...@bah.com>
>> Cc: "user@spark.apache.org" <user@spark.apache.org>
>> Subject: [External] Re: Spark 1.6.0 HiveContext NPE
>>
>> Looks like the NPE came from this line:
>>   def conf: HiveConf = SessionState.get().getConf
>>
>> Meaning SessionState.get() returned null.
>>
>> On Wed, Feb 3, 2016 at 8:33 AM, Shipper, Jay [USA] <shipper_...@bah.com>
>> wrote:
>>
>>> I’m upgrading an application from Spark 1.4.1 to Spark 1.6.0, and I’m
>>> getting a NullPointerException from HiveContext.  It’s happening while it
>>> tries to load some tables via JDBC from an external database (not Hive),
>>> using context.read().jdbc():
>>>
>>> —
>>> java.lang.NullPointerException
>>> at
>>> org.apache.spark.sql.hive.client.ClientWrapper.conf(ClientWrapper.scala:205)
>>> at
>>> org.apache.spark.sql.hive.HiveContext.hiveconf$lzycompute(HiveContext.scala:552)
>>> at org.apache.spark.sql.hive.HiveContext.hiveconf(HiveContext.scala:551)
>>> at
>>> org.apache.spark.sql.hive.HiveContext$$anonfun$configure$1.apply(HiveContext.scala:538)
>>> at
>>> org.apache.spark.sql.hive.HiveContext$$anonfun$configure$1.apply(HiveContext.scala:537)
>>> at
>>> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>>> at
>>> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>>> at scala.collection.immutable.List.foreach(List.scala:318)
>>> at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>>> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>>> at org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:537)
>>> at
>>> org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:250)
>>> at
>>> org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:237)
>>> at
>>> org.apache.spark.sql.hive.HiveContext$$anon$2.(HiveContext.scala:457)
>>> at
>>> org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:457)
>>> at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:456)
>>> at
>>> org.apache.spark.sql.hive.HiveContext$$anon$3.(HiveContext.scala:473)
>>> at
>>> org.apache.spark.sql.hive.HiveContext.analyzer$lzycompute(HiveContext.scala:473)
>>> at org.apache.spark.sql.hive.HiveContext.analyzer(HiveContext.scala:472)
>>> at
>>> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34)
>>> at org.apache.sp

Spark 1.6.0 HiveContext NPE

2016-02-03 Thread Shipper, Jay [USA]
I’m upgrading an application from Spark 1.4.1 to Spark 1.6.0, and I’m getting a 
NullPointerException from HiveContext.  It’s happening while it tries to load 
some tables via JDBC from an external database (not Hive), using 
context.read().jdbc():

—
java.lang.NullPointerException
at org.apache.spark.sql.hive.client.ClientWrapper.conf(ClientWrapper.scala:205)
at 
org.apache.spark.sql.hive.HiveContext.hiveconf$lzycompute(HiveContext.scala:552)
at org.apache.spark.sql.hive.HiveContext.hiveconf(HiveContext.scala:551)
at 
org.apache.spark.sql.hive.HiveContext$$anonfun$configure$1.apply(HiveContext.scala:538)
at 
org.apache.spark.sql.hive.HiveContext$$anonfun$configure$1.apply(HiveContext.scala:537)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:537)
at 
org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:250)
at org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:237)
at org.apache.spark.sql.hive.HiveContext$$anon$2.<init>(HiveContext.scala:457)
at 
org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:457)
at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:456)
at org.apache.spark.sql.hive.HiveContext$$anon$3.<init>(HiveContext.scala:473)
at 
org.apache.spark.sql.hive.HiveContext.analyzer$lzycompute(HiveContext.scala:473)
at org.apache.spark.sql.hive.HiveContext.analyzer(HiveContext.scala:472)
at 
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34)
at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:133)
at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
at org.apache.spark.sql.SQLContext.baseRelationToDataFrame(SQLContext.scala:442)
at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:223)
at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:146)
—

Even though the application is not using Hive, HiveContext is used instead of 
SQLContext, for the additional functionality it provides.  There’s no 
hive-site.xml for the application, but this did not cause an issue for Spark 
1.4.1.

Does anyone have an idea about what’s changed from 1.4.1 to 1.6.0 that could 
explain this NPE?  The only obvious change I’ve noticed for HiveContext is that 
the default warehouse location is different (1.4.1 - current directory, 1.6.0 - 
/user/hive/warehouse), but I verified that this NPE happens even when 
/user/hive/warehouse exists and is readable/writeable for the application.  In 
terms of changes to the application to work with Spark 1.6.0, the only one that 
might be relevant to this issue is the upgrade in the Hadoop dependencies to 
match what Spark 1.6.0 uses (2.6.0-cdh5.7.0-SNAPSHOT).

Thanks,
Jay


Re: [External] Re: Spark 1.6.0 HiveContext NPE

2016-02-03 Thread Shipper, Jay [USA]
Right, I could already tell that from the stack trace and looking at Spark’s 
code.  What I’m trying to determine is why that’s coming back as null now, just 
from upgrading Spark to 1.6.0.

From: Ted Yu <yuzhih...@gmail.com<mailto:yuzhih...@gmail.com>>
Date: Wednesday, February 3, 2016 at 12:04 PM
To: Jay Shipper <shipper_...@bah.com<mailto:shipper_...@bah.com>>
Cc: "user@spark.apache.org<mailto:user@spark.apache.org>" 
<user@spark.apache.org<mailto:user@spark.apache.org>>
Subject: [External] Re: Spark 1.6.0 HiveContext NPE

Looks like the NPE came from this line:
  def conf: HiveConf = SessionState.get().getConf

Meaning SessionState.get() returned null.

On Wed, Feb 3, 2016 at 8:33 AM, Shipper, Jay [USA] 
<shipper_...@bah.com<mailto:shipper_...@bah.com>> wrote:
I’m upgrading an application from Spark 1.4.1 to Spark 1.6.0, and I’m getting a 
NullPointerException from HiveContext.  It’s happening while it tries to load 
some tables via JDBC from an external database (not Hive), using 
context.read().jdbc():

—
java.lang.NullPointerException
at org.apache.spark.sql.hive.client.ClientWrapper.conf(ClientWrapper.scala:205)
at 
org.apache.spark.sql.hive.HiveContext.hiveconf$lzycompute(HiveContext.scala:552)
at org.apache.spark.sql.hive.HiveContext.hiveconf(HiveContext.scala:551)
at 
org.apache.spark.sql.hive.HiveContext$$anonfun$configure$1.apply(HiveContext.scala:538)
at 
org.apache.spark.sql.hive.HiveContext$$anonfun$configure$1.apply(HiveContext.scala:537)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:537)
at 
org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:250)
at org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:237)
at org.apache.spark.sql.hive.HiveContext$$anon$2.(HiveContext.scala:457)
at 
org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:457)
at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:456)
at org.apache.spark.sql.hive.HiveContext$$anon$3.(HiveContext.scala:473)
at 
org.apache.spark.sql.hive.HiveContext.analyzer$lzycompute(HiveContext.scala:473)
at org.apache.spark.sql.hive.HiveContext.analyzer(HiveContext.scala:472)
at 
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34)
at org.apache.spark.sql.DataFrame.(DataFrame.scala:133)
at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
at org.apache.spark.sql.SQLContext.baseRelationToDataFrame(SQLContext.scala:442)
at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:223)
at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:146)
—

Even though the application is not using Hive, HiveContext is used instead of 
SQLContext, for the additional functionality it provides.  There’s no 
hive-site.xml for the application, but this did not cause an issue for Spark 
1.4.1.

Does anyone have an idea about what’s changed from 1.4.1 to 1.6.0 that could 
explain this NPE?  The only obvious change I’ve noticed for HiveContext is that 
the default warehouse location is different (1.4.1 - current directory, 1.6.0 - 
/user/hive/warehouse), but I verified that this NPE happens even when 
/user/hive/warehouse exists and is readable/writeable for the application.  In 
terms of changes to the application to work with Spark 1.6.0, the only one that 
might be relevant to this issue is the upgrade in the Hadoop dependencies to 
match what Spark 1.6.0 uses (2.6.0-cdh5.7.0-SNAPSHOT).

Thanks,
Jay



Re: Spark 1.6.0 HiveContext NPE

2016-02-03 Thread Ted Yu
Looks like the NPE came from this line:
  def conf: HiveConf = SessionState.get().getConf

Meaning SessionState.get() returned null.

On Wed, Feb 3, 2016 at 8:33 AM, Shipper, Jay [USA] <shipper_...@bah.com>
wrote:

> I’m upgrading an application from Spark 1.4.1 to Spark 1.6.0, and I’m
> getting a NullPointerException from HiveContext.  It’s happening while it
> tries to load some tables via JDBC from an external database (not Hive),
> using context.read().jdbc():
>
> —
> java.lang.NullPointerException
> at
> org.apache.spark.sql.hive.client.ClientWrapper.conf(ClientWrapper.scala:205)
> at
> org.apache.spark.sql.hive.HiveContext.hiveconf$lzycompute(HiveContext.scala:552)
> at org.apache.spark.sql.hive.HiveContext.hiveconf(HiveContext.scala:551)
> at
> org.apache.spark.sql.hive.HiveContext$$anonfun$configure$1.apply(HiveContext.scala:538)
> at
> org.apache.spark.sql.hive.HiveContext$$anonfun$configure$1.apply(HiveContext.scala:537)
> at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:537)
> at
> org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:250)
> at
> org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:237)
> at
> org.apache.spark.sql.hive.HiveContext$$anon$2.<init>(HiveContext.scala:457)
> at
> org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:457)
> at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:456)
> at
> org.apache.spark.sql.hive.HiveContext$$anon$3.<init>(HiveContext.scala:473)
> at
> org.apache.spark.sql.hive.HiveContext.analyzer$lzycompute(HiveContext.scala:473)
> at org.apache.spark.sql.hive.HiveContext.analyzer(HiveContext.scala:472)
> at
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34)
> at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:133)
> at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
> at
> org.apache.spark.sql.SQLContext.baseRelationToDataFrame(SQLContext.scala:442)
> at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:223)
> at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:146)
> —
>
> Even though the application is not using Hive, HiveContext is used instead
> of SQLContext, for the additional functionality it provides.  There’s no
> hive-site.xml for the application, but this did not cause an issue for
> Spark 1.4.1.
>
> Does anyone have an idea about what’s changed from 1.4.1 to 1.6.0 that
> could explain this NPE?  The only obvious change I’ve noticed for
> HiveContext is that the default warehouse location is different (1.4.1 -
> current directory, 1.6.0 - /user/hive/warehouse), but I verified that this
> NPE happens even when /user/hive/warehouse exists and is readable/writeable
> for the application.  In terms of changes to the application to work with
> Spark 1.6.0, the only one that might be relevant to this issue is the
> upgrade in the Hadoop dependencies to match what Spark 1.6.0 uses
> (2.6.0-cdh5.7.0-SNAPSHOT).
>
> Thanks,
> Jay
>


Re: [External] Re: Spark 1.6.0 HiveContext NPE

2016-02-03 Thread Shipper, Jay [USA]
One quick update on this: The NPE is not happening with Spark 1.5.2, so this 
problem seems specific to Spark 1.6.0.

From: Jay Shipper <shipper_...@bah.com<mailto:shipper_...@bah.com>>
Date: Wednesday, February 3, 2016 at 12:06 PM
To: "user@spark.apache.org<mailto:user@spark.apache.org>" 
<user@spark.apache.org<mailto:user@spark.apache.org>>
Subject: Re: [External] Re: Spark 1.6.0 HiveContext NPE

Right, I could already tell that from the stack trace and looking at Spark’s 
code.  What I’m trying to determine is why that’s coming back as null now, just 
from upgrading Spark to 1.6.0.

From: Ted Yu <yuzhih...@gmail.com<mailto:yuzhih...@gmail.com>>
Date: Wednesday, February 3, 2016 at 12:04 PM
To: Jay Shipper <shipper_...@bah.com<mailto:shipper_...@bah.com>>
Cc: "user@spark.apache.org<mailto:user@spark.apache.org>" 
<user@spark.apache.org<mailto:user@spark.apache.org>>
Subject: [External] Re: Spark 1.6.0 HiveContext NPE

Looks like the NPE came from this line:
  def conf: HiveConf = SessionState.get().getConf

Meaning SessionState.get() returned null.

On Wed, Feb 3, 2016 at 8:33 AM, Shipper, Jay [USA] 
<shipper_...@bah.com<mailto:shipper_...@bah.com>> wrote:
I’m upgrading an application from Spark 1.4.1 to Spark 1.6.0, and I’m getting a 
NullPointerException from HiveContext.  It’s happening while it tries to load 
some tables via JDBC from an external database (not Hive), using 
context.read().jdbc():

—
java.lang.NullPointerException
at org.apache.spark.sql.hive.client.ClientWrapper.conf(ClientWrapper.scala:205)
at 
org.apache.spark.sql.hive.HiveContext.hiveconf$lzycompute(HiveContext.scala:552)
at org.apache.spark.sql.hive.HiveContext.hiveconf(HiveContext.scala:551)
at 
org.apache.spark.sql.hive.HiveContext$$anonfun$configure$1.apply(HiveContext.scala:538)
at 
org.apache.spark.sql.hive.HiveContext$$anonfun$configure$1.apply(HiveContext.scala:537)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:537)
at 
org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:250)
at org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:237)
at org.apache.spark.sql.hive.HiveContext$$anon$2.<init>(HiveContext.scala:457)
at 
org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:457)
at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:456)
at org.apache.spark.sql.hive.HiveContext$$anon$3.<init>(HiveContext.scala:473)
at 
org.apache.spark.sql.hive.HiveContext.analyzer$lzycompute(HiveContext.scala:473)
at org.apache.spark.sql.hive.HiveContext.analyzer(HiveContext.scala:472)
at 
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34)
at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:133)
at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
at org.apache.spark.sql.SQLContext.baseRelationToDataFrame(SQLContext.scala:442)
at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:223)
at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:146)
—

Even though the application is not using Hive, HiveContext is used instead of 
SQLContext, for the additional functionality it provides.  There’s no 
hive-site.xml for the application, but this did not cause an issue for Spark 
1.4.1.

Does anyone have an idea about what’s changed from 1.4.1 to 1.6.0 that could 
explain this NPE?  The only obvious change I’ve noticed for HiveContext is that 
the default warehouse location is different (1.4.1 - current directory, 1.6.0 - 
/user/hive/warehouse), but I verified that this NPE happens even when 
/user/hive/warehouse exists and is readable/writeable for the application.  In 
terms of changes to the application to work with Spark 1.6.0, the only one that 
might be relevant to this issue is the upgrade in the Hadoop dependencies to 
match what Spark 1.6.0 uses (2.6.0-cdh5.7.0-SNAPSHOT).

Thanks,
Jay
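
[Editor's note] For anyone trying to reproduce, a minimal sketch of the pattern Jay describes; the JDBC URL, table name and credentials are hypothetical placeholders, not taken from the report.

import java.util.Properties
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("HiveContextNpeRepro"))
val hiveContext = new HiveContext(sc)       // no hive-site.xml on the classpath

val props = new Properties()
props.setProperty("user", "app_user")       // hypothetical
props.setProperty("password", "app_pass")   // hypothetical

// Reading from an external (non-Hive) database; per the stack trace above,
// on 1.6.0 this path forces HiveContext to initialise its metadata client,
// which is where the NPE surfaces.
val df = hiveContext.read.jdbc("jdbc:postgresql://dbhost:5432/appdb", "some_table", props)
df.show()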



Sharing HiveContext in Spark JobServer / getOrCreate

2016-01-25 Thread Deenar Toraskar
Hi

I am using a shared sparkContext for all of my Spark jobs. Some of the jobs
use HiveContext, but there isn't a getOrCreate method on HiveContext which
will allow reuse of an existing HiveContext. Such a method exists on
SQLContext only (def getOrCreate(sparkContext: SparkContext): SQLContext).

Is there any reason that a HiveContext cannot be shared amongst multiple
threads within the same Spark driver process?

In addition, I cannot seem to cast a HiveContext to a SQLContext, although
this works fine in the spark shell. Am I doing something wrong here?

scala> sqlContext

res19: org.apache.spark.sql.SQLContext =
org.apache.spark.sql.hive.HiveContext@383b3357

scala> import org.apache.spark.sql.SQLContext

import org.apache.spark.sql.SQLContext

scala> SQLContext.getOrCreate(sc)

res18: org.apache.spark.sql.SQLContext =
org.apache.spark.sql.hive.HiveContext@383b3357



Regards
Deenar
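
[Editor's note] A minimal sketch of a hand-rolled getOrCreate-style helper, assuming a single driver-wide HiveContext is acceptable; the object and method names are illustrative, not part of any Spark API.

import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext

object HiveContextFactory {
  @volatile private var instance: HiveContext = _

  // Create one HiveContext per driver process the first time it is requested,
  // then hand the same instance back to every subsequent job, mirroring
  // SQLContext.getOrCreate.
  def getOrCreate(sc: SparkContext): HiveContext = {
    if (instance == null) {
      synchronized {
        if (instance == null) {
          instance = new HiveContext(sc)
        }
      }
    }
    instance
  }
}

Jobs would then call HiveContextFactory.getOrCreate(sc) instead of new HiveContext(sc), so temporary tables registered by one job stay visible to the next.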


Re: Sharing HiveContext in Spark JobServer / getOrCreate

2016-01-25 Thread Ted Yu
Have you noticed the following method of HiveContext ?

   * Returns a new HiveContext as new session, which will have separated
SQLConf, UDF/UDAF,
   * temporary tables and SessionState, but sharing the same CacheManager,
IsolatedClientLoader
   * and Hive client (both of execution and metadata) with existing
HiveContext.
   */
  override def newSession(): HiveContext = {

Cheers

On Mon, Jan 25, 2016 at 7:22 AM, Deenar Toraskar <deenar.toras...@gmail.com>
wrote:

> Hi
>
> I am using a shared sparkContext for all of my Spark jobs. Some of the
> jobs use HiveContext, but there isn't a getOrCreate method on HiveContext
> which will allow reuse of an existing HiveContext. Such a method exists on
> SQLContext only (def getOrCreate(sparkContext: SparkContext): SQLContext).
>
> Is there any reason that a HiveContext cannot be shared amongst multiple
> threads within the same Spark driver process?
>
> In addition, I cannot seem to cast a HiveContext to a SQLContext, although
> this works fine in the spark shell. Am I doing something wrong here?
>
> scala> sqlContext
>
> res19: org.apache.spark.sql.SQLContext =
> org.apache.spark.sql.hive.HiveContext@383b3357
>
> scala> import org.apache.spark.sql.SQLContext
>
> import org.apache.spark.sql.SQLContext
>
> scala> SQLContext.getOrCreate(sc)
>
> res18: org.apache.spark.sql.SQLContext =
> org.apache.spark.sql.hive.HiveContext@383b3357
>
>
>
> Regards
> Deenar
>
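
[Editor's note] Building on Ted's pointer, a hedged sketch of how newSession() contrasts with sharing the parent context directly, assuming an existing SparkContext sc; the temp table name is illustrative.

import org.apache.spark.sql.hive.HiveContext

val shared = new HiveContext(sc)            // one Hive client and CacheManager

// Jobs that must see each other's temporary tables use the shared context:
shared.range(0, 3).registerTempTable("shared_tmp")
shared.sql("select count(*) from shared_tmp").show()

// Jobs that need isolation get their own session: a separate SQLConf and
// separate temporary tables, but the same underlying Hive client and cache.
val isolated = shared.newSession()
// isolated.sql("select * from shared_tmp")  // fails: temp table not visible in this session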


Re: Sharing HiveContext in Spark JobServer / getOrCreate

2016-01-25 Thread Deenar Toraskar
On 25 January 2016 at 21:09, Deenar Toraskar <
deenar.toras...@thinkreactive.co.uk> wrote:

> No I hadn't. This is useful, but in some cases we do want to share the
> same temporary tables between jobs, so I really wanted a getOrCreate
> equivalent on HiveContext.
>
> Deenar
>
>
>
> On 25 January 2016 at 18:10, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> Have you noticed the following method of HiveContext ?
>>
>>* Returns a new HiveContext as new session, which will have separated
>> SQLConf, UDF/UDAF,
>>* temporary tables and SessionState, but sharing the same
>> CacheManager, IsolatedClientLoader
>>* and Hive client (both of execution and metadata) with existing
>> HiveContext.
>>*/
>>   override def newSession(): HiveContext = {
>>
>> Cheers
>>
>> On Mon, Jan 25, 2016 at 7:22 AM, Deenar Toraskar <
>> deenar.toras...@gmail.com> wrote:
>>
>>> Hi
>>>
>>> I am using a shared sparkContext for all of my Spark jobs. Some of the
>>> jobs use HiveContext, but there isn't a getOrCreate method on HiveContext
>>> which will allow reuse of an existing HiveContext. Such a method exists on
>>> SQLContext only (def getOrCreate(sparkContext: SparkContext):
>>> SQLContext).
>>>
>>> Is there any reason that a HiveContext cannot be shared amongst multiple
>>> threads within the same Spark driver process?
>>>
>>> In addition, I cannot seem to cast a HiveContext to a SQLContext,
>>> although this works fine in the spark shell. Am I doing something wrong
>>> here?
>>>
>>> scala> sqlContext
>>>
>>> res19: org.apache.spark.sql.SQLContext =
>>> org.apache.spark.sql.hive.HiveContext@383b3357
>>>
>>> scala> import org.apache.spark.sql.SQLContext
>>>
>>> import org.apache.spark.sql.SQLContext
>>>
>>> scala> SQLContext.getOrCreate(sc)
>>>
>>> res18: org.apache.spark.sql.SQLContext =
>>> org.apache.spark.sql.hive.HiveContext@383b3357
>>>
>>>
>>>
>>> Regards
>>> Deenar
>>>
>>
>>
>


How HiveContext can read subdirectories

2016-01-07 Thread Arkadiusz Bicz
Hi,

Can Spark read sub-directories through HiveContext external tables?

Example:

import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql._

import sqlContext.implicits._

//prepare data and create subdirectories with parquet
val df = Seq("id1" -> 1, "id2" -> 4, "id3"-> 5).toDF("id", "value")
df.write.parquet("/tmp/df/1")
val df2 = Seq("id6"-> 6, "id7"-> 7, "id8"-> 8).toDF("id", "value")
df2.write.parquet("/tmp/df/2")
val dfall = sqlContext.read.load("/tmp/df/*/")
assert(dfall.count == 6)

//convert to HiveContext
val hc = new HiveContext(sqlContext.sparkContext)

hc.sql("SET hive.mapred.supports.subdirectories=true")
hc.sql("SET mapreduce.input.fileinputformat.input.dir.recursive=true")

hc.sql("create external table testsubdirectories (id string, value
string) STORED AS PARQUET location '/tmp/df'")

val hcall = hc.sql("select * from testsubdirectories")

assert(hcall.count() == 6)  // should return 6, but it is 0 because rows in subdirectories are not read

Thanks,

Arkadiusz Bicz
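
[Editor's note] Not an answer to the recursive-scan settings themselves, but a hedged sketch of one workaround for the layout above: model each subdirectory as a partition of the external table and register the partitions explicitly. The table and partition names are illustrative, and value is declared int to match the parquet files written above.

// Declare the partition column, then point each partition at a subdirectory.
hc.sql("""create external table testsubdirs_part (id string, value int)
          partitioned by (part string)
          stored as parquet
          location '/tmp/df'""")

hc.sql("alter table testsubdirs_part add partition (part='1') location '/tmp/df/1'")
hc.sql("alter table testsubdirs_part add partition (part='2') location '/tmp/df/2'")

val partitioned = hc.sql("select id, value from testsubdirs_part")
assert(partitioned.count() == 6)   // rows from both subdirectories are now visible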



