Hi Bryan,

For your use case you don't need multiple metastores. The default metastore uses embedded Derby
<https://cwiki.apache.org/confluence/display/Hive/AdminManual+MetastoreAdmin#AdminManualMetastoreAdmin-Local/EmbeddedMetastoreDatabase(Derby)>,
which cannot be shared amongst multiple processes. Just switch to a metastore that supports
multiple connections, viz. networked Derby or MySQL. See
https://cwiki.apache.org/confluence/display/Hive/HiveDerbyServerMode
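For example, with networked Derby the metastore connection in hive-site.xml would look
something like this (host, port and database name are illustrative; for MySQL you would
use a jdbc:mysql:// URL and com.mysql.jdbc.Driver instead):

    <!-- illustrative host/port; requires a running Derby network server -->
    <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:derby://localhost:1527/metastore_db;create=true</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionDriverName</name>
      <value>org.apache.derby.jdbc.ClientDriver</value>
    </property>

The Derby network server itself has to be running first (e.g. via startNetworkServer from
the Derby distribution) so that the Thrift server and the streaming driver can both connect
to the same metastore.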
Deenar
*Think Reactive Ltd*
[email protected]
07714140812

On 29 October 2015 at 00:56, Bryan <[email protected]> wrote:

> Yana,
>
> My basic use-case is that I want to process streaming data, and publish it
> to a persistent Spark table. After that I want to make the published data
> (results) available via JDBC and Spark SQL to drive a web API. That would
> seem to require two drivers starting separate HiveContexts (one for Spark
> SQL/JDBC, one for streaming).
>
> Is there a way to share a HiveContext between the driver for the Thrift
> Spark SQL instance and the streaming Spark driver? A better method to do
> this?
>
> An alternate option might be to create the table in two separate
> metastores and simply use the same storage location for the data. That
> seems very hacky though, and likely to result in maintenance issues.
>
> Regards,
>
> Bryan Jeffrey
> ------------------------------
> From: Yana Kadiyska <[email protected]>
> Sent: 10/28/2015 8:32 PM
> To: Bryan Jeffrey <[email protected]>
> Cc: Susan Zhang <[email protected]>; user <[email protected]>
> Subject: Re: Spark -- Writing to Partitioned Persistent Table
>
> For this issue in particular (ERROR XSDB6: Another instance of Derby may
> have already booted the database /spark/spark-1.4.1/metastore_db) -- I
> think it depends on where you start your application and the
> HiveThriftServer from. I've run into a similar issue running a driver app
> first, which would create a directory called metastore_db. If I then try
> to start spark-shell from the same directory, I will see this exception.
> So it is like SPARK-9776: it's not so much that the two are in the same
> process (as the bug resolution states); I think you can't run two drivers
> which start a HiveContext from the same directory.
>
>
> On Wed, Oct 28, 2015 at 4:10 PM, Bryan Jeffrey <[email protected]>
> wrote:
>
>> All,
>>
>> One issue I'm seeing is that I start the Thrift server (for JDBC access)
>> via the following:
>>
>>   /spark/spark-1.4.1/sbin/start-thriftserver.sh --master
>>     spark://master:7077 --hiveconf "spark.cores.max=2"
>>
>> After about 40 seconds the Thrift server is started and available on
>> default port 10000.
>>
>> I then submit my application - and the application throws the following
>> error:
>>
>> Caused by: java.sql.SQLException: Failed to start database 'metastore_db'
>> with class loader
>> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@6a552721,
>> see the next exception for details.
>>   at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
>>   at org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown Source)
>>   ... 86 more
>> Caused by: java.sql.SQLException: Another instance of Derby may have
>> already booted the database /spark/spark-1.4.1/metastore_db.
>>   at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
>>   at org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown Source)
>>   at org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown Source)
>>   at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown Source)
>>   ... 83 more
>> Caused by: ERROR XSDB6: Another instance of Derby may have already booted
>> the database /spark/spark-1.4.1/metastore_db.
>>
>> This also happens if I do the opposite (submit the application first, and
>> then start the Thrift server).
>>
>> It looks similar to the following issue -- but not quite the same:
>> https://issues.apache.org/jira/browse/SPARK-9776
>>
>> It seems like this set of steps works fine if the metadata database is
>> not yet created - but once it's created this happens every time. Is this
>> a known issue? Is there a workaround?
>>
>> Regards,
>>
>> Bryan Jeffrey
>>
>> On Wed, Oct 28, 2015 at 3:13 PM, Bryan Jeffrey <[email protected]>
>> wrote:
>>
>>> Susan,
>>>
>>> I did give that a shot -- I'm seeing a number of oddities:
>>>
>>> (1) 'Partition By' appears to only accept lower-case alphanumeric field
>>> names. It will work for 'machinename', but not 'machineName' or
>>> 'machine_name'.
>>> (2) When partitioning with maps included in the data I get odd string
>>> conversion issues.
>>> (3) When partitioning without maps I see frequent out-of-memory issues.
>>>
>>> I'll update this email when I've got a more concrete example of the
>>> problems.
>>>
>>> Regards,
>>>
>>> Bryan Jeffrey
>>>
>>> On Wed, Oct 28, 2015 at 1:33 PM, Susan Zhang <[email protected]>
>>> wrote:
>>>
>>>> Have you tried partitionBy?
>>>>
>>>> Something like:
>>>>
>>>> hiveWindowsEvents.foreachRDD( rdd => {
>>>>   val eventsDataFrame = rdd.toDF()
>>>>   eventsDataFrame.write.mode(SaveMode.Append)
>>>>     .partitionBy("windows_event_time_bin")
>>>>     .saveAsTable("windows_event")
>>>> })
>>>>
>>>> On Wed, Oct 28, 2015 at 7:41 AM, Bryan Jeffrey <[email protected]>
>>>> wrote:
>>>>
>>>>> Hello.
>>>>>
>>>>> I am working to get a simple solution working using Spark SQL. I am
>>>>> writing streaming data to persistent tables using a HiveContext.
>>>>> Writing to a persistent non-partitioned table works well - I update
>>>>> the table using Spark streaming, and the output is available via Hive
>>>>> Thrift/JDBC.
>>>>>
>>>>> I create a table that looks like the following:
>>>>>
>>>>> 0: jdbc:hive2://localhost:10000> describe windows_event;
>>>>> +--------------------------+---------------------+----------+
>>>>> |         col_name         |      data_type      | comment  |
>>>>> +--------------------------+---------------------+----------+
>>>>> | target_entity            | string              | NULL     |
>>>>> | target_entity_type       | string              | NULL     |
>>>>> | date_time_utc            | timestamp           | NULL     |
>>>>> | machine_ip               | string              | NULL     |
>>>>> | event_id                 | string              | NULL     |
>>>>> | event_data               | map<string,string>  | NULL     |
>>>>> | description              | string              | NULL     |
>>>>> | event_record_id          | string              | NULL     |
>>>>> | level                    | string              | NULL     |
>>>>> | machine_name             | string              | NULL     |
>>>>> | sequence_number          | string              | NULL     |
>>>>> | source                   | string              | NULL     |
>>>>> | source_machine_name      | string              | NULL     |
>>>>> | task_category            | string              | NULL     |
>>>>> | user                     | string              | NULL     |
>>>>> | additional_data          | map<string,string>  | NULL     |
>>>>> | windows_event_time_bin   | timestamp           | NULL     |
>>>>> | # Partition Information  |                     |          |
>>>>> | # col_name               | data_type           | comment  |
>>>>> | windows_event_time_bin   | timestamp           | NULL     |
>>>>> +--------------------------+---------------------+----------+
>>>>>
>>>>> However, when I create a partitioned table and write data using the
>>>>> following:
>>>>>
>>>>> hiveWindowsEvents.foreachRDD( rdd => {
>>>>>   val eventsDataFrame = rdd.toDF()
>>>>>   eventsDataFrame.write.mode(SaveMode.Append).saveAsTable("windows_event")
>>>>> })
>>>>>
>>>>> the data is written as though the table is not partitioned (so
>>>>> everything is written to
>>>>> /user/hive/warehouse/windows_event/file.gz.parquet).
>>>>> Because the data is not following the partition schema, it is not
>>>>> accessible (and not partitioned).
>>>>>
>>>>> Is there a straightforward way to write to partitioned tables using
>>>>> Spark SQL? I understand that the read performance for partitioned data
>>>>> is far better - are there other performance improvements that might be
>>>>> better to use instead of partitioning?
>>>>>
>>>>> Regards,
>>>>>
>>>>> Bryan Jeffrey
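A minimal, self-contained sketch of the partitionBy approach Susan suggests above. The
WindowsEvent case class, the column names, and the streaming setup are illustrative, not
from this thread; the partition column is kept lower-case alphanumeric per Bryan's report
that other names failed for him in 1.4.1:

    import org.apache.spark.sql.SaveMode
    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.streaming.dstream.DStream

    // Illustrative event type -- partition column kept lower-case
    // alphanumeric, per the behaviour reported earlier in the thread.
    case class WindowsEvent(machinename: String, eventid: String, eventdate: String)

    val hiveContext = new HiveContext(sc)   // sc: an existing SparkContext
    import hiveContext.implicits._

    def writeEvents(hiveWindowsEvents: DStream[WindowsEvent]): Unit =
      hiveWindowsEvents.foreachRDD { rdd =>
        rdd.toDF()
          .write
          .mode(SaveMode.Append)           // append each streaming batch
          .partitionBy("eventdate")        // one directory per eventdate value
          .saveAsTable("windows_event")
      }

With partitionBy, each batch lands under .../windows_event/eventdate=<value>/ instead of
one flat directory, which is what makes the partitions visible to readers.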

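For the context-sharing question earlier in the thread, one alternative sketch (not
discussed by the posters above) avoids two drivers booting the embedded Derby metastore
at all: start the Thrift server inside the streaming application with
HiveThriftServer2.startWithContext, available since Spark 1.4, so a single HiveContext
handles both the streaming writes and the JDBC reads:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

    // One driver, one HiveContext: the streaming job writes through it and
    // the embedded Thrift server exposes the same tables over JDBC (port
    // 10000 by default), so only one process ever opens the metastore.
    val sc = new SparkContext(new SparkConf().setAppName("streaming-with-jdbc"))
    val hiveContext = new HiveContext(sc)
    HiveThriftServer2.startWithContext(hiveContext)
    // ... then build the StreamingContext from sc and write tables via hiveContext ...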