Hi Bryan,

For your use case you don't need multiple metastores. The default metastore uses embedded Derby
<https://cwiki.apache.org/confluence/display/Hive/AdminManual+MetastoreAdmin#AdminManualMetastoreAdmin-Local/EmbeddedMetastoreDatabase(Derby)>,
which cannot be shared amongst multiple processes. Just switch to a metastore that supports
multiple connections, viz. networked Derby or MySQL. See
https://cwiki.apache.org/confluence/display/Hive/HiveDerbyServerMode
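For example, with networked Derby the metastore connection in hive-site.xml would look
something like this (host, port and database name are illustrative; for MySQL you would
use a jdbc:mysql:// URL and com.mysql.jdbc.Driver instead):

    <!-- illustrative host/port; requires a running Derby network server -->
    <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:derby://localhost:1527/metastore_db;create=true</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionDriverName</name>
      <value>org.apache.derby.jdbc.ClientDriver</value>
    </property>

The Derby network server itself has to be running first (e.g. via startNetworkServer from
the Derby distribution) so that the Thrift server and the streaming driver can both connect
to the same metastore.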
Deenar
*Think Reactive Ltd*
[email protected]
07714140812

On 29 October 2015 at 00:56, Bryan <[email protected]> wrote:

> Yana,
>
> My basic use-case is that I want to process streaming data, and publish it
> to a persistent Spark table. After that I want to make the published data
> (results) available via JDBC and Spark SQL to drive a web API. That would
> seem to require two drivers starting separate HiveContexts (one for Spark
> SQL/JDBC, one for streaming).
>
> Is there a way to share a HiveContext between the driver for the Thrift
> Spark SQL instance and the streaming Spark driver? A better method to do
> this?
>
> An alternate option might be to create the table in two separate
> metastores and simply use the same storage location for the data. That
> seems very hacky though, and likely to result in maintenance issues.
>
> Regards,
>
> Bryan Jeffrey
> ------------------------------
> From: Yana Kadiyska <[email protected]>
> Sent: 10/28/2015 8:32 PM
> To: Bryan Jeffrey <[email protected]>
> Cc: Susan Zhang <[email protected]>; user <[email protected]>
> Subject: Re: Spark -- Writing to Partitioned Persistent Table
>
> For this issue in particular (ERROR XSDB6: Another instance of Derby may
> have already booted the database /spark/spark-1.4.1/metastore_db) -- I
> think it depends on where you start your application and the
> HiveThriftServer from. I've run into a similar issue running a driver app
> first, which would create a directory called metastore_db. If I then try
> to start spark-shell from the same directory, I will see this exception.
> So it is like SPARK-9776: it's not so much that the two are in the same
> process (as the bug resolution states); I think you can't run two drivers
> which start a HiveContext from the same directory.
>
>
> On Wed, Oct 28, 2015 at 4:10 PM, Bryan Jeffrey <[email protected]>
> wrote:
>
>> All,
>>
>> One issue I'm seeing is that I start the Thrift server (for JDBC access)
>> via the following:
>>
>>   /spark/spark-1.4.1/sbin/start-thriftserver.sh --master
>>     spark://master:7077 --hiveconf "spark.cores.max=2"
>>
>> After about 40 seconds the Thrift server is started and available on
>> default port 10000.
>>
>> I then submit my application - and the application throws the following
>> error:
>>
>> Caused by: java.sql.SQLException: Failed to start database 'metastore_db'
>> with class loader
>> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@6a552721,
>> see the next exception for details.
>>   at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
>>   at org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown Source)
>>   ... 86 more
>> Caused by: java.sql.SQLException: Another instance of Derby may have
>> already booted the database /spark/spark-1.4.1/metastore_db.
>>   at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
>>   at org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown Source)
>>   at org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown Source)
>>   at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown Source)
>>   ... 83 more
>> Caused by: ERROR XSDB6: Another instance of Derby may have already booted
>> the database /spark/spark-1.4.1/metastore_db.
>>
>> This also happens if I do the opposite (submit the application first, and
>> then start the Thrift server).
>>
>> It looks similar to the following issue -- but not quite the same:
>> https://issues.apache.org/jira/browse/SPARK-9776
>>
>> It seems like this set of steps works fine if the metadata database is
>> not yet created - but once it's created this happens every time. Is this
>> a known issue? Is there a workaround?
>>
>> Regards,
>>
>> Bryan Jeffrey
>>
>> On Wed, Oct 28, 2015 at 3:13 PM, Bryan Jeffrey <[email protected]>
>> wrote:
>>
>>> Susan,
>>>
>>> I did give that a shot -- I'm seeing a number of oddities:
>>>
>>> (1) 'Partition By' appears to only accept lower-case alphanumeric field
>>> names. It will work for 'machinename', but not 'machineName' or
>>> 'machine_name'.
>>> (2) When partitioning with maps included in the data I get odd string
>>> conversion issues.
>>> (3) When partitioning without maps I see frequent out-of-memory issues.
>>>
>>> I'll update this email when I've got a more concrete example of the
>>> problems.
>>>
>>> Regards,
>>>
>>> Bryan Jeffrey
>>>
>>> On Wed, Oct 28, 2015 at 1:33 PM, Susan Zhang <[email protected]>
>>> wrote:
>>>
>>>> Have you tried partitionBy?
>>>>
>>>> Something like:
>>>>
>>>> hiveWindowsEvents.foreachRDD( rdd => {
>>>>   val eventsDataFrame = rdd.toDF()
>>>>   eventsDataFrame.write.mode(SaveMode.Append)
>>>>     .partitionBy("windows_event_time_bin")
>>>>     .saveAsTable("windows_event")
>>>> })
>>>>
>>>> On Wed, Oct 28, 2015 at 7:41 AM, Bryan Jeffrey <[email protected]>
>>>> wrote:
>>>>
>>>>> Hello.
>>>>>
>>>>> I am working to get a simple solution working using Spark SQL. I am
>>>>> writing streaming data to persistent tables using a HiveContext.
>>>>> Writing to a persistent non-partitioned table works well - I update
>>>>> the table using Spark streaming, and the output is available via Hive
>>>>> Thrift/JDBC.
>>>>>
>>>>> I create a table that looks like the following:
>>>>>
>>>>> 0: jdbc:hive2://localhost:10000> describe windows_event;
>>>>> +--------------------------+---------------------+----------+
>>>>> |         col_name         |      data_type      | comment  |
>>>>> +--------------------------+---------------------+----------+
>>>>> | target_entity            | string              | NULL     |
>>>>> | target_entity_type       | string              | NULL     |
>>>>> | date_time_utc            | timestamp           | NULL     |
>>>>> | machine_ip               | string              | NULL     |
>>>>> | event_id                 | string              | NULL     |
>>>>> | event_data               | map<string,string>  | NULL     |
>>>>> | description              | string              | NULL     |
>>>>> | event_record_id          | string              | NULL     |
>>>>> | level                    | string              | NULL     |
>>>>> | machine_name             | string              | NULL     |
>>>>> | sequence_number          | string              | NULL     |
>>>>> | source                   | string              | NULL     |
>>>>> | source_machine_name      | string              | NULL     |
>>>>> | task_category            | string              | NULL     |
>>>>> | user                     | string              | NULL     |
>>>>> | additional_data          | map<string,string>  | NULL     |
>>>>> | windows_event_time_bin   | timestamp           | NULL     |
>>>>> | # Partition Information  |                     |          |
>>>>> | # col_name               | data_type           | comment  |
>>>>> | windows_event_time_bin   | timestamp           | NULL     |
>>>>> +--------------------------+---------------------+----------+
>>>>>
>>>>> However, when I create a partitioned table and write data using the
>>>>> following:
>>>>>
>>>>> hiveWindowsEvents.foreachRDD( rdd => {
>>>>>   val eventsDataFrame = rdd.toDF()
>>>>>   eventsDataFrame.write.mode(SaveMode.Append).saveAsTable("windows_event")
>>>>> })
>>>>>
>>>>> the data is written as though the table is not partitioned (so
>>>>> everything is written to
>>>>> /user/hive/warehouse/windows_event/file.gz.parquet).
>>>>> Because the data is not following the partition schema, it is not
>>>>> accessible (and not partitioned).
>>>>>
>>>>> Is there a straightforward way to write to partitioned tables using
>>>>> Spark SQL? I understand that the read performance for partitioned data
>>>>> is far better - are there other performance improvements that might be
>>>>> better to use instead of partitioning?
>>>>>
>>>>> Regards,
>>>>>
>>>>> Bryan Jeffrey
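A minimal, self-contained sketch of the partitionBy approach Susan suggests above. The
WindowsEvent case class, the column names, and the streaming setup are illustrative, not
from this thread; the partition column is kept lower-case alphanumeric per Bryan's report
that other names failed for him in 1.4.1:

    import org.apache.spark.sql.SaveMode
    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.streaming.dstream.DStream

    // Illustrative event type -- partition column kept lower-case
    // alphanumeric, per the behaviour reported earlier in the thread.
    case class WindowsEvent(machinename: String, eventid: String, eventdate: String)

    val hiveContext = new HiveContext(sc)   // sc: an existing SparkContext
    import hiveContext.implicits._

    def writeEvents(hiveWindowsEvents: DStream[WindowsEvent]): Unit =
      hiveWindowsEvents.foreachRDD { rdd =>
        rdd.toDF()
          .write
          .mode(SaveMode.Append)           // append each streaming batch
          .partitionBy("eventdate")        // one directory per eventdate value
          .saveAsTable("windows_event")
      }

With partitionBy, each batch lands under .../windows_event/eventdate=<value>/ instead of
one flat directory, which is what makes the partitions visible to readers.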

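For the context-sharing question earlier in the thread, one alternative sketch (not
discussed by the posters above) avoids two drivers booting the embedded Derby metastore
at all: start the Thrift server inside the streaming application with
HiveThriftServer2.startWithContext, available since Spark 1.4, so a single HiveContext
handles both the streaming writes and the JDBC reads:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

    // One driver, one HiveContext: the streaming job writes through it and
    // the embedded Thrift server exposes the same tables over JDBC (port
    // 10000 by default), so only one process ever opens the metastore.
    val sc = new SparkContext(new SparkConf().setAppName("streaming-with-jdbc"))
    val hiveContext = new HiveContext(sc)
    HiveThriftServer2.startWithContext(hiveContext)
    // ... then build the StreamingContext from sc and write tables via hiveContext ...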