Unable to connect to Spark thrift JDBC server with pluggable authentication

2014-10-17 Thread Jenny Zhao
Hi,

If the Spark Thrift JDBC server is started in non-secure mode, it works
fine. In secured mode with pluggable authentication, I placed the
authentication class configuration in conf/hive-site.xml:

 <property>
  <name>hive.server2.authentication</name>
  <value>CUSTOM</value>
 </property>
 <property>
  <name>hive.server2.custom.authentication.class</name>
  <value>org.apache.hive.service.auth.WebConsoleAuthenticationProviderImpl</value>
 </property>

and the jar containing the implementation is on the Spark classpath, but I
am still getting an exception; it seems it couldn't find the authentication
class I specified in the configuration:

14/10/17 12:44:33 ERROR server.TThreadPoolServer: Error occurred during
processing of message.
java.lang.RuntimeException: java.lang.NoSuchMethodException:
org.apache.hive.service.auth.PasswdAuthenticationProvider.<init>()
at
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:131)
at
org.apache.hive.service.auth.CustomAuthenticationProviderImpl.<init>(CustomAuthenticationProviderImpl.java:38)
at
org.apache.hive.service.auth.AuthenticationProviderFactory.getAuthenticationProvider(AuthenticationProviderFactory.java:57)
at
org.apache.hive.service.auth.PlainSaslHelper$PlainServerCallbackHandler.handle(PlainSaslHelper.java:61)
at
org.apache.hive.service.auth.PlainSaslServer.evaluateResponse(PlainSaslServer.java:127)
at
org.apache.thrift.transport.TSaslTransport$SaslParticipant.evaluateChallengeOrResponse(TSaslTransport.java:509)
at
org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:264)
at
org.apache.thrift.transport.TSaslServerTransport.open(TSaslServerTransport.java:41)
at
org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:216)
at
org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:189)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1176)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:641)
at java.lang.Thread.run(Thread.java:853)
Caused by: java.lang.NoSuchMethodException:
org.apache.hive.service.auth.PasswdAuthenticationProvider.<init>()
at java.lang.Class.throwNoSuchMethodException(Class.java:367)
at java.lang.Class.getDeclaredConstructor(Class.java:541)
at
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:125)

why is that?
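
For reference, a provider for HiveServer2's CUSTOM authentication mode
implements org.apache.hive.service.auth.PasswdAuthenticationProvider, which
in Hive 0.12 declares a single Authenticate(user, password) method throwing
javax.security.sasl.AuthenticationException. A minimal Scala sketch of such
a class is below; the class name and the credential check are placeholders,
not the actual WebConsole implementation. The NoSuchMethodException on
PasswdAuthenticationProvider.<init>() above suggests the configured custom
class could not be loaded, so HiveServer2 appears to have fallen back to
instantiating the interface itself via ReflectionUtils, which fails because
an interface has no constructor.

import javax.security.sasl.AuthenticationException
import org.apache.hive.service.auth.PasswdAuthenticationProvider

// Minimal sketch of a pluggable HiveServer2 provider. The class name and the
// credential check are placeholders; a real provider would call the actual
// backend (e.g. the web console) to validate the user.
class ExampleWebConsoleAuthenticationProvider extends PasswdAuthenticationProvider {

  // HiveServer2 instantiates the provider reflectively, so this class (and
  // its default public no-arg constructor) must be visible on the server's
  // classpath.
  @throws(classOf[AuthenticationException])
  override def Authenticate(user: String, password: String): Unit = {
    if (!checkWithBackend(user, password)) {
      throw new AuthenticationException("Authentication failed for user " + user)
    }
  }

  // Placeholder check; replace with the real lookup against the backend.
  private def checkWithBackend(user: String, password: String): Boolean =
    user != null && user.nonEmpty && password != null && password.nonEmpty
}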

Thanks for your help!

Jenny


Re: Spark sql failed in yarn-cluster mode when connecting to non-default hive database

2014-08-12 Thread Jenny Zhao
Hi Yin,

hive-site.xml was copied to spark/conf and is the same as the one under
$HIVE_HOME/conf.

Through the Hive CLI I don't see any problem, but for Spark in yarn-cluster
mode I am not able to switch to a database other than the default one; in
yarn-client mode it works fine.

Thanks!

Jenny


On Tue, Aug 12, 2014 at 12:53 PM, Yin Huai huaiyin@gmail.com wrote:

 Hi Jenny,

 Have you copied hive-site.xml to spark/conf directory? If not, can you
 put it in conf/ and try again?

 Thanks,

 Yin


 On Mon, Aug 11, 2014 at 8:57 PM, Jenny Zhao linlin200...@gmail.com
 wrote:


 Thanks Yin!

 here is my hive-site.xml, which I copied from $HIVE_HOME/conf; I didn't
 experience any problem connecting to the metastore through Hive, which uses
 DB2 as the metastore database.

 <?xml version="1.0"?>
 <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
 <!--
    Licensed to the Apache Software Foundation (ASF) under one or more
    contributor license agreements.  See the NOTICE file distributed with
    this work for additional information regarding copyright ownership.
    The ASF licenses this file to You under the Apache License, Version 2.0
    (the "License"); you may not use this file except in compliance with
    the License.  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software
    distributed under the License is distributed on an "AS IS" BASIS,
    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    See the License for the specific language governing permissions and
    limitations under the License.
 -->
 <configuration>
  <property>
   <name>hive.hwi.listen.port</name>
   <value></value>
  </property>
  <property>
   <name>hive.querylog.location</name>
   <value>/var/ibm/biginsights/hive/query/${user.name}</value>
  </property>
  <property>
   <name>hive.metastore.warehouse.dir</name>
   <value>/biginsights/hive/warehouse</value>
  </property>
  <property>
   <name>hive.hwi.war.file</name>
   <value>lib/hive-hwi-0.12.0.war</value>
  </property>
  <property>
   <name>hive.metastore.metrics.enabled</name>
   <value>true</value>
  </property>
  <property>
   <name>javax.jdo.option.ConnectionURL</name>
   <value>jdbc:db2://hdtest022.svl.ibm.com:50001/BIDB</value>
  </property>
  <property>
   <name>javax.jdo.option.ConnectionDriverName</name>
   <value>com.ibm.db2.jcc.DB2Driver</value>
  </property>
  <property>
   <name>hive.stats.autogather</name>
   <value>false</value>
  </property>
  <property>
   <name>javax.jdo.mapping.Schema</name>
   <value>HIVE</value>
  </property>
  <property>
   <name>javax.jdo.option.ConnectionUserName</name>
   <value>catalog</value>
  </property>
  <property>
   <name>javax.jdo.option.ConnectionPassword</name>
   <value>V2pJNWMxbFlVbWhaZHowOQ==</value>
  </property>
  <property>
   <name>hive.metastore.password.encrypt</name>
   <value>true</value>
  </property>
  <property>
   <name>org.jpox.autoCreateSchema</name>
   <value>true</value>
  </property>
  <property>
   <name>hive.server2.thrift.min.worker.threads</name>
   <value>5</value>
  </property>
  <property>
   <name>hive.server2.thrift.max.worker.threads</name>
   <value>100</value>
  </property>
  <property>
   <name>hive.server2.thrift.port</name>
   <value>1</value>
  </property>
  <property>
   <name>hive.server2.thrift.bind.host</name>
   <value>hdtest022.svl.ibm.com</value>
  </property>
  <property>
   <name>hive.server2.authentication</name>
   <value>CUSTOM</value>
  </property>
  <property>
   <name>hive.server2.custom.authentication.class</name>
   <value>org.apache.hive.service.auth.WebConsoleAuthenticationProviderImpl</value>
  </property>
  <property>
   <name>hive.server2.enable.impersonation</name>
   <value>true</value>
  </property>
  <property>
   <name>hive.security.webconsole.url</name>
   <value>http://hdtest022.svl.ibm.com:8080</value>
  </property>
  <property>
   <name>hive.security.authorization.enabled</name>
   <value>true</value>
  </property>
  <property>
   <name>hive.security.authorization.createtable.owner.grants</name>
   <value>ALL</value>
  </property>
 </configuration>



 On Mon, Aug 11, 2014 at 4:29 PM, Yin Huai huaiyin@gmail.com wrote:

 Hi Jenny,

 How's your metastore configured for both Hive and Spark SQL? Which
 metastore mode are you using (based on
 https://cwiki.apache.org/confluence/display/Hive/AdminManual+MetastoreAdmin
 )?

 Thanks,

 Yin


 On Mon, Aug 11, 2014 at 6:15 PM, Jenny Zhao linlin200...@gmail.com
 wrote:



 You can reproduce this issue with the following steps (assuming you
 have a Yarn cluster + Hive 0.12):

 1) Using the Hive shell, create a database, e.g.: create database ttt

 2) Write a simple Spark SQL program:

 import org.apache.spark.{SparkConf, SparkContext}
 import org.apache.spark.sql._
 import org.apache.spark.sql.hive.HiveContext

 object HiveSpark {
   case class Record(key: Int, value: String)

   def main(args: Array[String]) {
     val sparkConf = new SparkConf().setAppName("HiveSpark")
     val sc = new SparkContext(sparkConf)

     // A hive context creates an instance of the Hive Metastore in process
     val hiveContext = new HiveContext(sc)
     import hiveContext._

     hql("use ttt")

Re: Spark sql failed in yarn-cluster mode when connecting to non-default hive database

2014-08-11 Thread Jenny Zhao
Thanks Yin!

here is my hive-site.xml, which I copied from $HIVE_HOME/conf; I didn't
experience any problem connecting to the metastore through Hive, which uses
DB2 as the metastore database.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
   Licensed to the Apache Software Foundation (ASF) under one or more
   contributor license agreements.  See the NOTICE file distributed with
   this work for additional information regarding copyright ownership.
   The ASF licenses this file to You under the Apache License, Version 2.0
   (the "License"); you may not use this file except in compliance with
   the License.  You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
-->
<configuration>
 <property>
  <name>hive.hwi.listen.port</name>
  <value></value>
 </property>
 <property>
  <name>hive.querylog.location</name>
  <value>/var/ibm/biginsights/hive/query/${user.name}</value>
 </property>
 <property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/biginsights/hive/warehouse</value>
 </property>
 <property>
  <name>hive.hwi.war.file</name>
  <value>lib/hive-hwi-0.12.0.war</value>
 </property>
 <property>
  <name>hive.metastore.metrics.enabled</name>
  <value>true</value>
 </property>
 <property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:db2://hdtest022.svl.ibm.com:50001/BIDB</value>
 </property>
 <property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.ibm.db2.jcc.DB2Driver</value>
 </property>
 <property>
  <name>hive.stats.autogather</name>
  <value>false</value>
 </property>
 <property>
  <name>javax.jdo.mapping.Schema</name>
  <value>HIVE</value>
 </property>
 <property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>catalog</value>
 </property>
 <property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>V2pJNWMxbFlVbWhaZHowOQ==</value>
 </property>
 <property>
  <name>hive.metastore.password.encrypt</name>
  <value>true</value>
 </property>
 <property>
  <name>org.jpox.autoCreateSchema</name>
  <value>true</value>
 </property>
 <property>
  <name>hive.server2.thrift.min.worker.threads</name>
  <value>5</value>
 </property>
 <property>
  <name>hive.server2.thrift.max.worker.threads</name>
  <value>100</value>
 </property>
 <property>
  <name>hive.server2.thrift.port</name>
  <value>1</value>
 </property>
 <property>
  <name>hive.server2.thrift.bind.host</name>
  <value>hdtest022.svl.ibm.com</value>
 </property>
 <property>
  <name>hive.server2.authentication</name>
  <value>CUSTOM</value>
 </property>
 <property>
  <name>hive.server2.custom.authentication.class</name>
  <value>org.apache.hive.service.auth.WebConsoleAuthenticationProviderImpl</value>
 </property>
 <property>
  <name>hive.server2.enable.impersonation</name>
  <value>true</value>
 </property>
 <property>
  <name>hive.security.webconsole.url</name>
  <value>http://hdtest022.svl.ibm.com:8080</value>
 </property>
 <property>
  <name>hive.security.authorization.enabled</name>
  <value>true</value>
 </property>
 <property>
  <name>hive.security.authorization.createtable.owner.grants</name>
  <value>ALL</value>
 </property>
</configuration>



On Mon, Aug 11, 2014 at 4:29 PM, Yin Huai huaiyin@gmail.com wrote:

 Hi Jenny,

 How's your metastore configured for both Hive and Spark SQL? Which
 metastore mode are you using (based on
 https://cwiki.apache.org/confluence/display/Hive/AdminManual+MetastoreAdmin
 )?

 Thanks,

 Yin


 On Mon, Aug 11, 2014 at 6:15 PM, Jenny Zhao linlin200...@gmail.com
 wrote:



 You can reproduce this issue with the following steps (assuming you have a
 Yarn cluster + Hive 0.12):

 1) Using the Hive shell, create a database, e.g.: create database ttt

 2) Write a simple Spark SQL program:

 import org.apache.spark.{SparkConf, SparkContext}
 import org.apache.spark.sql._
 import org.apache.spark.sql.hive.HiveContext

 object HiveSpark {
   case class Record(key: Int, value: String)

   def main(args: Array[String]) {
     val sparkConf = new SparkConf().setAppName("HiveSpark")
     val sc = new SparkContext(sparkConf)

     // A hive context creates an instance of the Hive Metastore in process
     val hiveContext = new HiveContext(sc)
     import hiveContext._

     hql("use ttt")
     hql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
     hql("LOAD DATA INPATH '/user/biadmin/kv1.txt' INTO TABLE src")

     // Queries are expressed in HiveQL
     println("Result of 'SELECT *': ")
     hql("SELECT * FROM src").collect.foreach(println)
     sc.stop()
   }
 }
 3) Run it in yarn-cluster mode (for example, with a spark-submit command
 like the sketch below).
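
 A sketch of such a submit command (the jar path is a placeholder for
 however the program above gets packaged):

 $SPARK_HOME/bin/spark-submit \
   --class HiveSpark \
   --master yarn-cluster \
   /path/to/hive-spark-example.jar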


 On Mon, Aug 11, 2014 at 9:44 AM, Cheng Lian lian.cs@gmail.com
 wrote:

 Since you were using hql(...), it’s probably not related to the JDBC
 driver. But I failed to reproduce this issue locally with a single-node
 pseudo-distributed YARN cluster. Would you mind elaborating on the steps
 to reproduce this bug? Thanks


 On Sun, Aug 10, 2014 at 9:36 PM, Cheng Lian

Spark sql failed in yarn-cluster mode when connecting to non-default hive database

2014-08-08 Thread Jenny Zhao
Hi,

I am able to run my HQL query in yarn-cluster mode when connecting to the
default Hive metastore defined in hive-site.xml.

However, if I want to switch to a different database, like:

  hql("use other-database")

it only works in yarn-client mode, but fails in yarn-cluster mode with the
following stack:

14/08/08 12:09:11 INFO HiveMetaStore: 0: get_database: tt
14/08/08 12:09:11 INFO audit:
ugi=biadmin ip=unknown-ip-addr  cmd=get_database: tt
14/08/08 12:09:11 ERROR RetryingHMSHandler:
NoSuchObjectException(message:There is no database named tt)
at 
org.apache.hadoop.hive.metastore.ObjectStore.getMDatabase(ObjectStore.java:431)
at 
org.apache.hadoop.hive.metastore.ObjectStore.getDatabase(ObjectStore.java:441)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:60)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
at java.lang.reflect.Method.invoke(Method.java:611)
at 
org.apache.hadoop.hive.metastore.RetryingRawStore.invoke(RetryingRawStore.java:124)
at $Proxy15.getDatabase(Unknown Source)
at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_database(HiveMetaStore.java:628)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:60)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
at java.lang.reflect.Method.invoke(Method.java:611)
at 
org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:103)
at $Proxy17.get_database(Unknown Source)
at 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getDatabase(HiveMetaStoreClient.java:810)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:60)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
at java.lang.reflect.Method.invoke(Method.java:611)
at 
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89)
at $Proxy18.getDatabase(Unknown Source)
at org.apache.hadoop.hive.ql.metadata.Hive.getDatabase(Hive.java:1139)
at 
org.apache.hadoop.hive.ql.metadata.Hive.databaseExists(Hive.java:1128)
at 
org.apache.hadoop.hive.ql.exec.DDLTask.switchDatabase(DDLTask.java:3479)
at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:237)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:151)
at 
org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:65)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1414)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1192)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1020)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:888)
at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:208)
at 
org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:182)
at 
org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:272)
at 
org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:269)
at org.apache.spark.sql.hive.HiveContext.hiveql(HiveContext.scala:86)
at org.apache.spark.sql.hive.HiveContext.hql(HiveContext.scala:91)
at 
org.apache.spark.examples.sql.hive.HiveSpark$.main(HiveSpark.scala:35)
at org.apache.spark.examples.sql.hive.HiveSpark.main(HiveSpark.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:60)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
at java.lang.reflect.Method.invoke(Method.java:611)
at 
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:186)

14/08/08 12:09:11 ERROR DDLTask:
org.apache.hadoop.hive.ql.metadata.HiveException: Database does not
exist: tt
at 
org.apache.hadoop.hive.ql.exec.DDLTask.switchDatabase(DDLTask.java:3480)
at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:237)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:151)
at 
org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:65)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1414)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1192)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1020)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:888)
at 

Spark sql with hive table running on Yarn-cluster mode

2014-07-22 Thread Jenny Zhao
Hi,

For running Spark SQL, the datanucleus*.jar files are automatically added to
the classpath. This works fine for Spark standalone mode and yarn-client
mode; however, for yarn-cluster mode I have to explicitly pass these jars
using the --jars option when submitting the job (roughly as sketched below),
otherwise the job fails. Why doesn't it work for yarn-cluster mode?
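
A sketch of the kind of submit command I mean (the datanucleus jar names,
versions and paths are placeholders for whatever your Spark build ships,
e.g. under lib_managed/jars):

bin/spark-submit \
  --master yarn-cluster \
  --class HiveSpark \
  --jars /path/to/datanucleus-core.jar,/path/to/datanucleus-api-jdo.jar,/path/to/datanucleus-rdbms.jar \
  /path/to/my-spark-sql-app.jar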

Thank you for your help!

Jenny


Re: Spark sql unable to connect to db2 hive metastore

2014-06-17 Thread Jenny Zhao
Thanks Michael!

As I run it using spark-shell, I added both jars through the bin/spark-shell
--jars option. I noticed that if I don't pass these jars, it complains it
couldn't find the driver; if I pass them through --jars, it complains there
is no suitable driver.

Regards.


On Tue, Jun 17, 2014 at 2:43 AM, Michael Armbrust mich...@databricks.com
wrote:

 First a clarification:  Spark SQL does not talk to HiveServer2, as that
 JDBC interface is for retrieving results from queries that are executed
 using Hive.  Instead Spark SQL will execute queries itself by directly
 accessing your data using Spark.

 Spark SQL's Hive module can use JDBC to connect to an external metastore,
 in your case DB2. This is only used to retrieve the metadata (i.e., column
 names and types, HDFS locations for data).

 Looking at your exception I still see "java.sql.SQLException: No suitable
 driver", so my guess would be that the DB2 JDBC drivers are not being
 correctly included.  How are you trying to add them to the classpath?

 Michael


 On Tue, Jun 17, 2014 at 1:29 AM, Jenny Zhao linlin200...@gmail.com
 wrote:


 Hi,

 My Hive configuration uses DB2 as its metastore database. I have built
 Spark with the extra step sbt/sbt assembly/assembly to include the
 dependency jars, and copied HIVE_HOME/conf/hive-site.xml under spark/conf.
 When I ran:

 hql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")

 I got the following exception (a portion of the stack trace is pasted
 here). Looking at the stack, this made me wonder whether Spark supports a
 remote metastore configuration; it seems Spark doesn't talk to HiveServer2
 directly? The driver jars db2jcc-10.5.jar and db2jcc_license_cisuz-10.5.jar
 are both included in the classpath, otherwise it complains it couldn't find
 the driver.

 Appreciate any help to resolve it.

 Thanks!

 Caused by: java.sql.SQLException: Unable to open a test connection to the
 given database. JDBC url = jdbc:db2://localhost:50001/BIDB, username =
 catalog. Terminating connection pool. Original Exception: --
 java.sql.SQLException: No suitable driver
 at java.sql.DriverManager.getConnection(DriverManager.java:422)
 at java.sql.DriverManager.getConnection(DriverManager.java:374)
 at
 com.jolbox.bonecp.BoneCP.obtainRawInternalConnection(BoneCP.java:254)
 at com.jolbox.bonecp.BoneCP.init(BoneCP.java:305)
 at
 com.jolbox.bonecp.BoneCPDataSource.maybeInit(BoneCPDataSource.java:150)
 at
 com.jolbox.bonecp.BoneCPDataSource.getConnection(BoneCPDataSource.java:112)
 at
 org.datanucleus.store.rdbms.ConnectionFactoryImpl$ManagedConnectionImpl.getConnection(ConnectionFactoryImpl.java:479)
 at
 org.datanucleus.store.rdbms.RDBMSStoreManager.init(RDBMSStoreManager.java:304)
 at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
 Method)
 at
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:56)
 at
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:39)
 at java.lang.reflect.Constructor.newInstance(Constructor.java:527)
 at
 org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:631)
 at
 org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:301)
 at
 org.datanucleus.NucleusContext.createStoreManagerForProperties(NucleusContext.java:1069)
 at
 org.datanucleus.NucleusContext.initialise(NucleusContext.java:359)
 at
 org.datanucleus.api.jdo.JDOPersistenceManagerFactory.freezeConfiguration(JDOPersistenceManagerFactory.java:768)
 at
 org.datanucleus.api.jdo.JDOPersistenceManagerFactory.createPersistenceManagerFactory(JDOPersistenceManagerFactory.java:326)
 at
 org.datanucleus.api.jdo.JDOPersistenceManagerFactory.getPersistenceManagerFactory(JDOPersistenceManagerFactory.java:195)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:60)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
 at java.lang.reflect.Method.invoke(Method.java:611)
 at javax.jdo.JDOHelper$16.run(JDOHelper.java:1965)
 at
 java.security.AccessController.doPrivileged(AccessController.java:277)
 at javax.jdo.JDOHelper.invoke(JDOHelper.java:1960)
 at
 javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1166)
 at
 javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808)
 at
 javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:701)
 at
 org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:275)
 at
 org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:304

Re: Spark sql unable to connect to db2 hive metastore

2014-06-17 Thread Jenny Zhao
Finally got it to work: I mimicked how Spark adds the datanucleus jars in
bin/compute-classpath.sh and added the db2jcc*.jar files to the classpath.
It works now.
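
Roughly, the change amounts to appending the driver jars to the CLASSPATH
variable that bin/compute-classpath.sh builds (the paths below are
placeholders for wherever the DB2 jars actually live):

# added near where compute-classpath.sh handles the datanucleus jars
CLASSPATH="$CLASSPATH:/path/to/db2jcc-10.5.jar:/path/to/db2jcc_license_cisuz-10.5.jar"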

Thanks!


On Tue, Jun 17, 2014 at 10:50 AM, Jenny Zhao linlin200...@gmail.com wrote:

 Thanks Michael!

 As I run it using spark-shell, I added both jars through the
 bin/spark-shell --jars option. I noticed that if I don't pass these jars,
 it complains it couldn't find the driver; if I pass them through --jars,
 it complains there is no suitable driver.

 Regards.


 On Tue, Jun 17, 2014 at 2:43 AM, Michael Armbrust mich...@databricks.com
 wrote:

 First a clarification:  Spark SQL does not talk to HiveServer2, as that
 JDBC interface is for retrieving results from queries that are executed
 using Hive.  Instead Spark SQL will execute queries itself by directly
 accessing your data using Spark.

 Spark SQL's Hive module can use JDBC to connect to an external metastore,
 in your case DB2. This is only used to retrieve the metadata (i.e., column
 names and types, HDFS locations for data).

 Looking at your exception I still see "java.sql.SQLException: No suitable
 driver", so my guess would be that the DB2 JDBC drivers are not being
 correctly included.  How are you trying to add them to the classpath?

 Michael


 On Tue, Jun 17, 2014 at 1:29 AM, Jenny Zhao linlin200...@gmail.com
 wrote:


 Hi,

 My Hive configuration uses DB2 as its metastore database. I have built
 Spark with the extra step sbt/sbt assembly/assembly to include the
 dependency jars, and copied HIVE_HOME/conf/hive-site.xml under spark/conf.
 When I ran:

 hql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")

 I got the following exception (a portion of the stack trace is pasted
 here). Looking at the stack, this made me wonder whether Spark supports a
 remote metastore configuration; it seems Spark doesn't talk to HiveServer2
 directly? The driver jars db2jcc-10.5.jar and db2jcc_license_cisuz-10.5.jar
 are both included in the classpath, otherwise it complains it couldn't find
 the driver.

 Appreciate any help to resolve it.

 Thanks!

 Caused by: java.sql.SQLException: Unable to open a test connection to
 the given database. JDBC url = jdbc:db2://localhost:50001/BIDB, username =
 catalog. Terminating connection pool. Original Exception: --
 java.sql.SQLException: No suitable driver
 at java.sql.DriverManager.getConnection(DriverManager.java:422)
 at java.sql.DriverManager.getConnection(DriverManager.java:374)
 at
 com.jolbox.bonecp.BoneCP.obtainRawInternalConnection(BoneCP.java:254)
 at com.jolbox.bonecp.BoneCP.init(BoneCP.java:305)
 at
 com.jolbox.bonecp.BoneCPDataSource.maybeInit(BoneCPDataSource.java:150)
 at
 com.jolbox.bonecp.BoneCPDataSource.getConnection(BoneCPDataSource.java:112)
 at
 org.datanucleus.store.rdbms.ConnectionFactoryImpl$ManagedConnectionImpl.getConnection(ConnectionFactoryImpl.java:479)
 at
 org.datanucleus.store.rdbms.RDBMSStoreManager.init(RDBMSStoreManager.java:304)
 at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
 Method)
 at
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:56)
 at
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:39)
 at
 java.lang.reflect.Constructor.newInstance(Constructor.java:527)
 at
 org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:631)
 at
 org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:301)
 at
 org.datanucleus.NucleusContext.createStoreManagerForProperties(NucleusContext.java:1069)
 at
 org.datanucleus.NucleusContext.initialise(NucleusContext.java:359)
 at
 org.datanucleus.api.jdo.JDOPersistenceManagerFactory.freezeConfiguration(JDOPersistenceManagerFactory.java:768)
 at
 org.datanucleus.api.jdo.JDOPersistenceManagerFactory.createPersistenceManagerFactory(JDOPersistenceManagerFactory.java:326)
 at
 org.datanucleus.api.jdo.JDOPersistenceManagerFactory.getPersistenceManagerFactory(JDOPersistenceManagerFactory.java:195)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:60)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
 at java.lang.reflect.Method.invoke(Method.java:611)
 at javax.jdo.JDOHelper$16.run(JDOHelper.java:1965)
 at
 java.security.AccessController.doPrivileged(AccessController.java:277)
 at javax.jdo.JDOHelper.invoke(JDOHelper.java:1960)
 at
 javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1166)
 at
 javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808)
 at
 javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java

Spark sql unable to connect to db2 hive metastore

2014-06-16 Thread Jenny Zhao
Hi,

My Hive configuration uses DB2 as its metastore database. I have built
Spark with the extra step sbt/sbt assembly/assembly to include the
dependency jars, and copied HIVE_HOME/conf/hive-site.xml under spark/conf.
When I ran:

hql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")

I got the following exception (a portion of the stack trace is pasted here).
Looking at the stack, this made me wonder whether Spark supports a remote
metastore configuration; it seems Spark doesn't talk to HiveServer2
directly? The driver jars db2jcc-10.5.jar and db2jcc_license_cisuz-10.5.jar
are both included in the classpath, otherwise it complains it couldn't find
the driver.

Appreciate any help to resolve it.

Thanks!

Caused by: java.sql.SQLException: Unable to open a test connection to the
given database. JDBC url = jdbc:db2://localhost:50001/BIDB, username =
catalog. Terminating connection pool. Original Exception: --
java.sql.SQLException: No suitable driver
at java.sql.DriverManager.getConnection(DriverManager.java:422)
at java.sql.DriverManager.getConnection(DriverManager.java:374)
at
com.jolbox.bonecp.BoneCP.obtainRawInternalConnection(BoneCP.java:254)
at com.jolbox.bonecp.BoneCP.init(BoneCP.java:305)
at
com.jolbox.bonecp.BoneCPDataSource.maybeInit(BoneCPDataSource.java:150)
at
com.jolbox.bonecp.BoneCPDataSource.getConnection(BoneCPDataSource.java:112)
at
org.datanucleus.store.rdbms.ConnectionFactoryImpl$ManagedConnectionImpl.getConnection(ConnectionFactoryImpl.java:479)
at
org.datanucleus.store.rdbms.RDBMSStoreManager.init(RDBMSStoreManager.java:304)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
Method)
at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:56)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:39)
at java.lang.reflect.Constructor.newInstance(Constructor.java:527)
at
org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:631)
at
org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:301)
at
org.datanucleus.NucleusContext.createStoreManagerForProperties(NucleusContext.java:1069)
at
org.datanucleus.NucleusContext.initialise(NucleusContext.java:359)
at
org.datanucleus.api.jdo.JDOPersistenceManagerFactory.freezeConfiguration(JDOPersistenceManagerFactory.java:768)
at
org.datanucleus.api.jdo.JDOPersistenceManagerFactory.createPersistenceManagerFactory(JDOPersistenceManagerFactory.java:326)
at
org.datanucleus.api.jdo.JDOPersistenceManagerFactory.getPersistenceManagerFactory(JDOPersistenceManagerFactory.java:195)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:60)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
at java.lang.reflect.Method.invoke(Method.java:611)
at javax.jdo.JDOHelper$16.run(JDOHelper.java:1965)
at
java.security.AccessController.doPrivileged(AccessController.java:277)
at javax.jdo.JDOHelper.invoke(JDOHelper.java:1960)
at
javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1166)
at
javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808)
at
javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:701)
at
org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:275)
at
org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:304)
at
org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:234)
at
org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:209)
at
org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73)
at
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
at
org.apache.hadoop.hive.metastore.RetryingRawStore.init(RetryingRawStore.java:64)
at
org.apache.hadoop.hive.metastore.RetryingRawStore.getProxy(RetryingRawStore.java:73)
at
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStore(HiveMetaStore.java:415)
at
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:402)
at
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:441)
at
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:326)
at
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:286)
at
org.apache.hadoop.hive.metastore.RetryingHMSHandler.init(RetryingHMSHandler.java:54)
at
org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:59)
   

Re: Invalid Class Exception

2014-06-06 Thread Jenny Zhao
We experienced a similar issue in our environment; below is the whole stack
trace. It works fine if we run in local mode, but if we run in cluster mode
(even with the Master and 1 worker on the same node), we hit this
serialVersionUID issue. We use Spark 1.0.0, compiled with JDK 6.

Here is a link about serialVersionUID and the suggestion to use it for a
Serializable class, i.e. to define a serialVersionUID in the serializable
class (a minimal sketch follows the link):
http://stackoverflow.com/questions/285793/what-is-a-serialversionuid-and-why-should-i-use-it
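
What that suggestion amounts to, as a minimal Scala sketch (the class and
the value are only illustrative; note that the mismatch reported below is in
Spark's own org.apache.spark.SerializableWritable, which an application
cannot annotate itself):

// Pinning serialVersionUID so that instances serialized by one build of the
// class can still be deserialized by another build of the same class.
@SerialVersionUID(1L)
class MyRecord(val key: Int, val value: String) extends Serializable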


14/06/05 09:52:18 WARN scheduler.TaskSetManager: Lost TID 9 (task 1.0:9)
14/06/05 09:52:18 WARN scheduler.TaskSetManager: Loss was due to
java.io.InvalidClassException
java.io.InvalidClassException: org.apache.spark.SerializableWritable; local
class incompatible: stream classdesc serialVersionUID =
6301214776158303468, local class serialVersionUID = -7785455416944904980
at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:630)
at
java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1600)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1513)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1749)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1346)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:365)
at
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
at
org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:165)
at
org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:60)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
at java.lang.reflect.Method.invoke(Method.java:611)
at
java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1039)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1866)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1770)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1346)
at
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1964)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1888)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1770)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1346)
at
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1964)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1888)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1770)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1346)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:365)
at scala.collection.immutable.$colon$colon.readObject(List.scala:362)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:60)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
at java.lang.reflect.Method.invoke(Method.java:611)
at
java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1039)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1866)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1770)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1346)
at
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1964)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1888)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1770)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1346)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:365)
at
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
at
org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63)
at
org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139)
at
java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1809)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1768)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1346)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:365)
at
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
at
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62)
at
org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:195)
at
org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49)
at 

configure spark history server for running on Yarn

2014-05-01 Thread Jenny Zhao
Hi,

I have installed Spark 1.0 from branch-1.0; the build went fine, and I have
tried running the example in Yarn client mode. Here is my command:

/home/hadoop/spark-branch-1.0/bin/spark-submit
/home/hadoop/spark-branch-1.0/examples/target/scala-2.10/spark-examples-1.0.0-hadoop2.2.0.jar
--master yarn --deploy-mode client --executor-memory 6g --executor-cores 3
--driver-memory 3g --name SparkPi --num-executors 2 --class
org.apache.spark.examples.SparkPi yarn-client 5

After the run, I was not able to retrieve the log from Yarn's web UI,
although I tried to specify the history server in spark-env.sh:

export SPARK_DAEMON_JAVA_OPTS="-Dspark.yarn.historyServer.address=master:18080"
(master here being hdtest022.svl.ibm.com, i.e. http://hdtest022.svl.ibm.com:18080)


I also tried to specify it in spark-defaults.conf, which doesn't work
either. I would appreciate it if someone could tell me the right way to
specify it in either spark-env.sh or spark-defaults.conf, so that this
option is applied to any Spark application.
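
For reference, the property itself would look like this in
conf/spark-defaults.conf (the host and port are placeholders; this assumes
the same spark.yarn.historyServer.address property used above):

spark.yarn.historyServer.address   history-server-host:18080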

Another thing I found is that the usage output for spark-submit is not
complete / not in sync with the online documentation; I hope this is
addressed in the formal release.

Also, is this the latest documentation for Spark 1.0?
http://people.csail.mit.edu/matei/spark-unified-docs/running-on-yarn.html

Thank you!


Problem with running LogisticRegression in spark cluster mode

2014-04-09 Thread Jenny Zhao
Hi all,

I have been able to run LR in local mode, but I am facing a problem running
it in cluster mode. Below are the source script and the stack trace from
running it in cluster mode. I used sbt package to build the project; I am
not sure what it is complaining about.

Another question I have is about LogisticRegression itself:

1) I noticed that LogisticRegressionWithSGD doesn't ask for information
about the input features, for instance whether a feature is scale, nominal
or ordinal; or does MLlib only support scale features?

2) Training error is pretty high even when the number of iterations is set
very high; do we have numbers on the accuracy of the LR model?

Thank you for your help!

import org.apache.spark.SparkContext
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.regression.LabeledPoint

/**
 * Logistic regression
 */
object SparkLogisticRegression {

  def main(args: Array[String]) {
    if (args.length != 3) {
      System.err.println("Usage: SparkLogisticRegression <master> <input file path> <number of iterations>")
      System.exit(1)
    }

    val numIterations = args(2).toInt

    val sc = new SparkContext(args(0), "SparkLogisticRegression",
      System.getenv("SPARK_HOME"),
      SparkContext.jarOfClass(this.getClass))

    // parse in the input data
    val data = sc.textFile(args(1))
    val lpoints = data.map { line =>
      val parts = line.split(',')
      LabeledPoint(parts(0).toDouble, parts.tail.map(x => x.toDouble).toArray)
    }

    // setup LR
    val model = LogisticRegressionWithSGD.train(lpoints, numIterations)

    val labelPred = lpoints.map { p =>
      val pred = model.predict(p.features)
      (p.label, pred)
    }

    val predErr = labelPred.filter(r => r._1 != r._2).count
    println("Training Error: " + predErr.toDouble / lpoints.count + " " +
      predErr + "/" + lpoints.count)
  }

}

14/04/09 14:50:48 WARN scheduler.TaskSetManager: Lost TID 0 (task 0.0:0)
14/04/09 14:50:48 WARN scheduler.TaskSetManager: Loss was due to
java.lang.ClassNotFoundException
java.lang.ClassNotFoundException: SparkLinearRegression$$anonfun$2
at java.lang.Class.forName(Class.java:211)
at
org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:37)
at
java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1609)
at
java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1514)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1768)
at
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1347)
at
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1988)
at
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1913)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1795)
at
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1347)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:364)
at
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
at
org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63)
at
org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139)
at
java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1834)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1793)
at
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1347)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:364)
at
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
at
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62)
at
org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:195)
at
org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49)
at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:906)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:929)
at java.lang.Thread.run(Thread.java:796)
14/04/09 14:50:48 WARN scheduler.TaskSetManager: Lost TID 1 (task 0.0:1)
14/04/09 14:50:48 INFO scheduler.TaskSetManager: Loss was due to
java.lang.ClassNotFoundException: SparkLinearRegression$$anonfun$2
[duplicate 1]
14/04/09 14:50:48 INFO scheduler.TaskSetManager: Starting task 0.0:1 as TID
2 on executor 1: hdtest022.svl.ibm.com (NODE_LOCAL)
14/04/09 14:50:48 INFO scheduler.TaskSetManager: Serialized task 0.0:1 as
1696 bytes in 0 ms
14/04/09 14:50:48 INFO scheduler.TaskSetManager: Starting task 0.0:0 as TID
3 on executor 0: hdtest023.svl.ibm.com (NODE_LOCAL)
14/04/09 14:50:48 INFO scheduler.TaskSetManager: Serialized task 0.0:0 as
1696 bytes in 0 ms
14/04/09 14:50:48 WARN scheduler.TaskSetManager: Lost TID 3 (task 0.0:0)
14/04/09 14:50:48 INFO scheduler.TaskSetManager: Loss was due to

Re: Problem with running LogisticRegression in spark cluster mode

2014-04-09 Thread Jenny Zhao
Hi Jagat,

Yes, I did specify MLlib in build.sbt:

name := "Spark LogisticRegression"

version := "1.0"

scalaVersion := "2.10.3"

libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "0.9.0-incubating"

libraryDependencies += "org.apache.spark" % "spark-mllib_2.10" % "0.9.0-incubating"

libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "1.2.1"

resolvers += "Akka Repository" at "http://repo.akka.io/releases/"
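
For reference, building a fat jar as Jagat suggests usually means adding the
sbt-assembly plugin and running "sbt assembly" instead of "sbt package"; a
rough sketch (the plugin version is only indicative of 2014-era releases,
and the exact settings depend on the plugin version):

// project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.2")

// build.sbt then also needs the plugin's assembly settings enabled
// (assemblySettings in the 0.x plugin line) before running: sbt assembly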



On Wed, Apr 9, 2014 at 3:23 PM, Jagat Singh jagatsi...@gmail.com wrote:

 Hi Jenny,

 How are you packaging your jar?

 Can you please confirm whether you have included the MLlib jar inside the
 fat jar you have created for your code?

 libraryDependencies += "org.apache.spark" % "spark-mllib_2.9.3" %
 "0.8.1-incubating"

 Thanks,

 Jagat Singh


 On Thu, Apr 10, 2014 at 8:05 AM, Jenny Zhao linlin200...@gmail.com wrote:


 Hi all,

 I have been able to run LR in local mode, but I am facing a problem running
 it in cluster mode. Below are the source script and the stack trace from
 running it in cluster mode. I used sbt package to build the project; I am
 not sure what it is complaining about.

 Another question I have is about LogisticRegression itself:

 1) I noticed that LogisticRegressionWithSGD doesn't ask for information
 about the input features, for instance whether a feature is scale, nominal
 or ordinal; or does MLlib only support scale features?

 2) Training error is pretty high even when the number of iterations is set
 very high; do we have numbers on the accuracy of the LR model?

 Thank you for your help!

 /**
  * Logistic regression
  */
 object SparkLogisticRegression {

   def main(args: Array[String]) {
     if (args.length != 3) {
       System.err.println("Usage: SparkLogisticRegression <master> <input file path> <number of iterations>")
       System.exit(1)
     }

     val numIterations = args(2).toInt

     val sc = new SparkContext(args(0), "SparkLogisticRegression",
       System.getenv("SPARK_HOME"),
       SparkContext.jarOfClass(this.getClass))

     // parse in the input data
     val data = sc.textFile(args(1))
     val lpoints = data.map { line =>
       val parts = line.split(',')
       LabeledPoint(parts(0).toDouble, parts.tail.map(x => x.toDouble).toArray)
     }

     // setup LR
     val model = LogisticRegressionWithSGD.train(lpoints, numIterations)

     val labelPred = lpoints.map { p =>
       val pred = model.predict(p.features)
       (p.label, pred)
     }

     val predErr = labelPred.filter(r => r._1 != r._2).count
     println("Training Error: " + predErr.toDouble / lpoints.count + " " +
       predErr + "/" + lpoints.count)
   }

 }

 14/04/09 14:50:48 WARN scheduler.TaskSetManager: Lost TID 0 (task 0.0:0)
 14/04/09 14:50:48 WARN scheduler.TaskSetManager: Loss was due to
 java.lang.ClassNotFoundException
 java.lang.ClassNotFoundException: SparkLinearRegression$$anonfun$2
 at java.lang.Class.forName(Class.java:211)
 at
 org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:37)
 at
 java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1609)
 at
 java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1514)
 at
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1768)
 at
 java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1347)
 at
 java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1988)
 at
 java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1913)
 at
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1795)
 at
 java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1347)
 at
 java.io.ObjectInputStream.readObject(ObjectInputStream.java:364)
 at
 org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
 at
 org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63)
 at
 org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139)
 at
 java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1834)
 at
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1793)
 at
 java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1347)
 at
 java.io.ObjectInputStream.readObject(ObjectInputStream.java:364)
 at
 org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
 at
 org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62)
 at
 org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:195)
 at
 org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49)
 at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:906