Re: Adding abstraction in MLlib
Hi Egor, I posted the design doc for pipelines and parameters on the JIRA. Now I'm working out some details of ML datasets, which I will post later this week. Your feedback is welcome! Best, Xiangrui

On Mon, Sep 15, 2014 at 12:44 AM, Reynold Xin r...@databricks.com wrote: Hi Egor, Thanks for the suggestion. It is definitely our intention and practice to post design docs as soon as they are ready, and to keep iteration cycles short. As a matter of fact, we encourage design docs to be posted before implementation starts on major features, and WIP pull requests before large features are fully baked. That said, no, a committer's time is not 100% on a specific ticket. There are lots of tickets that are open for a long time before somebody starts actively working on them. So no, it is not true that all this time was active development. Xiangrui should post the design doc as soon as it is ready for feedback.

On Sun, Sep 14, 2014 at 11:26 PM, Egor Pahomov pahomov.e...@gmail.com wrote: It's good that Databricks is working on this issue! However, the current process of working on it is not very clear to outsiders. The last update on this ticket was August 5. If all this time was active development, I have concerns that without feedback from the community for such a long time, development can go the wrong way. Even if it would be a great big patch, introducing the new interfaces to the community as soon as possible would allow us to start working on our pipeline code. It would allow us to write algorithms in the new paradigm instead of in the absence of any paradigm, like it was before. It would allow us to help you migrate old code to the new paradigm. My main point: shorter iterations with more transparency. I think it would be a good idea to create a pull request with the code you have so far, even if it doesn't pass tests, just so we can comment on it before it is formulated in a design doc.
2014-09-13 0:00 GMT+04:00 Patrick Wendell pwend...@gmail.com: We typically post design docs on JIRAs before major work starts. For instance, I'm pretty sure SPARK-1856 will have a design doc posted shortly.

On Fri, Sep 12, 2014 at 12:10 PM, Erik Erlandson e...@redhat.com wrote: Are interface designs being captured anywhere as documents that the community can follow along with as the proposals evolve? I've worked on other open source projects where design docs were published as living documents (e.g. on Google Docs or Etherpad, but the particular mechanism isn't crucial). FWIW, I found that to be a good way to work in a community environment.

- Original Message - Hi Egor, Thanks for the feedback! We are aware of some of the issues you mentioned and there are JIRAs created for them. Specifically, I'm pushing out the design on pipeline features and algorithm/model parameters this week. We can move our discussion to https://issues.apache.org/jira/browse/SPARK-1856 . It would be nice to write tests against interfaces, but it definitely needs more discussion before making PRs. For example, we discussed the learning interfaces in Christoph's PR (https://github.com/apache/spark/pull/2137/), but it takes time to reach a consensus, especially on interfaces. Hopefully all of us can benefit from the discussion. The best practice is to break the proposal down into small independent pieces and discuss them on the JIRA before submitting PRs. For performance tests, there is the spark-perf package (https://github.com/databricks/spark-perf), and we added performance tests for MLlib in v1.1, but definitely more work needs to be done. The dev list may not be a good place for discussion of the design; could you create JIRAs for each of the issues you pointed out, so we can track the discussion there? Thanks!
Best, Xiangrui

On Fri, Sep 12, 2014 at 10:45 AM, Reynold Xin r...@databricks.com wrote: Xiangrui can comment more, but I believe he and Joseph are actually working on standardized interfaces and the pipeline feature for the 1.2 release.

On Fri, Sep 12, 2014 at 8:20 AM, Egor Pahomov pahomov.e...@gmail.com wrote: Some architecture suggestions on this matter: https://github.com/apache/spark/pull/2371

2014-09-12 16:38 GMT+04:00 Egor Pahomov pahomov.e...@gmail.com: Sorry, I miswrote - I meant the learner part of the framework; models already exist.

2014-09-12 15:53 GMT+04:00 Christoph Sawade christoph.saw...@googlemail.com: I totally agree, and we also discovered some drawbacks with the classification model implementations that are based on GLMs: - There is no distinction between predicting scores, classes, and calibrated scores (probabilities). For these models it is common to have access to all of them, and the prediction function ``predict`` should be consistent and stateless. Currently, the score is only available after removing the threshold from the
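Christoph's point about separating raw scores, hard class predictions, and calibrated probabilities can be sketched as follows. This is an illustrative Python sketch, not the actual MLlib API; the class and method names are made up for the example:

```python
import math

# Illustrative sketch (not the real MLlib interface): a stateless binary
# classifier that exposes raw scores, calibrated probabilities, and hard
# class labels as separate, consistent methods, instead of folding a
# mutable threshold into a single predict().
class LogisticModel:
    def __init__(self, weights, threshold=0.5):
        self.weights = weights
        self.threshold = threshold  # read by predict(), never mutated

    def predict_score(self, x):
        # raw margin: w . x
        return sum(w * xi for w, xi in zip(self.weights, x))

    def predict_probability(self, x):
        # calibrated score via the logistic function
        return 1.0 / (1.0 + math.exp(-self.predict_score(x)))

    def predict(self, x):
        # hard class label, derived consistently from the probability
        return 1 if self.predict_probability(x) >= self.threshold else 0
```

The design point is that all three views of the prediction stay available and mutually consistent, so removing or changing the threshold never hides the underlying score.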
Re: Network Communication - Akka or more?
I'm not familiar with InfiniBand, but I can chime in on the Spark part. There are two kinds of communication in Spark: control plane and data plane. Task scheduling/dispatching is control, whereas fetching a block (e.g. shuffle) is data.

On Tue, Sep 16, 2014 at 4:22 PM, Trident cw...@vip.qq.com wrote: Thank you for reading this mail. I'm trying to change the underlying network connection system of Spark to support InfiniBand.

1. I doubt whether ConnectionManager and Netty are under construction. It seems that they are not usually used.

They are used for data plane communication. Broadcast and shuffle both use them.

2. How much connection payload is carried by Akka?

Akka is mainly responsible for control, i.e. dispatching tasks, reporting to the driver that a block has been put into memory, etc.

3. When running ./bin/run-example SparkPi, I noticed that the jar file is sent from server to client. It is scary because the jar is big. Is it common?

How else would you distribute the jar file if you don't send it? The workers need the bytecode for the classes you are going to execute.
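For context on point 3, shipping the application jar to the cluster is the normal submission flow; the executors get the bytecode that way. A sketch of such an invocation (the paths here are hypothetical placeholders):

```shell
# spark-submit ships the application jar (and any --jars dependencies)
# to the cluster so executors have the bytecode for the closures they run.
# Paths below are hypothetical.
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  /path/to/spark-examples.jar 100
```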
Re: Workflow Scheduler for Spark
See https://issues.apache.org/jira/browse/SPARK-3530 and this doc, referenced in that JIRA: https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o/edit?usp=sharing

On Wed, Sep 17, 2014 at 2:00 AM, Egor Pahomov pahomov.e...@gmail.com wrote: I have problems using Oozie. For example, it doesn't sustain a Spark context the way the Ooyala job server does. Apart from GUI interfaces like HUE, it's hard to work with: scoozie stopped development a year ago (I spoke with the creator), and Oozie XML is very hard to write. Oozie still has all its documentation and code in the MapReduce model rather than the YARN model, and based on its current speed of development I can't expect radical changes in the near future. There is no Databricks for Oozie that would have people on salary to develop that kind of radical change. It's a dinosaur. Reynold, can you help find this doc? Do you mean just pipelining Spark code, or additional logic such as task persistence, a job server, task retry, data availability, and so on?

2014-09-17 11:21 GMT+04:00 Reynold Xin r...@databricks.com: Hi Egor, I think the design doc for the pipeline feature has been posted. For the workflow, I believe Oozie actually works fine with Spark if you want an external workflow system. Do you have any trouble using it?

On Tue, Sep 16, 2014 at 11:45 PM, Egor Pahomov pahomov.e...@gmail.com wrote: There are two things we (Yandex) miss in Spark: good MLlib abstractions and a good workflow job scheduler. From the threads "Adding abstraction in MLlib" and "[mllib] State of Multi-Model training" I got the idea that Databricks is working on the former and we should wait for the first posted doc to guide us. What about a workflow scheduler? Is anyone already working on one? Does anyone have a plan to do it? P.S. We thought that MLlib abstractions for running multiple algorithms on the same data would need such a scheduler, which would rerun an algorithm in case of failure. I understand that Spark provides fault tolerance out of the box, but we found an Oozie-like scheduler more reliable for such long-living workflows. -- Sincerely yours, Egor Pakhomov, Scala Developer, Yandex
Re: Workflow Scheduler for Spark
That doc is about the MLlib pipeline functionality. What about an Oozie-like workflow scheduler?

2014-09-17 13:08 GMT+04:00 Mark Hamstra m...@clearstorydata.com: See https://issues.apache.org/jira/browse/SPARK-3530 and this doc, referenced in that JIRA: https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o/edit?usp=sharing [rest of the thread quoted above; trimmed] -- Sincerely yours, Egor Pakhomov, Scala Developer, Yandex
network.ConnectionManager error
Hi, When I run a Spark job on YARN and the job finishes successfully, I still find some error entries in the log file, as follows:

14/09/17 18:25:03 INFO ui.SparkUI: Stopped Spark web UI at http://sparkserver2.cn:63937
14/09/17 18:25:03 INFO scheduler.DAGScheduler: Stopping DAGScheduler
14/09/17 18:25:03 INFO cluster.YarnClusterSchedulerBackend: Shutting down all executors
14/09/17 18:25:03 INFO cluster.YarnClusterSchedulerBackend: Asking each executor to shut down
14/09/17 18:25:03 INFO network.ConnectionManager: Removing SendingConnection to ConnectionManagerId(sparkserver2.cn,9072)
14/09/17 18:25:03 INFO network.ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(sparkserver2.cn,9072)
14/09/17 18:25:03 ERROR network.ConnectionManager: Corresponding SendingConnection to ConnectionManagerId(sparkserver2.cn,9072) not found
14/09/17 18:25:03 INFO network.ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(sparkserver2.cn,14474)
14/09/17 18:25:03 INFO network.ConnectionManager: Removing SendingConnection to ConnectionManagerId(sparkserver2.cn,14474)
14/09/17 18:25:03 INFO network.ConnectionManager: Removing SendingConnection to ConnectionManagerId(sparkserver2.cn,14474)
14/09/17 18:25:04 INFO spark.MapOutputTrackerMasterActor: MapOutputTrackerActor stopped!
14/09/17 18:25:04 INFO network.ConnectionManager: Selector thread was interrupted!
14/09/17 18:25:04 INFO network.ConnectionManager: Removing SendingConnection to ConnectionManagerId(sparkserver2.cn,9072)
14/09/17 18:25:04 INFO network.ConnectionManager: Removing SendingConnection to ConnectionManagerId(sparkserver2.cn,14474)
14/09/17 18:25:04 INFO network.ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(sparkserver2.cn,9072)
14/09/17 18:25:04 ERROR network.ConnectionManager: Corresponding SendingConnection to ConnectionManagerId(sparkserver2.cn,9072) not found
14/09/17 18:25:04 INFO network.ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(sparkserver2.cn,14474)
14/09/17 18:25:04 ERROR network.ConnectionManager: Corresponding SendingConnection to ConnectionManagerId(sparkserver2.cn,14474) not found
14/09/17 18:25:04 WARN network.ConnectionManager: All connections not cleaned up
14/09/17 18:25:04 INFO network.ConnectionManager: ConnectionManager stopped
14/09/17 18:25:04 INFO storage.MemoryStore: MemoryStore cleared
14/09/17 18:25:04 INFO storage.BlockManager: BlockManager stopped
14/09/17 18:25:04 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
14/09/17 18:25:04 INFO spark.SparkContext: Successfully stopped SparkContext
14/09/17 18:25:04 INFO yarn.ApplicationMaster: Unregistering ApplicationMaster with SUCCEEDED
14/09/17 18:25:04 INFO remote.RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
14/09/17 18:25:04 INFO remote.RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
14/09/17 18:25:04 INFO impl.AMRMClientImpl: Waiting for application to be successfully unregistered.
14/09/17 18:25:04 INFO Remoting: Remoting shut down
14/09/17 18:25:04 INFO remote.RemoteActorRefProvider$RemotingTerminator: Remoting shut down.

What is the cause of these errors? My Spark version is 1.1.0 and my Hadoop version is 2.2.0. Thank you.
Re: network.ConnectionManager error
I see the same thing. A workaround is to put a Thread.sleep(5000) statement before sc.stop(). Let us know how it goes.

On Sep 17, 2014, at 3:43 AM, wyphao.2007 wyphao.2...@163.com wrote: [original message with the full logs quoted above; trimmed]
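The suggested workaround, as a minimal sketch (illustrative Python; a real Spark driver would be Scala/Java calling Thread.sleep(5000), and the delay length is just the value suggested above):

```python
import time

def shutdown(sc, delay_seconds=5.0):
    # Workaround sketch: pause briefly so in-flight connection teardown
    # can finish before the context is stopped, avoiding the
    # "Corresponding SendingConnection ... not found" errors at exit.
    time.sleep(delay_seconds)
    sc.stop()
```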
Re: [mllib] State of Multi-Model training
This sounds like a pretty major rewrite of the system. Is it going to live in a different repo during development, or will we be able to track progress in the main Spark repo? Kyle

On Tue, Sep 16, 2014 at 10:22 PM, Burak Yavuz bya...@stanford.edu wrote: Hi Kyle, Thank you for the code examples. We may be able to use some of the ideas there. I think initially the goal is to have the optimizers ready (SGD, LBFGS), and then the evaluation metrics will come next. It might take some time, however, as MLlib is going to have a significant API face-lift (e.g. https://issues.apache.org/jira/browse/SPARK-3530). Evaluation metrics will be significant in the new pipelines, and the ability to evaluate multiple models efficiently is very important. We encourage you to read through the design docs, and we would appreciate any feedback from you and the rest of the community! Best, Burak

- Original Message - From: Kyle Ellrott kellr...@soe.ucsc.edu To: Burak Yavuz bya...@stanford.edu Cc: dev@spark.apache.org Sent: Tuesday, September 16, 2014 9:41:45 PM Subject: Re: [mllib] State of Multi-Model training

I'd be interested in helping to test your code as soon as it's available. The version I wrote used a paired RDD and combined by key; it worked best with a custom partitioner that put all the samples for a model in the same partition. Running things in batched matrices would probably speed things up greatly. You probably won't need my training code, but I did write some code for calculating binary classification metrics (https://github.com/apache/spark/pull/1292/files#diff-6) and AUC (https://github.com/apache/spark/pull/1292/files#diff-5) for multiple models that you might be able to use. Kyle

On Tue, Sep 16, 2014 at 4:09 PM, Burak Yavuz bya...@stanford.edu wrote: Hi Kyle, I'm actively working on it now. It's pretty close to completion; I'm just trying to figure out bottlenecks and optimize as much as possible. As Phase 1, I implemented multi-model training for Gradient Descent. Instead of performing vector-vector operations on rows (examples) and weights, I've batched them into matrices so that we can use Level 3 BLAS to speed things up. I've also added support for sparse matrices (https://github.com/apache/spark/pull/2294), as making use of sparsity will allow you to train more models at once. Best, Burak

- Original Message - From: Kyle Ellrott kellr...@soe.ucsc.edu To: dev@spark.apache.org Sent: Tuesday, September 16, 2014 3:21:53 PM Subject: [mllib] State of Multi-Model training

I'm curious about the state of development of multi-model learning in MLlib (training sets of models during the same training session, rather than one at a time). The JIRA lists it as in progress, targeting Spark 1.2.0 (https://issues.apache.org/jira/browse/SPARK-1486), but there haven't been any notes on it in over a month. I submitted a pull request with a possible method to do this work a little over two months ago (https://github.com/apache/spark/pull/1292), but haven't received any feedback on the patch yet. Is anybody else working on multi-model training? Kyle
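Burak's batching idea — replacing per-model vector operations with one matrix operation across all models — can be sketched with NumPy (illustrative only; the actual MLlib work uses Spark's own linear algebra and BLAS bindings). For least-squares linear models, stacking the k weight vectors as columns of a matrix turns k matrix-vector products into a single Level 3 BLAS call:

```python
import numpy as np

def multi_model_gradients(X, y, W):
    """Least-squares gradients for k linear models at once.

    X: (n, d) batch of examples; y: (n,) targets;
    W: (d, k) weight matrix, one column per model.
    The single product X @ W (Level 3 BLAS) replaces k separate
    matrix-vector products, one per model.
    """
    residuals = X @ W - y[:, None]      # (n, k): error of every model on every example
    return X.T @ residuals / len(y)     # (d, k): gradient column per model

def per_model_gradients(X, y, W):
    # Equivalent one-model-at-a-time loop, for comparison.
    return np.column_stack([X.T @ (X @ W[:, j] - y) / len(y)
                            for j in range(W.shape[1])])
```

Both functions compute the same gradients; the batched version simply lets the BLAS library exploit the larger matrix-matrix multiply, which is the source of the speedup Burak describes.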
Re: network.ConnectionManager error
This is during shutdown, right? It looks OK to me, since connections are being closed. We could handle this more gracefully, but the logs look harmless.

On Wednesday, September 17, 2014, wyphao.2007 wyphao.2...@163.com wrote: [original message with the full logs quoted above; trimmed]
Re: problem with HiveContext inside Actor
- dev

Is it possible that you are constructing more than one HiveContext in a single JVM? Due to global state in the Hive code, this is not allowed. Michael

On Wed, Sep 17, 2014 at 7:21 PM, Cheng, Hao hao.ch...@intel.com wrote: Hi Du, I am not sure what you mean by “triggers the HiveContext to create a database” - do you create a subclass of HiveContext? Just be sure you call “HiveContext.sessionState” eagerly, since it will set the proper “hiveconf” into the SessionState; otherwise the Hive Driver will always get a null value when retrieving the HiveConf. Cheng Hao

From: Du Li [mailto:l...@yahoo-inc.com.INVALID] Sent: Thursday, September 18, 2014 7:51 AM To: u...@spark.apache.org; dev@spark.apache.org Subject: problem with HiveContext inside Actor

Hi, I wonder if anybody has had a similar experience or has any suggestions here. I have an Akka Actor that processes database requests via high-level messages. Inside this Actor, it creates a HiveContext object that does the actual DB work. The main thread creates the needed SparkContext and passes it to the Actor to create the HiveContext. When a message is sent to the Actor, it is processed properly, except that when the message triggers the HiveContext to create a database, it throws a NullPointerException in hive.ql.Driver.java, which suggests that its conf variable is not initialized. Ironically, it works fine if my main thread directly calls actor.hiveContext to create the database. The Spark version is 1.1.0. Thanks, Du
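Michael's constraint — at most one HiveContext per JVM because of global state in Hive — suggests sharing one eagerly-initialized context across all actors rather than constructing one inside each. A process-wide singleton sketch (illustrative Python with hypothetical names, not the real Spark API):

```python
import threading

def make_hive_context(spark_context):
    # Stand-in for the real HiveContext constructor (hypothetical):
    # eagerly initialize session state here, so worker threads never
    # observe a half-constructed context with a null conf.
    return {"sc": spark_context, "session_state": "initialized"}

_lock = threading.Lock()
_shared_context = None

def get_or_create_hive_context(spark_context):
    """Return a single shared context per process, mirroring the
    'one HiveContext per JVM' constraint."""
    global _shared_context
    with _lock:  # safe even if several actor threads race here
        if _shared_context is None:
            _shared_context = make_hive_context(spark_context)
        return _shared_context
```

With this shape, the main thread constructs the context once and every actor receives the same instance, instead of each actor building its own.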