[openstack-dev] [nova] Configure overcommit policy

2013-11-11 Thread Alexander Kuznetsov
Hi all,


While studying Hadoop performance in a virtual environment, I found an
interesting problem with Nova scheduling. An OpenStack cluster has an
overcommit policy, which allows more VMs to be placed on a compute node
than there are resources available to back them. While this might be
suitable for general types of workload, it is definitely not the case for
Hadoop clusters, which usually consume 100% of system resources.

Is there any way to tell Nova to schedule specific instances (the ones
which consume 100% of system resources) without overcommitting resources
on the compute node?


Alexander Kuznetsov.
___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [heat] [savanna] [trove] Place for software configuration

2013-11-01 Thread Alexander Kuznetsov
Jay, do you plan to add Savanna (type: Heat::Savanna) and Trove
(type: Heat::Trove) providers to the HOT DSL?


On Thu, Oct 31, 2013 at 10:33 PM, Jay Pipes jaypi...@gmail.com wrote:

 On 10/31/2013 01:51 PM, Alexander Kuznetsov wrote:

 Hi Heat, Savanna and Trove teams,

 All these projects have a common part related to software configuration
 management. To create an environment, a user should specify hardware
 parameters for VMs: choose a flavor, decide whether to use Cinder,
 configure networks for the virtual machines, and choose a topology for
 the whole deployment. The next step is linking the software parameters
 with the hardware specification. From the end user's point of view, the
 existence of three different places and three different ways (Heat HOT
 DSL, Trove clustering API and Savanna Hadoop templates) for software
 configuration is not convenient, especially if the user wants to create
 an environment simultaneously involving components from Savanna, Heat
 and Trove.

 I can suggest two approaches to overcome this situation:

 A common library in Oslo. This approach allows deep domain-specific
 customization. The user will still have three places with the same UI
 where configuration actions must be performed.

 Heat or some other component for software configuration management. This
 approach is the best for end users. In the future there may be some
 limitations on deep domain-specific customization for configuration
 management.


 I think this would be my preference.

 In other words, describe and orchestrate a Hadoop or Database setup using
 HOT templates and using Heat as the orchestration engine.

 Best,
 -jay

  Heat, Savanna and Trove teams, can you comment on these ideas? Which
 approach is the best?

 Alexander Kuznetsov.


 ___
 OpenStack-dev mailing list
 OpenStack-dev@lists.openstack.org
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [heat] [savanna] [trove] Place for software configuration

2013-11-01 Thread Alexander Kuznetsov
On Fri, Nov 1, 2013 at 12:39 AM, Clint Byrum cl...@fewbar.com wrote:

 Excerpts from Alexander Kuznetsov's message of 2013-10-31 10:51:54 -0700:
   Hi Heat, Savanna and Trove teams,
  
   All these projects have a common part related to software configuration
   management. To create an environment, a user should specify hardware
   parameters for VMs: choose a flavor, decide whether to use Cinder,
   configure networks for the virtual machines, and choose a topology for
   the whole deployment. The next step is linking the software parameters
   with the hardware specification. From the end user's point of view, the
   existence of three different places and three different ways (Heat HOT
   DSL, Trove clustering API and Savanna Hadoop templates) for software
   configuration is not convenient, especially if the user wants to create
   an environment simultaneously involving components from Savanna, Heat
   and Trove.
 

 I'm having a hard time extracting the problem statement. I _think_ that
 the problem is:

 As a user I want to tune my software for my available hardware.

 So what you're saying is, if you select a flavor that has 4GB of RAM
 for your application, you want to also tell your application that it
 can use 3GB of RAM for an in-memory cache. Likewise, if one has asked
 Trove for an 8GB flavor, they will want to tell it to use 6.5GB of RAM
 for InnoDB buffer cache.

 What you'd like to see is one general pattern to express these types
 of things?

Exactly.

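The mapping Clint describes can be sketched as a simple derivation from the flavor's RAM, which is exactly the kind of logic a shared configuration component could own. The helper and fraction values here are my illustration, not part of any proposed API:

```python
# Illustrative sketch: derive software tuning parameters from the
# hardware (flavor) definition instead of asking the user to keep the
# two in sync by hand. Fraction values are examples, not recommendations.

def derive_tuning(flavor_ram_mb, cache_fraction=0.75):
    """Size an in-memory cache as a fraction of the flavor's RAM."""
    return int(flavor_ram_mb * cache_fraction)

# 4 GB flavor -> 3 GB application cache (Clint's first example)
assert derive_tuning(4096) == 3072

# 8 GB Trove flavor -> 6.5 GB InnoDB buffer pool (his second example)
assert derive_tuning(8192, cache_fraction=0.8125) == 6656
```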

   I can suggest two approaches to overcome this situation:
  
   A common library in Oslo. This approach allows deep domain-specific
   customization. The user will still have three places with the same UI
   where configuration actions must be performed.
  
   Heat or some other component for software configuration management.
   This approach is the best for end users. In the future there may be
   some limitations on deep domain-specific customization for
   configuration management.

 Can you maybe be more concrete with your proposed solutions? The lack
 of a clear problem statement combined with these vague solutions has
 thoroughly confused me.


 Sure. I suggest creating a library or component for the standardization
of software and hardware configuration. It would contain validation logic
and parameter lists.

Currently Trove, Savanna and Heat each have a part related to hardware
configuration. For the end user, a VM description should not depend on
the component where it will be used.

Here is an example of a VM description which could be common for Savanna
and Trove:

{
    "flavor_id": 42,
    "image_id": "test",
    "volumes": [{
        # "extra" contains domain-specific parameters.
        # For instance, "aim" for Savanna could be
        # hdfs-dir or mapreduce-dir.
        # For Trove: journal-dir or db-dir.
        "extra": {
            "aim": "hdfs-dir"
        },
        "size": "10GB",
        "filesystem": "ext3"
    }, {
        "extra": {
            "aim": "mapreduce-dir"
        },
        "size": "5GB",
        "filesystem": "ext3"
    }],
    "networks": [{
        "private-network": "some-private-net-id",
        "public-network": "some-public-net-id"
    }]
}
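As an illustration of the "common library" idea, the shared validation logic for such descriptions might look like the sketch below. The field set and the `validate_vm_description` helper are my assumptions; the thread only proposes that such shared logic exist:

```python
# Hypothetical validator for the proposed common VM description.
# Field names follow the JSON example above.

REQUIRED_TOP = {"flavor_id", "image_id", "volumes", "networks"}
REQUIRED_VOLUME = {"size", "filesystem"}

def validate_vm_description(desc):
    missing = REQUIRED_TOP - desc.keys()
    if missing:
        raise ValueError("missing fields: %s" % sorted(missing))
    for vol in desc["volumes"]:
        if not REQUIRED_VOLUME <= vol.keys():
            raise ValueError("bad volume: %r" % (vol,))
        # "extra" stays opaque here; each project (Savanna, Trove)
        # interprets its own domain-specific keys such as "aim".
    return True

desc = {
    "flavor_id": 42,
    "image_id": "test",
    "volumes": [{"extra": {"aim": "hdfs-dir"},
                 "size": "10GB", "filesystem": "ext3"}],
    "networks": [{"private-network": "some-private-net-id"}],
}
assert validate_vm_description(desc)
```

The domain-specific `extra` block is the part each project would customize, while the surrounding schema and validation stay common.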


It would also be great if this library or component standardized some
software configuration parameters, such as credentials for a database or
LDAP. This would greatly simplify integration between components. For
example, if a user wants to process data from Cassandra on Hadoop, the
user should provide the database location and credentials to Hadoop. If
we have a common standard for both Trove and Savanna, this can be done
the same way in both components. An example for Cassandra could look like
this:


{
    "type": "cassandra",
    "host": "example.com",
    "port": 1234,
    "credentials": {
        "user": "test",
        "password": "123"
    }
}


These parameter names and this schema should be the same across the
different components referencing a Cassandra server.
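To show why a shared descriptor schema helps, here is a sketch of how any consumer (say, a Savanna-provisioned Hadoop job) could turn it into connection settings. The `to_connection_properties` helper and the emitted property names are my illustration, not from any real connector:

```python
# Hypothetical consumer of the standardized datastore descriptor above.
# Because the schema is fixed, the same code works no matter which
# component (Trove, Savanna, ...) produced the descriptor.

def to_connection_properties(ds):
    assert ds["type"] == "cassandra"
    return {
        "cassandra.host": "%s:%d" % (ds["host"], ds["port"]),
        "cassandra.username": ds["credentials"]["user"],
        "cassandra.password": ds["credentials"]["password"],
    }

descriptor = {
    "type": "cassandra",
    "host": "example.com",
    "port": 1234,
    "credentials": {"user": "test", "password": "123"},
}
print(to_connection_properties(descriptor)["cassandra.host"])  # example.com:1234
```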

 
 OpenStack-dev mailing list
 OpenStack-dev@lists.openstack.org
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [heat] [savanna] [trove] Place for software configuration

2013-10-31 Thread Alexander Kuznetsov
Hi Heat, Savanna and Trove teams,

All these projects have a common part related to software configuration
management. To create an environment, a user should specify hardware
parameters for VMs: choose a flavor, decide whether to use Cinder,
configure networks for the virtual machines, and choose a topology for
the whole deployment. The next step is linking the software parameters
with the hardware specification. From the end user's point of view, the
existence of three different places and three different ways (Heat HOT
DSL, Trove clustering API and Savanna Hadoop templates) for software
configuration is not convenient, especially if the user wants to create
an environment simultaneously involving components from Savanna, Heat
and Trove.

I can suggest two approaches to overcome this situation:

A common library in Oslo. This approach allows deep domain-specific
customization. The user will still have three places with the same UI
where configuration actions must be performed.

Heat or some other component for software configuration management. This
approach is the best for end users. In the future there may be some
limitations on deep domain-specific customization for configuration
management.

Heat, Savanna and Trove teams, can you comment on these ideas? Which
approach is the best?

Alexander Kuznetsov.
___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [savanna] Program name and Mission statement

2013-09-16 Thread Alexander Kuznetsov
Another variant: *Big Data Processing*. This name reflects Savanna's
nature more precisely than just Data Processing, and it is also less
confusing.


On Tue, Sep 17, 2013 at 12:11 AM, Mike Spreitzer mspre...@us.ibm.com wrote:

 data processing is surely a superset of big data.  Either, by itself,
 is way too vague.  But the wording that many people favor, which I will
 quote again, uses the vague term in a qualified way that makes it
 appropriately specific, IMHO.  Here is the wording again:

 ``To provide a simple, reliable and repeatable mechanism by which to
 deploy Hadoop and related Big Data projects, including management,
 monitoring and processing mechanisms driving further adoption of
 OpenStack.''

 I think that saying related Big Data projects after Hadoop is fairly
 clear.  OTOH, I would not mind replacing Hadoop and related Big Data
 projects with the Hadoop ecosystem.

 Regards,
 Mike

 Matthew Farrellee m...@redhat.com wrote on 09/16/2013 02:39:20 PM:

  From: Matthew Farrellee m...@redhat.com
  To: OpenStack Development Mailing List 
 openstack-dev@lists.openstack.org,
  Date: 09/16/2013 02:40 PM
  Subject: Re: [openstack-dev] [savanna] Program name and Mission statement
 
  IMHO, Big Data is even more nebulous and currently being pulled in many
  directions. Hadoop-as-a-Service may be too narrow. So, something in
  between, such as Data Processing, is a good balance.
 
  Best,
 
 
  matt
 
  On 09/13/2013 08:37 AM, Abhishek Lahiri wrote:
    IMHO data processing is too broad, it makes more sense to clarify this
   program as big data as a service or simply
 openstack-Hadoop-as-a-service.
  
   Thanks  Regards
   Abhishek Lahiri
  
    On Sep 12, 2013, at 9:13 PM, Nirmal Ranganathan rnir...@gmail.com
    wrote:
  
  
  
  
    On Wed, Sep 11, 2013 at 8:39 AM, Erik Bergenholtz
    ebergenho...@hortonworks.com wrote:
  
  
    On Sep 10, 2013, at 8:50 PM, Jon Maron jma...@hortonworks.com wrote:
  
   Openstack Big Data Platform
  
  
    On Sep 10, 2013, at 8:39 PM, David Scott
    david.sc...@cloudscaling.com wrote:
  
   I vote for 'Open Stack Data'
  
  
    On Tue, Sep 10, 2013 at 5:30 PM, Zhongyue Luo
    zhongyue@intel.com wrote:
  
   Why not OpenStack MapReduce? I think that pretty much says
   it all?
  
  
    On Wed, Sep 11, 2013 at 3:54 AM, Glen Campbell g...@glenc.io wrote:
  
   performant isn't a word. Or, if it is, it means
   having performance. I think you mean
 high-performance.
  
  
    On Tue, Sep 10, 2013 at 8:47 AM, Matthew Farrellee
    m...@redhat.com wrote:
  
   Rough cut -
  
   Program: OpenStack Data Processing
   Mission: To provide the OpenStack community with an
   open, cutting edge, performant and scalable data
   processing stack and associated management
 interfaces.
  
  
   Proposing a slightly different mission:
  
   To provide a simple, reliable and repeatable mechanism by which to
   deploy Hadoop and related Big Data projects, including management,
   monitoring and processing mechanisms driving further adoption of
   OpenStack.
  
  
  
   +1. I liked the data processing aspect as well, since EDP api directly
   relates to that, maybe a combination of both.
  
  
  
   On 09/10/2013 09:26 AM, Sergey Lukjanov wrote:
  
   It sounds too broad IMO. Looks like we need to
   define Mission Statement
   first.
  
   Sincerely yours,
   Sergey Lukjanov
   Savanna Technical Lead
   Mirantis Inc.
  
    On Sep 10, 2013, at 17:09, Alexander Kuznetsov
    akuznet...@mirantis.com wrote:
  
   My suggestion OpenStack Data Processing.
  
  
    On Tue, Sep 10, 2013 at 4:15 PM, Sergey Lukjanov
    slukja...@mirantis.com

Re: [openstack-dev] [nova] [savanna] Host information for non admin users

2013-09-13 Thread Alexander Kuznetsov
Thanks for your comments; let me explain a bit more about Hadoop topology.

In Hadoop 1.2, a 4-level topology was introduced: the whole network,
rack, node group (representing, in the simplest case, Hadoop nodes on
the same compute host) and node. Usually Hadoop has a replication factor
of 3. In this case the Hadoop placement algorithm tries to put the first
HDFS block replica on the local node or in the local node group; the
second replica should be placed outside the node group but on the same
rack, and the last replica outside the initial rack. The topology is
defined by the path to the VM, e.g.

/datacenter1/rack1/host1/vm1
/datacenter1/rack1/host1/vm2
/datacenter1/rack1/host2/vm1
/datacenter1/rack1/host2/vm2
/datacenter1/rack2/host3/vm1
/datacenter1/rack2/host3/vm2

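Hadoop learns such paths through a user-supplied topology script (configured via `topology.script.file.name` in Hadoop 1.x): the framework passes node addresses in and reads one network path back per node. A minimal sketch is below; the node names and the lookup table are the illustrative layout from the path list above, whereas a real script would derive the mapping from the infrastructure, which is exactly the information this thread asks Nova to expose:

```python
# Sketch of the lookup a Hadoop 1.x topology script performs. The table
# maps a node's address to its network location (datacenter/rack/node
# group); the node-group level here is the compute host. Names are
# hypothetical.

TOPOLOGY = {
    "vm1-host1": "/datacenter1/rack1/host1",
    "vm2-host1": "/datacenter1/rack1/host1",
    "vm1-host2": "/datacenter1/rack1/host2",
    "vm1-host3": "/datacenter1/rack2/host3",
}

def resolve(node):
    # Unknown nodes fall into a default rack, mirroring Hadoop's
    # own convention.
    return TOPOLOGY.get(node, "/default-rack")

# vm1-host1 and vm2-host1 share a node group, vm1-host2 is on the same
# rack but another host, and vm1-host3 is off-rack -- the three tiers the
# replica placement policy distinguishes.
print(resolve("vm1-host1"))  # /datacenter1/rack1/host1
print(resolve("vm1-host3"))  # /datacenter1/rack2/host3
```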

This information is also used for job routing, to place the mappers as
close as possible to the data.


The main idea is to provide this information to Hadoop. Usually there is
a direct mapping between the physical data center structure and Hadoop
node placement, but in the public cloud case some abstract names are
fine, as long as the configuration reflects the proximity information
for the Hadoop nodes.


Mike, as I understand it, a holistic scheduler can provide the needed
information. Can you give more details about it?


On Fri, Sep 13, 2013 at 11:54 AM, John Garbutt j...@johngarbutt.com wrote:

 Exposing the detailed info in private cloud, sure makes sense. For
 public clouds, not so sure. Would be nice to find something that works
 for both.

 We let the user express their intent through the instance groups api.
 The scheduler will then do a best effort to meet that criteria, using
 its private information. At a coarser grain, we have availability
 zones, that you could use to express closeness, and probably often
 give you a good measure of closeness anyway.

 So a Hadoop user could request a several small groups of VMs defined
 in instance groups to be close, and maybe spread across different
 availability zones.

 Would that do the trick? Or does Hadoop/HDFS need a bit more
 granularity than that? Could it look to auto-detect closeness in
 some auto-setup phase, given rough user hints?

 John

 On 13 September 2013 07:40, Alex Glikson glik...@il.ibm.com wrote:
  If I understand correctly, what really matters at least in case of
 Hadoop is
  network proximity between instances.
  Hence, maybe Neutron would be a better fit to provide such information.
 In
  particular, depending on virtual network configuration, having 2
 instances
  on the same node does not guarantee that the network traffic between them
  will be routed within the node.
  Physical layout could be useful for availability-related purposes. But
 even
  then, it should be abstracted in such a way that it will not reveal
 details
  that a cloud provider will typically prefer not to expose. Maybe this
 can be
  done by Ironic -- or a separate/new project (Tuskar sounds related).
 
  Regards,
  Alex
 
 
 
 
  From:Mike Spreitzer mspre...@us.ibm.com
  To:OpenStack Development Mailing List
  openstack-dev@lists.openstack.org,
  Date:13/09/2013 08:54 AM
  Subject:Re: [openstack-dev] [nova] [savanna] Host information for
  nonadminusers
  
 
 
 
  From: Nirmal Ranganathan rnir...@gmail.com
  ...
  Well that's left upto the specific block placement policies in hdfs,
  all we are providing with the topology information is a hint on
  node/rack placement.
 
  Oh, you are looking at the placement of HDFS blocks within the fixed
 storage
  volumes, not choosing where to put the storage volumes.  In that case I
  understand and agree that simply providing identifiers from the
  infrastructure to the middleware (HDFS) will suffice.  Coincidentally my
  group is working on this very example right now in our own environment.
  We
  have a holistic scheduler that is given a whole template to place, and it
  returns placement information.  We imagine, as does Hadoop, a general
  hierarchy in the physical layout, and the holistic scheduler returns, for
  each VM, the path from the root to the VM's host.
 
  Regards,
 
  Mike___
  OpenStack-dev mailing list
  OpenStack-dev@lists.openstack.org
  http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
 
 
  ___
  OpenStack-dev mailing list
  OpenStack-dev@lists.openstack.org
  http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
 

 ___
 OpenStack-dev mailing list
 OpenStack-dev@lists.openstack.org
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] TC Meeting / Savanna Incubation Follow-Up

2013-09-13 Thread Alexander Kuznetsov
On Thu, Sep 12, 2013 at 7:30 PM, Michael Basnight mbasni...@gmail.com wrote:

 On Sep 12, 2013, at 2:39 AM, Thierry Carrez wrote:

  Sergey Lukjanov wrote:
 
  [...]
  As you can see, resources provisioning is just one of the features and
 the implementation details are not critical for overall architecture. It
 performs only the first step of the cluster setup. We’ve been considering
 Heat for a while, but ended up direct API calls in favor of speed and
 simplicity. Going forward Heat integration will be done by implementing
 extension mechanism [3] and [4] as part of Icehouse release.
 
  The next part, Hadoop cluster configuration, already extensible and we
 have several plugins - Vanilla, Hortonworks Data Platform and Cloudera
 plugin started too. This allow to unify management of different Hadoop
 distributions under single control plane. The plugins are responsible for
 correct Hadoop ecosystem configuration at already provisioned resources and
 use different Hadoop management tools like Ambari to setup and configure
 all cluster  services, so, there are no actual provisioning configs on
 Savanna side in this case. Savanna and its plugins encapsulate the
 knowledge of Hadoop internals and default configuration for Hadoop services.
 
  My main gripe with Savanna is that it combines (in its upcoming release)
  what sounds like to me two very different services: Hadoop cluster
  provisioning service (like what Trove does for databases) and a
  MapReduce+ data API service (like what Marconi does for queues).
 
  Making it part of the same project (rather than two separate projects,
  potentially sharing the same program) make discussions about shifting
  some of its clustering ability to another library/project more complex
  than they should be (see below).
 
  Could you explain the benefit of having them within the same service,
  rather than two services with one consuming the other ?

 And for the record, i dont think that Trove is the perfect fit for it
 today. We are still working on a clustering API. But when we create it, i
 would love the Savanna team's input, so we can try to make a pluggable API
 thats usable for people who want MySQL or Cassandra or even Hadoop. Im less
 a fan of a clustering library, because in the end, we will both have API
 calls like POST /clusters, GET /clusters, and there will be API duplication
 between the projects.

 I think that a Cluster API (if it were created) would be helpful not only
for Trove and Savanna. NoSQL, RDBMS and Hadoop are not the only software
which can be clustered. What about different kinds of messaging solutions
like RabbitMQ and ActiveMQ, or J2EE containers like JBoss, WebLogic and
WebSphere, which are often installed in clustered mode? Messaging,
databases, J2EE containers and Hadoop each have their own management
cycle. It would be confusing to make a Cluster API part of Trove, which
has a different mission - database management and provisioning.

 
  The next topic is “Cluster API”.
 
  The concern that was raised is how to extract general clustering
 functionality to the common library. Cluster provisioning and management
 topic currently relevant for a number of projects within OpenStack
 ecosystem: Savanna, Trove, TripleO, Heat, Taskflow.
 
  Still each of the projects has their own understanding of what the
 cluster provisioning is. The idea of extracting common functionality sounds
 reasonable, but details still need to be worked out.
 
  I’ll try to highlight Savanna team current perspective on this
 question. Notion of “Cluster management” in my perspective has several
 levels:
  1. Resources provisioning and configuration (like instances, networks,
 storages). Heat is the main tool with possibly additional support from
 underlying services. For example, instance grouping API extension [5] in
 Nova would be very useful.
  2. Distributed communication/task execution. There is a project in
 OpenStack ecosystem with the mission to provide a framework for distributed
 task execution - TaskFlow [6]. It’s been started quite recently. In Savanna
 we are really looking forward to use more and more of its functionality in
 I and J cycles as TaskFlow itself getting more mature.
  3. Higher level clustering - management of the actual services working
 on top of the infrastructure. For example, in Savanna configuring HDFS data
 nodes or in Trove setting up a MySQL cluster with Percona or Galera. These
 operations are typically very specific to the project domain. As for Savanna
 specifically, we use lots of benefits of Hadoop internals knowledge to
 deploy and configure it properly.
 
   The overall conclusion seems to be that it makes sense to enhance Heat
  capabilities and invest in TaskFlow development, leaving domain-specific
  operations to the individual projects.
 
  The thing we'd need to clarify (and the incubation period would be used
  to achieve that) is how to reuse as much as possible between the various
  cluster provisioning projects (Trove, the 

Re: [openstack-dev] TC Meeting / Savanna Incubation Follow-Up

2013-09-13 Thread Alexander Kuznetsov
The Hadoop ecosystem is not only datastore technologies. Hadoop has other
components: the MapReduce framework, a distributed coordinator (ZooKeeper),
workflow management (Oozie), runtimes for scripting languages (Hive and
Pig), and a scalable machine learning library (Apache Mahout). All these
components are tightly coupled, and the datastore part can't be considered
separately from the other components. This is the main reason why Hadoop
installation and management require a separate solution, distinct from a
generic enough™ datastore API. Otherwise, this API would contain a huge
part unrelated to datastore technologies.


On Fri, Sep 13, 2013 at 8:17 PM, Michael Basnight mbasni...@gmail.com wrote:


 On Sep 13, 2013, at 9:05 AM, Alexander Kuznetsov wrote:

 
 
 
  On Fri, Sep 13, 2013 at 7:26 PM, Michael Basnight mbasni...@gmail.com
 wrote:
  On Sep 13, 2013, at 6:56 AM, Alexander Kuznetsov wrote:
   On Thu, Sep 12, 2013 at 7:30 PM, Michael Basnight mbasni...@gmail.com
 wrote:
   On Sep 12, 2013, at 2:39 AM, Thierry Carrez wrote:
  
Sergey Lukjanov wrote:
   
[...]
As you can see, resources provisioning is just one of the features
 and the implementation details are not critical for overall architecture.
 It performs only the first step of the cluster setup. We’ve been
 considering Heat for a while, but ended up direct API calls in favor of
 speed and simplicity. Going forward Heat integration will be done by
 implementing extension mechanism [3] and [4] as part of Icehouse release.
   
The next part, Hadoop cluster configuration, already extensible and
 we have several plugins - Vanilla, Hortonworks Data Platform and Cloudera
 plugin started too. This allow to unify management of different Hadoop
 distributions under single control plane. The plugins are responsible for
 correct Hadoop ecosystem configuration at already provisioned resources and
 use different Hadoop management tools like Ambari to setup and configure
 all cluster  services, so, there are no actual provisioning configs on
 Savanna side in this case. Savanna and its plugins encapsulate the
 knowledge of Hadoop internals and default configuration for Hadoop services.
   
My main gripe with Savanna is that it combines (in its upcoming
 release)
what sounds like to me two very different services: Hadoop cluster
provisioning service (like what Trove does for databases) and a
MapReduce+ data API service (like what Marconi does for queues).
   
Making it part of the same project (rather than two separate
 projects,
potentially sharing the same program) make discussions about shifting
some of its clustering ability to another library/project more
 complex
than they should be (see below).
   
Could you explain the benefit of having them within the same service,
rather than two services with one consuming the other ?
  
   And for the record, i dont think that Trove is the perfect fit for it
 today. We are still working on a clustering API. But when we create it, i
 would love the Savanna team's input, so we can try to make a pluggable API
 thats usable for people who want MySQL or Cassandra or even Hadoop. Im less
 a fan of a clustering library, because in the end, we will both have API
 calls like POST /clusters, GET /clusters, and there will be API duplication
 between the projects.
  
    I think that a Cluster API (if it were created) would be helpful not
  only for Trove and Savanna. NoSQL, RDBMS and Hadoop are not the only
  software which can be clustered. What about different kinds of messaging
  solutions like RabbitMQ and ActiveMQ, or J2EE containers like JBoss,
  WebLogic and WebSphere, which are often installed in clustered mode?
  Messaging, databases, J2EE containers and Hadoop each have their own
  management cycle. It would be confusing to make a Cluster API part of
  Trove, which has a different mission - database management and
  provisioning.
 
  Are you suggesting a 3rd program, cluster as a service? Trove is trying
 to target a generic enough™ API to tackle different technologies with
 plugins or some sort of extensions. This will include a scheduler to
 determine rack awareness. Even if we decide that both Savanna and Trove
 need their own API for building clusters, I still want to understand what
 makes the Savanna API and implementation different, and how Trove can build
 an API/system that can encompass multiple datastore technologies. So
 regardless of how this shakes out, I would urge you to go to the Trove
 clustering summit session [1] so we can share ideas.
 
   A generic enough™ API shouldn't contain database-specific calls like
  backup and restore (already in Trove). Why would we need backup and
  restore operations for J2EE or messaging solutions?

 I dont mean to encompass J2EE or messaging solutions. Let me amend my
 email to say to tackle different datastore technologies. But going with
 this point… Do you not need to backup things in a J2EE container? Id assume
 a backup

[openstack-dev] [nova] [savanna] Host information for non admin users

2013-09-12 Thread Alexander Kuznetsov
Hi folks,

Currently Nova doesn't provide information about the host of a virtual
machine to non-admin users. Is it possible to change this situation? This
information is needed in the Hadoop deployment case, because Hadoop is
now aware of the virtual environment, and this knowledge helps Hadoop
achieve better performance and robustness.

Alexander Kuznetsov.
___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [savanna] Program name and Mission statement

2013-09-10 Thread Alexander Kuznetsov
My suggestion OpenStack Data Processing.


On Tue, Sep 10, 2013 at 4:15 PM, Sergey Lukjanov slukja...@mirantis.com wrote:

 Hi folks,

 due to the Incubator Application we should prepare Program name and
 Mission statement for Savanna, so, I want to start mailing thread about it.

 Please, provide any ideas here.

 P.S. List of existing programs: https://wiki.openstack.org/wiki/Programs
 P.P.S. https://wiki.openstack.org/wiki/Governance/NewPrograms

 Sincerely yours,
 Sergey Lukjanov
 Savanna Technical Lead
 Mirantis Inc.


 ___
 OpenStack-dev mailing list
 OpenStack-dev@lists.openstack.org
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Savanna-all] Savanna EDP sequence diagrams added for discussion...

2013-07-22 Thread Alexander Kuznetsov
I updated the REST API draft -
https://etherpad.openstack.org/savanna_API_draft_EDP_extensions. New
methods related to job source and data discovery components were added;
also the job object was updated.


On Fri, Jul 19, 2013 at 12:26 AM, Trevor McKay tmc...@redhat.com wrote:

 fyi, updates to the diagram based on feedback

 On Thu, 2013-07-18 at 13:49 -0400, Trevor McKay wrote:
  Hi all,
 
Here is a page to hold sequence diagrams for Savanna EDP,
  based on current launchpad blueprints.  We thought it might be helpful to
  create some diagrams for discussion as the component specs are written
 and the
  API is worked out:
 
https://wiki.openstack.org/wiki/Savanna/EDP_Sequences
 
(The main page for EDP is here
 https://wiki.openstack.org/wiki/Savanna/EDP )
 
There is an initial sequence there, along with a link to the source
  for generating the PNG with PlantUML.  Feedback would be great, either
  through IRC, email, comments on the wiki, or by modifying
  the sequence and/or posting additional sequences.
 
The sequences can be generated/modified easily with with Plantuml which
  installs as a single jar file:
 
http://plantuml.sourceforge.net/download.html
 
java -jar plantuml.jar
 
Choose the directory which contains plantuml text files and it will
  monitor, generate, and update PNGs as you save/modify text files. I
 thought
  it was broken the first time I ran it because there are no controls :)
  Very simple.
 
  Best,
 
  Trevor
 
 



 --
 Mailing list: https://launchpad.net/~savanna-all
 Post to : savanna-...@lists.launchpad.net
 Unsubscribe : https://launchpad.net/~savanna-all
 More help   : https://help.launchpad.net/ListHelp

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] savanna version 0.3 - added UI mockups for EDP workflow

2013-07-12 Thread Alexander Kuznetsov
On the parameters tab, we see the case for the Hadoop streaming API.
Could you please add more examples for the parameters tab, including
cases for Hadoop jar, Pig and Hive scripts?

Thanks,
Alexander Kuznetsov.


On Fri, Jul 12, 2013 at 7:14 PM, Chad Roberts crobe...@redhat.com wrote:

 I have added some initial UI mockups for version 0.3.
 Any comments are appreciated.

 https://wiki.openstack.org/wiki/Savanna/UIMockups/JobCreation

 Thanks,
 Chad



[openstack-dev] [Savanna-all] Blueprints for EDP components

2013-07-11 Thread Alexander Kuznetsov
Hi,

Blueprints for the EDP components have been added on Launchpad:

https://blueprints.launchpad.net/savanna/+spec/job-manager-components
https://blueprints.launchpad.net/savanna/+spec/data-discovery-component
https://blueprints.launchpad.net/savanna/+spec/job-source-component
https://blueprints.launchpad.net/savanna/+spec/methods-for-plugin-api-to-support-edp

Each blueprint contains a short component description, along with the object
model and methods that will be implemented in that component.

Your comments and suggestions are welcome.

Thanks,
Alexander Kuznetsov.


[openstack-dev] Savanna version 0.3 - on demand Hadoop task execution

2013-07-02 Thread Alexander Kuznetsov
We would like to initiate a discussion about the Elastic Data Processing (EDP)
component of Savanna. This functionality is planned for the next development
phase, starting on July 15. The main questions to address:

   - What kind of functionality should be implemented for EDP?
   - What are the main components and their responsibilities?
   - Which existing tools, such as Hue or Oozie, should be used?


To get the discussion started, we have prepared an overview of our thoughts in
the following document: https://wiki.openstack.org/wiki/Savanna/EDP. For your
convenience, the text is reproduced below. Your comments and suggestions are
welcome.

Key Features

Starting the job:

   - Simple REST API and UI
   - TODO: mockups
   - Job can be entered through the UI/API or pulled from a VCS
   - Configurable data source

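To make the "simple REST API" idea concrete, a job submission could look roughly like the following sketch. Everything here is a hypothetical assumption: the endpoint URL and port, the payload fields, and the job types are placeholders, since the EDP API has not been defined yet.

```python
import json
import urllib.request

# Hypothetical job definition; field names are illustrative only.
payload = {
    "name": "wordcount",
    "type": "pig",                              # jar | pig | hive
    "script": "swift://scripts/wordcount.pig",  # or pulled from a VCS
    "input": "swift://data/input",
    "output": "swift://data/output",
}

# Hypothetical endpoint; no such API exists at the time of writing.
req = urllib.request.Request(
    "http://savanna.example:8386/v1.0/jobs",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# urllib.request.urlopen(req)  # would submit the job once the API exists
```

The same payload shape would back the UI form, so both entry points share one validation path.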

Job execution modes:

   - Run the job on one of the existing clusters
      - Expose information on cluster load
      - Provide hints for optimizing data locality (TODO: more details)
   - Create a new transient cluster for the job


Job structure:

   - Individual job via a jar file, Pig or Hive script
   - Oozie workflow
      - In the future, support import of EMR job flows


Job execution tracking and monitoring:

   - Are there any existing components that can help with visualization?
     (e.g. Twitter Ambrose: https://github.com/twitter/ambrose)
   - Terminate job
   - Auto-scaling functionality


Main EDP Components

Data Discovery Component

EDP can have several sources of data for processing. Data can be pulled from
Swift, GlusterFS, or a NoSQL database like Cassandra or HBase. To provide
unified access to this data, we'll introduce a component responsible for
discovering the data location and providing the right configuration for the
Hadoop cluster. It should have a pluggable architecture.
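A pluggable design along those lines could be sketched as follows. The class names, URL schemes, and configuration keys here are illustrative assumptions, not an existing Savanna API: each plugin maps one data-source scheme to the Hadoop configuration needed to reach it.

```python
from abc import ABC, abstractmethod
from urllib.parse import urlparse


class DataSourcePlugin(ABC):
    """Hypothetical interface for one data discovery plugin."""

    scheme = None  # URL scheme this plugin handles, e.g. "swift"

    @abstractmethod
    def hadoop_config(self, url):
        """Return the Hadoop config options needed to reach the data."""


class SwiftPlugin(DataSourcePlugin):
    scheme = "swift"

    def hadoop_config(self, url):
        # Illustrative key only; real Swift support would set fs.swift.* options.
        return {"fs.swift.auth.url": "http://keystone.example:5000/v2.0"}


class HdfsPlugin(DataSourcePlugin):
    scheme = "hdfs"

    def hadoop_config(self, url):
        return {"fs.defaultFS": "hdfs://" + urlparse(url).netloc}


# Registry built from the available plugin classes.
_PLUGINS = {cls.scheme: cls() for cls in (SwiftPlugin, HdfsPlugin)}


def discover(url):
    """Pick the plugin registered for the URL's scheme."""
    return _PLUGINS[urlparse(url).scheme].hadoop_config(url)
```

Adding support for Cassandra or HBase would then be a matter of registering one more plugin class, without touching the rest of EDP.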
Job Source

Users would like to execute different types of jobs: jar files, Pig and Hive
scripts, Oozie job flows, etc. Job descriptions and source code can be
supplied in different ways. Some users just want to paste in a Hive script and
run it. Other users want to save a script in Savanna's internal database for
later use. We also need to provide the ability to run a job from source code
stored in a VCS.
Savanna Dispatcher Component

This component is responsible for provisioning new clusters, scheduling jobs
on new or existing clusters, resizing clusters, and gathering information
from clusters about current jobs and utilization. It should also provide the
information needed to make the right decision about where to schedule a job:
create a new cluster or use an existing one. For example, the current load on
each cluster, its proximity to the data location, etc.
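The hinting logic could start as simple as the sketch below; the cluster attributes, the load threshold, and the hint format are all made-up placeholders for discussion, not a proposed interface.

```python
def scheduling_hint(clusters, job, load_threshold=0.7):
    """Suggest an existing cluster if one is lightly loaded and in the same
    region as the job's data; otherwise suggest a transient cluster."""
    candidates = [
        c for c in clusters
        if c["load"] < load_threshold and c["region"] == job["data_region"]
    ]
    if candidates:
        # Prefer the least loaded of the acceptable clusters.
        best = min(candidates, key=lambda c: c["load"])
        return {"action": "use_existing", "cluster": best["name"]}
    return {"action": "create_transient"}
```

Real proximity would of course need more than a region string (rack awareness, network distance to Swift, etc.), but the decision shape stays the same.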
UI Component

Integration into the OpenStack Dashboard (Horizon). It should provide tools
for job creation, monitoring, etc.

Cloudera Hue already provides part of this functionality: submitting jobs (jar
file, Hive, Pig, Impala) and viewing job status and output.
Cluster Level Coordination Component

Exposes information about jobs on a specific cluster. Possibly this component
could be provided by the existing Hadoop projects Hue and Oozie.
User Workflow

- User selects or creates a job to run

- User chooses a data source of the appropriate type for the job

- Dispatcher provides hints to the user about the best way to schedule the
job (on an existing cluster or by creating a new one)

- User makes a decision based on the hints from the dispatcher

- Dispatcher (if needed) creates a new cluster or resizes an existing one and
schedules the job to it

- Dispatcher periodically pulls the status of the job and shows it in the UI
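The periodic status pull in the last step might look roughly like this sketch. The status names and the injected get_status callable are assumptions; an Oozie- or Hue-backed implementation would map its own states onto them.

```python
import time


def poll_job_status(get_status, job_id, interval=10.0, timeout=600.0):
    """Poll a job until it reaches a terminal state or the timeout expires.

    get_status is injected so the dispatcher can query Oozie, Hue, or any
    other backend without this loop knowing the details.
    """
    terminal = {"SUCCEEDED", "FAILED", "KILLED"}
    deadline = time.monotonic() + timeout
    while True:
        status = get_status(job_id)
        if status in terminal:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("job %s still %s after %ss"
                               % (job_id, status, timeout))
        time.sleep(interval)
```

In practice the dispatcher would run this per job and push each status change to the UI rather than blocking a request thread on it.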

Thanks,

Alexander Kuznetsov