Re: feedback about juju after using it for a few months

2014-12-18 Thread Marco Ceppi
On Thu Dec 18 2014 at 1:00:46 AM John Meinel j...@arbash-meinel.com wrote:

 ...


 9. If you want to cancel a deployment that just started you need to keep
 running remove-service forever. Juju will simply ignore you if it's still
 running some special bits of the charm or if you have previously asked it
 to cancel the deployment while it is still setting up. No errors, no other
 messages are printed. You need to actually open its log to see that it's
 still stuck in a long apt-get installation and you have to wait until the
 right moment to remove-service again. And if your connection is slow, that
 takes time, you'll have to babysit Juju here because it doesn't really
 control its services as I imagined. Somehow apt-get gets what it wants :-)


 You can now force-kill a machine. So you can run `juju destroy-service
 $service` then `juju terminate-machine --force #machine_number`. Just make
 sure that nothing else exists on that machine! I'll raise an issue for
 having a way to add a --force flag to destroying a service so you can just
 say kill this with fire, now plz


 I understand that, but I discovered it's way faster and less typing if I
 simply destroy-environment and bootstrap it again. If you need to force
 kill something every time you need to kill it, then perhaps something is
 wrong?


 I agree, something is wrong with the UX here. We need to (and would love
 your feedback) figure out what should happen here. The idea is, if a
 service experiences a hook failure, all events are halted, including the
 destroy event. So the service is marked as dying but it can't die until the
 error is resolved. There are cases, during unit termination, where you
 may wish to inspect an error. I think adding a `--force` flag to destroy
 service would satisfy what you've outlined, where --force will ignore hook
 errors during the destruction of a service.

 Thanks,
 Marco Ceppi


 IIRC, the reason we support juju destroy-machine --force but not juju
 destroy-unit --force is that in the former case, because the machine is
 no more, Juju has ensured that cleanup of resources really has happened.
 (There are no more machines running that have software running you don't
 want.)
 The difficulty with juju destroy-unit --force is that it doesn't
 necessarily kill the machine, and thus an unclean teardown could easily
 leave the original services running (consider collocated deployments).
 juju destroy-service --force falls into a similar issue, only a bit more
 so since some units may be on shared machines and some may be all by
 themselves.


Right, and I agree. This isn't the best thing for --force at a service or
unit level. What I would like to see instead is the scenario where I've just
deployed a service with three units and decide "I don't want it anymore" or
"this was a mistake". In that case destroy-service --force would execute the
destruction of the service and put juju into a state where, whenever a hook
errors (or is already in an error state), it auto-resolves that and continues
with service destruction. Then the machine can just be reaped with the
upcoming machine reaper work and everything moves forward.
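
In the meantime, here is a rough sketch of the destroy-service plus
terminate-machine --force workaround mentioned above (purely illustrative,
and only safe when nothing else shares the service's machines):

  #!/bin/bash
  # Illustrative workaround for the missing `destroy-service --force`:
  # mark the service as dying, then force-terminate the machines its units
  # were on. Only safe if nothing else is deployed to those machines.
  set -eu

  SERVICE=${1:?usage: $0 <service-name>}

  # Machines hosting this service's units, read from the "machine:" fields
  # of the unit entries in juju status.
  machines=$(juju status "$SERVICE" --format=yaml \
             | awk '$1 == "machine:" {print $2}' | tr -d '"' | sort -u)

  juju destroy-service "$SERVICE"

  for m in $machines; do
    # --force skips the units' hooks entirely, so a stuck apt-get or a
    # failed hook no longer blocks teardown.
    juju terminate-machine --force "$m"
  done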

That said, I feel like we're doing a little throwing the baby out with the
 bathwater. If you are in a situation where there is just one unit on each
 machine, then destroy-unit --force could be equivalent to destroy-machine
 --force, and that could chain up into destroy-service --force (if all units
 of service are the only thing on their machines, then tear them all down
 ignoring errors and stop the machines).

 John
 =:-



feedback about juju after using it for a few months

2014-12-17 Thread Caio Begotti
Folks, I just wanted to share my experience with Juju during the last few
months using it for real at work. I know it's pretty long but stay with me
as I wanted to see if some of these points are bugs, design decisions or if
we could simply talk about them :-)

General:

1. Seems that if you happen to have more than... say, 30 machines, Juju
starts behaving weirdly until you remove unused machines. One of the weird
things is that new deploys all stay stuck with a pending status. That
happened at least 4 times, so now I always destroy-environment when testing
things just in case. Has anyone else seen this behaviour? Can this be because
of LXC with Juju local? I do a lot of Juju testing, so it's not unusual for me
to have a couple hundred machines after a month, by the way.

2. It's not reliable to use Juju on laptops, which I can understand, of
course, but just in case... if the system is suspended Juju will not recover
itself like the rest of the system services. It loses its connection to
its API apparently? Hooks fail too (resuming always seems to call
hooks/config-changed)? Is this just with me?

3. The docs recommend writing charms in Python versus shell script.
Shell-script charms are subpar enough compared to Python that I'd recommend
saying they are not officially supported. It's quite common to have race
conditions in charms written in shell script. You have to keep polling the
status of things because if you just call deploys and set relations in a
row they will fail, because Juju won't queue the commands in a logical
sequence, it'll just run them dumbly and developers are left in the wild to
control it. I'm assuming a Python charm does not have this problem at all?

4. It's not very clear to me how many times hooks/config-changed runs; I'd
just guess many :-) so you have to pay attention to it and write extra
checks to avoid multiple harmful runs of this hook. I'd say the sequence
and number of hooks called by a new deploy is not very clear based on the
documentation because of this. Hmm, perhaps I could print-debug it and count
the hits...
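
A minimal way to actually count them, assuming a shell hook (illustrative
only):

  #!/bin/bash
  # hooks/config-changed -- illustrative: count how often this hook fires.
  set -eu

  COUNT_LOG="$CHARM_DIR/hook-invocations.log"
  echo "$(date --iso-8601=seconds) $(basename "$0")" >> "$COUNT_LOG"
  juju-log "$(basename "$0") has run $(wc -l < "$COUNT_LOG") time(s) on this unit"

  # ... the real hook logic goes below; it has to tolerate running any
  # number of times ...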

5. Juju should queue multiple deployments in order not to hurt performance,
both of disk and network IO. More than 3 deployments in parallel on my
machine makes it all really slow. I just leave Juju for a while and go get
some coffee because the system goes crazy. Or I have to break up the
deployments manually, while Juju could have just queued it all and the CLI
could simply display it as queued instead. I know it would need to analyse the
machine's hardware to guess a number different from 3 but think about it if
your deployments have about 10 different services... things that take 20
minutes can easily take over 1 hour.

6. There is no way to know if a relation exists and if it's active or not,
so you need to write dummy conditionals in your hooks to work around that.
IMHO it's hackish to check variables that are only non-empty during a
relation because they will vanish anyway. A command to list the currently
set relations would be awesome to have, both inside the hooks and in the
CLI. Perhaps charmhelpers.core.services.helpers.RelationContext could be
used for this but I'm not totally sure as you only get the relation data
and you need to know the relation name in advance anyway, right?
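
For reference, the relation-ids and relation-list hook tools get part of the
way there from inside a hook; this still assumes you know the relation name
in advance, but it does answer "is it there and who is on it". A minimal
sketch, assuming a relation named db (service name illustrative):

  #!/bin/bash
  # Illustrative check, runnable inside any hook: is the "db" relation
  # established, and which remote units are currently on it?
  set -eu

  for rid in $(relation-ids db); do
    units=$(relation-list -r "$rid")
    if [ -n "$units" ]; then
      juju-log "db relation $rid is active with: $units"
    else
      juju-log "db relation $rid exists but has no remote units yet"
    fi
  done

  # From outside a hook the same tools can be reached with, for example:
  #   juju run --unit myservice/0 'relation-ids db'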

7. When a hook fails (most usually during relations being set) I have to
manually run resolved unit/0 multiple times. It's not enough to call it
once and wait for Juju to get it straight. I have to babysit the unit and
keep running resolved unit/0, while I imagined this should be automatic
because I wanted it resolved for real anyway. If the failed hook was the
first in a chain, you'll have to re-run this for every other hook in the
sequence. Once for a relation, another for config-changed, then perhaps
another for the stop hook and another one for start hook, depending on your
setup.
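
Until that is automatic, the babysitting can at least be scripted; a
throwaway sketch, assuming your juju version's status command accepts a unit
name as a filter (unit name illustrative):

  #!/bin/bash
  # Illustrative babysitter: keep retrying the failed hook on a unit until
  # it leaves the error state. Not a substitute for fixing the hook itself.
  UNIT=${1:?usage: $0 <unit, e.g. myservice/0>}

  while juju status "$UNIT" --format=yaml | grep -q 'agent-state: error'; do
    # --retry re-runs the failed hook instead of just marking it resolved.
    juju resolved --retry "$UNIT" || true
    sleep 30
  done
  echo "$UNIT is no longer in an error state"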

8. Do we have to monitor and wait for a relation variable to be set? I've
noticed that sometimes I want to get its value right away in the relation
hook but it's not assigned yet by the other service. So I'm finding myself
adding sleep commands when it happens, and that's quite hackish I think?
IMHO the command to get a variable from a relation should be blocking until
a value is returned so the charm doesn't have any timing issues. I see that
happening with rabbitmq-server's charm all the time, for instance.
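
The usual alternative to sleeping is to exit the hook early and let the next
relation-changed invocation pick the value up once the remote side has set
it; a minimal sketch, assuming a db relation with a host key:

  #!/bin/bash
  # hooks/db-relation-changed -- illustrative: instead of sleeping while
  # waiting for the remote unit, exit cleanly and let Juju re-invoke this
  # hook when the relation data actually changes.
  set -eu

  host=$(relation-get host)
  if [ -z "$host" ]; then
    juju-log "db host not set yet; deferring until the next relation-changed"
    exit 0
  fi

  juju-log "configuring the service against db host $host"
  # ... write config, restart the service, etc. ...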

9. If you want to cancel a deployment that just started you need to keep
running remove-service forever. Juju will simply ignore you if it's still
running some special bits of the charm or if you have previously asked it
to cancel the deployment while it is still setting up. No errors, no other
messages are printed. You need to actually open its log to see that it's
still stuck in a long apt-get installation and you have to wait until the
right moment to remove-service again. And if your connection is slow, that
takes time, you'll have to babysit Juju here because it doesn't really
control its 

Re: feedback about juju after using it for a few months

2014-12-17 Thread Richard Harding
On Wed, 17 Dec 2014, Caio Begotti wrote:

 Folks, I just wanted to share my experience with Juju during the last few
 months using it for real at work. I know it's pretty long but stay with me
 as I wanted to see if some of these points are bugs, design decisions or if
 we could simply talk about them :-)

Thanks for the great feedback. I've got some replies and we'd love to help
improve the experience.


 Juju GUI:

 11. Juju's GUI's bells and whistles are nice, but I think there's a bug
 with it because its statuses are inaccurate. If you set a relation, Juju
 says the relation is green and active immediately, which is not true if you
 keep tailing the log file and you know things can still fail because
 scripts are still running.

The relation is green, but if it errors after some time it should turn red
with error info. If the relation goes into an error state and it does not
then that's a bug we'd love to fix. If you could file the bug and let us
know if there are two services with which this is easily replicated, that'd
be awesome!

https://bugs.launchpad.net/juju-gui/+filebug

 12. Cancelling actions on Juju's GUI does not make much sense since you
 need to click on commit, then click on clear, then confirm it. Why not
 simply have a different cancel button instead? It's like shutting down
 Windows from the start menu. The cancel button should cancel the action,
 and the actual X button should simply dismiss it. That clear button seems
 useless UX-wise?

Thanks, we'll take this feedback to the UX team. The deployment bar is a
new UX item and getting feedback on the use of it is greatly appreciated.

 13. Juju's GUI's panel with charmstore stays open all the time wasting
 window space (so I have to zoom out virtually all my deployments because of
 the amount of wasted space, every time). There could be a way to hide that
 panel, because honestly it's useless locally since it never lists my local
 charms even if I export JUJU_REPOSITORY correctly. I'd rather have my local
 charms listed there too or just hide the panel instead.

You can hide the panel. If you type '?' a series of keyboard shortcuts come
up. One of them is to toggle the sidebar using 'ctrl-shift-h' (hide).
Please try that out and let us know if that helps or not. As we improve the
sidebar and make the added services bar more prevalent, we hope the sidebar
being there becomes more and more useful.


 13. Juju's GUI shows new relations info incorrectly. If I set up a DB
 relation to my service it simply says in the confirmation window that db
 relation added between postgresql and postgresql. I've noticed sometimes
 this changes to between myservice and myservice so perhaps it has to do
 with the order of the relation, from what service to the other? Anyway,
 both cases seem to show it wrong?

Thanks, we'll look into this. Are there two services with which you can
replicate this every time, or is it something that happens less consistently?

 14. Juju's GUI always shows the service panel even if the service unit has
 been destroyed, just because I opened it once. Also, it says 1 dying
 units (sic) forever until I close it manually.

By service panel, do you mean the details panel that slides out from the left
sidebar? We can definitely look into making sure those go away when the
unit or service are destroyed.

 15. Why don't subordinate charms have a color bar beneath their icons too?
 Because if it fails then it will appear in red, right? Why not always
 display it to indicate it's been correctly deployed or set up?

There's a UX decision to try not to highlight subordinates unless there's an
issue because they tend to clutter the UI. With the new added services bar
and the ability to show/hide them perhaps it's something we should revisit.

 16. Juju's GUI lists all my machines. Like, all of them, really. In the
 added services part of the panel it lists even inactive machines, which
 does not make much sense I'd say because it makes it seem only deployed
 machines are listed. I think that count is wrong.

The GUI lists the machines it knows about from Juju. I'm not sure about
hiding them because in machine view we use them for targets to deploy
things to. Now machines are only listed in machine view, but you mention
seeing them in the 'added services' panel? Do you have a screenshot of what
you mean we could take a look at?

 That's it, thank you for those who made it to the end :-D

And thank you for taking the time to write out the great feedback.

--

Rick Harding

Juju UI Engineering
https://launchpad.net/~rharding
@mitechie



Re: feedback about juju after using it for a few months

2014-12-17 Thread Tim Penhey
On 18/12/14 11:24, Caio Begotti wrote:
 Folks, I just wanted to share my experience with Juju during the last
 few months using it for real at work. I know it's pretty long but stay
 with me as I wanted to see if some of these points are bugs, design
 decisions or if we could simply talk about them :-)
 
 General:
 
 1. Seems that if you happen to have more than... say, 30 machines, Juju
 starts behaving weirdly until you remove unused machines. One of the
 weird things is that new deploys all stay stuck with a pending status.
 That happened at least 4 times, so now I always destroy-environment when
 testing things just in case. Has anyone else seen this behaviour? Can
 this be because of LXC with Juju local? I do a lot of Juju testing, so it's
 not unusual for me to have a couple hundred machines after a month, by
 the way.

I'll answer this one now.  This is due to not enough file handles.  It
seems that the LXC containers that get created inherit the handles of
the parent process, which is the machine agent.  After a certain number
of machines, and it may be around 30, the new machines start failing to
recognise the new upstart script because inotify isn't working properly.
This means the agents don't start, and don't tell the state server they
are running, which means the machines stay pending even though lxc says
yep you're all good.

I'm not sure how big we can make the limit nofile in the agent upstart
script without it causing problems elsewhere.

Tim




Re: feedback about juju after using it for a few months

2014-12-17 Thread Caio Begotti
On Wed, Dec 17, 2014 at 8:47 PM, Tim Penhey tim.pen...@canonical.com
wrote:

  1. Seems that if you happen to have more than... say, 30 machines, Juju
  starts behaving weirdly until you remove unused machines. One of the
  weird things is that new deploys all stay stuck with a pending status.
  That happened at least 4 times, so now I always destroy-environment when
  testing things just in case. Has anyone else seen this behaviour? Can
  this be because of LXC with Juju local? I do a lot of Juju testing, so it's
  not unusual for me to have a couple hundred machines after a month, by
  the way.

 I'll answer this one now.  This is due to not enough file handles.  It
 seems that the LXC containers that get created inherit the handles of
 the parent process, which is the machine agent.  After a certain number
 of machines, and it may be around 30, the new machines start failing to
 recognise the new upstart script because inotify isn't working properly.
 This means the agents don't start, and don't tell the state server they
 are running, which means the machines stay pending even though lxc says
 yep you're all good.

 I'm not sure how big we can make the limit nofile in the agent upstart
 script without it causing problems elsewhere.


Hey, that makes a lot of sense. I wonder if you can detect that in advance
and perhaps make Juju tell the sysadmin about the limit being reached (or
nearly reached) then?
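
In the meantime, a rough manual check of how close the machine agent is to
its descriptor limit (process name and paths assume the local provider and
are illustrative; run it with sudo since the agent runs as root):

  #!/bin/bash
  # Illustrative: compare the machine agent's open file descriptors against
  # its nofile soft limit.
  pid=$(pgrep -f 'jujud machine' | head -n1)
  [ -n "$pid" ] || { echo "no jujud machine agent found"; exit 1; }

  limit=$(awk '/Max open files/ {print $4}' "/proc/$pid/limits")
  in_use=$(ls "/proc/$pid/fd" | wc -l)
  echo "machine agent pid $pid: $in_use open fds, soft limit $limit"

  # Raising the cap would mean adding e.g. `limit nofile 8192 8192` to the
  # agent's upstart job, as Tim mentions above.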


Re: feedback about juju after using it for a few months

2014-12-17 Thread Marco Ceppi
Wow, what a great email and fantastic feedback. I'm going to attempt to
reply and address each item inline below.

I'm curious, what version of Juju are you currently using?

On Wed Dec 17 2014 at 5:25:08 PM Caio Begotti caio1...@gmail.com wrote:

 Folks, I just wanted to share my experience with Juju during the last few
 months using it for real at work. I know it's pretty long but stay with me
 as I wanted to see if some of these points are bugs, design decisions or if
 we could simply talk about them :-)

 General:

 1. Seems that if you happen to have more than... say, 30 machines, Juju
 starts behaving weirdly until you remove unused machines. One of the weird
 things is that new deploys all stay stuck with a pending status. That
 happened at least 4 times, so now I always destroy-environment when testing
 things just in case. Has anyone else seen this behaviour? Can this be because
 of LXC with Juju local? I do a lot of Juju testing, so it's not unusual for me
 to have a couple hundred machines after a month, by the way.


LXC can get...flaky, especially depending on the power of your machine. I
haven't seen an issue running 35 LXC containers with Juju on my desktop but
it's got i7 processors and 32GB of RAM :)

We're adding code that will reap empty machines after a short period of
time. This will help in your case, and it helps others who are running in
the cloud and don't want to spend money on cloud providers for machines
doing nothing!

2. It's not reliable to use Juju on laptops, which I can understand, of
 course, but just in case... if the system is suspended Juju will not recover
 itself like the rest of the system services. It loses its connection to
 its API apparently? Hooks fail too (resuming always seems to call
 hooks/config-changed)? Is this just with me?


This is something I'm actually working on addressing by adding `juju local
suspend` and `juju local resume` commands via a `juju-local` plugin
(https://github.com/juju-solutions/juju-local). I hope to have this out for
the new year. I'll also be cramming in more functionality to make using the
local provider much more reliable and easy.


 3. The docs recommend writing charms in Python versus shell script.
 Shell-script charms are subpar enough compared to Python that I'd recommend
 saying they are not officially supported. It's quite common to have race
 conditions in charms written in shell script. You have to keep polling the
 status of things because if you just call deploys and set relations in a
 row they will fail, because Juju won't queue the commands in a logical
 sequence, it'll just run them dumbly and developers are left in the wild to
 control it. I'm assuming a Python charm does not have this problem at all?


So, shell charms are fine, and we have quite a few that are written well.
We can discourage people from using them, but Juju and charms are about
choice and freedom. If an author wants to write charms in bash that's fine
- we will just hold them to the same standard as all other charms.
Something we've been diligently working on is charm testing. We're nearing
the conclusion of the effort to add some semblance of testing to each charm
and run those charms against all substrates and architectures we support.
In doing so we can find poorly written charms and charms written well
(regardless of the charm's language).

Polling is something all charms will do, but I will address this more later
on with your question about blocking on relation-get.

4. It's not very clear to me how many times hooks/config-changed runs; I'd
 just guess many :-) so you have to pay attention to it and write extra
 checks to avoid multiple harmful runs of this hook. I'd say the sequence
 and number of hooks called by a new deploy is not very clear based on the
 documentation because of this. Hmm, perhaps I could print-debug it and count
 the hits...


The sequence is pretty standard across all charms. The number of
invocations will always be 1 + N: each hook runs at least once, and there is
no guarantee how many more times it will execute. The standard sequence is
as follows, though:

$ juju deploy $charm

install -> config-changed -> start

$ juju set $charm key=val

config-changed

$ juju add-relation $charm:db $other_charm

db-relation-joined
db-relation-changed
(db-relation-changed every time data on the relation wire changes)

In this case relation-changed will always run at least once, but typically
executes more than once.

$ juju remove-relation $charm $other_charm

db-relation-departed
db-relation-broken

Again, these hooks may execute more than once; in fact, any hook may execute
more than once. That's why hooks need to be idempotent (see the sketch after
this sequence).

$ juju upgrade-charm $charm

upgrade-charm
install

$ juju destroy-service $charm

stop
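
A tiny illustration of what idempotent means in practice for a shell hook
(the package name and paths are made up for the example):

  #!/bin/bash
  # hooks/install -- illustrative idempotent hook: safe to run any number
  # of times because every step checks state before changing it.
  set -eu

  # apt-get install is already idempotent; -y keeps it non-interactive.
  apt-get install -y nginx

  # Only replace the config if it actually changed, and only restart then.
  printf 'server { listen 8080; }\n' > /etc/nginx/conf.d/charm.conf.new
  if ! cmp -s /etc/nginx/conf.d/charm.conf.new /etc/nginx/conf.d/charm.conf; then
    mv /etc/nginx/conf.d/charm.conf.new /etc/nginx/conf.d/charm.conf
    service nginx restart
  else
    rm /etc/nginx/conf.d/charm.conf.new
  fi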


 5. Juju should queue multiple deployments in order not to hurt performance,
 both of disk and network IO. More than 3 deployments in parallel on my
 machine makes it all really slow. I just leave Juju for a while and go get
 some coffee because the 

Re: feedback about juju after using it for a few months

2014-12-17 Thread Caio Begotti
On Wed, Dec 17, 2014 at 8:44 PM, Richard Harding rick.hard...@canonical.com
 wrote:

  11. Juju's GUI's bells and whistles are nice, but I think there's a bug
  with it because its statuses are inaccurate. If you set a relation, Juju
  says the relation is green and active immediately, which is not true if you
  keep tailing the log file and you know things can still fail because
  scripts are still running.

 The relation is green, but if it errors after some time it should turn red
 with error info. If the relation goes into an error state and it does not
 then that's a bug we'd love to fix. If you could file the bug and let us
 know if there are two services with which this is easily replicated, that'd
 be awesome!


I understand that, but the UI is telling me "all is good, you're free to go
on and play with the other Juju magic here in the GUI"; then I do that and
things start to fail, because that green bar actually should not have turned
green while scripts were still running. I understand the principle from your
explanation, though :-) perhaps it could be transparent, or pulsing green,
or any other temporary in-between state?

I know it may fail and it will turn red, but it does not matter because I
would never set a second relation between my units if one of them isn't
really ready (i.e. green). When I see it green, I think it's ready.
Except it's not :-(

  13. Juju's GUI's panel with charmstore stays open all the time wasting
  window space (so I have to zoom out virtually all my deployments because of
  the amount of wasted space, every time). There could be a way to hide that
  panel, because honestly it's useless locally since it never lists my local
  charms even if I export JUJU_REPOSITORY correctly. I'd rather have my local
  charms listed there too or just hide the panel instead.

 You can hide the panel. If you type '?' a series of keyboard shortcuts come
 up. One of them is to toggle the sidebar using 'ctrl-shift-h' (hide).
 Please try that out and let us know if that helps or not. As we improve the
 sidebar and make the added services bar more prevalent, we hope the sidebar
 being there becomes more and more useful.


Perhaps I didn't notice that because I always have the GUI on a separate
monitor, so I just use it with the mouse, sorry. But if you can hide it with
a shortcut, what stops us from having a clickable area in there to hide it
with the cursor? Or is there one?


  13. Juju's GUI shows new relations info incorrectly. If I set up a DB
  relation to my service it simply says in the confirmation window that db
  relation added between postgresql and postgresql. I've noticed sometimes
  this changes to between myservice and myservice so perhaps it has to do
  with the order of the relation, from what service to the other? Anyway,
  both cases seem to show it wrong?

 Thanks, we'll look into this. Are there two services with which you can
 replicate this every time, or is it something that happens less consistently?


I've seen that with the Postgres charm in the store more specifically. But
I think with Apache's and RabbitMQ's too, which made me start to wonder if
it wasn't a problem with the GUI instead of with the charms.



  14. Juju's GUI always shows the service panel even if the service unit has
  been destroyed, just because I opened it once. Also, it says 1 dying
  units (sic) forever until I close it manually.

 By service panel, do you mean the details panel that slides out from the left
 sidebar? We can definitely look into making sure those go away when the
 unit or service are destroyed.


Yep! That one :-)


  16. Juju's GUI lists all my machines. Like, all of them, really. In the
  added services part of the panel it lists even inactive machines, which
  does not make much sense I'd say because it makes it seem only deployed
  machines are listed. I think that count is wrong.

 The GUI lists the machines it knows about from Juju. I'm not sure about
 hiding them because in machine view we use them for targets to deploy
 things to. Now machines are only listed in machine view, but you mention
 seeing them in the 'added services' panel? Do you have a screenshot of what
 you mean we could take a look at?


Not now, but I can take one tomorrow. I don't see the machines themselves,
it's just their count in the added services panel, which is odd because I
don't have that many machines active in the deployment. I'll make a note so
I don't forget to take the screenshot :-)


  That's it, thank you for those who made it to the end :-D

 And thank you for taking the time to write out the great feedback.


Cool beans!


Re: feedback about juju after using it for a few months

2014-12-17 Thread John Meinel
...


 9. If you want to cancel a deployment that just started you need to keep
 running remove-service forever. Juju will simply ignore you if it's still
 running some special bits of the charm or if you have previously asked it
 to cancel the deployment while it is still setting up. No errors, no other
 messages are printed. You need to actually open its log to see that it's
 still stuck in a long apt-get installation and you have to wait until the
 right moment to remove-service again. And if your connection is slow, that
 takes time, you'll have to babysit Juju here because it doesn't really
 control its services as I imagined. Somehow apt-get gets what it wants :-)


 You can now force-kill a machine. So you can run `juju destroy-service
 $service` then `juju terminate-machine --force #machine_number`. Just make
 sure that nothing else exists on that machine! I'll raise an issue for
 having a way to add a --force flag to destroying a service so you can just
 say kill this with fire, now plz


 I understand that, but I discovered it's way faster and less typing if I
 simply destroy-environment and bootstrap it again. If you need to force
 kill something every time you need to kill it, then perhaps something is
 wrong?


 I agree, something is wrong with the UX here. We need to (and would love
 your feedback) figure out what should happen here. The idea is, if a
 service experiences a hook failure, all events are halted, including the
 destroy event. So the service is marked as dying but it can't die until the
 error is resolved. There are cases, during unit termination, where you
 may wish to inspect an error. I think adding a `--force` flag to destroy
 service would satisfy what you've outlined, where --force will ignore hook
 errors during the destruction of a service.

 Thanks,
 Marco Ceppi


IIRC, the reason we support juju destroy-machine --force but not juju
destroy-unit --force is that in the former case, because the machine is
no more, Juju has ensured that cleanup of resources really has happened.
(There are no more machines running that have software running you don't
want.)
The difficulty with juju destroy-unit --force is that it doesn't
necessarily kill the machine, and thus an unclean teardown could easily
leave the original services running (consider collocated deployments).
juju destroy-service --force falls into a similar issue, only a bit more
so since some units may be on shared machines and some may be all by
themselves.

That said, I feel like we're doing a little throwing the baby out with the
bathwater. If you are in a situation where there is just one unit on each
machine, then destroy-unit --force could be equivalent to destroy-machine
--force, and that could chain up into destroy-service --force (if all units
of service are the only thing on their machines, then tear them all down
ignoring errors and stop the machines).

John
=:-