[Sysadmins] Timing automated batch jobs ...

Nick Hilliard nick at netability.ie
Wed Sep 16 19:03:08 IST 2009

On 15/09/2009 12:27, Colm MacCárthaigh wrote:
> For some reason, I keep coming across the pattern of cron jobs that
> have "sleep $RANDOM % 3600" in them - in a vain attempt to spread
> load. It's a really harmful dangerous pattern, it annoyed me so much,
> that I've written about it ;

This practice stinks in a lot of ways, but the previous default was far 
worse - i.e. do everything at exactly zero past the hour.  I've attached
some PNGs of INEX's AS112 server from today which give some indication of
how badly regular synchronisation stinks in real life.  Given the low 
traffic levels on this box, it doesn't cause a problem, but it's pretty 
clear that there's something very peculiar going on which would have the 
potential to cause a lot of trouble if traffic levels were high.  I haven't 
looked closely at the exact causes of the problem, but I suspect a mass 
effect of some form.

The point of all of this is: adding timing entropy into the mix
here would help things quite a lot for the most part.  However, it would 
probably cause less regular but larger peaks according to normal 
distribution, which will - as you point out - cause occasional chaos but 
will actually work well most of the time.  So it depends on what sort of 
problems you want to experience in practice: predictable peaks or less 
predictable peaks.
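
To make that trade-off concrete, here is a rough simulation sketch.  It is
not a measurement of the AS112 box or of anything else real; the function
names, the client count and the one-hour window are all assumptions picked
for illustration.  It compares the worst per-second pile-up when every
client fires at zero past the hour against the worst pile-up when each
client sleeps a uniformly random number of seconds first.

```python
import random
from collections import Counter


def peak_per_second(offsets):
    """Return the largest number of jobs that start in the same second."""
    return max(Counter(offsets).values())


def synchronised(clients):
    # Every client fires at exactly zero past the hour.
    return [0] * clients


def jittered(clients, window=3600, seed=None):
    # Each client first sleeps a uniformly random number of seconds,
    # roughly what a random sleep of up to an hour does in a cron job.
    rng = random.Random(seed)
    return [rng.randrange(window) for _ in range(clients)]


if __name__ == "__main__":
    clients = 10000  # hypothetical population size

    print("synchronised peak:", peak_per_second(synchronised(clients)))

    # Repeat the jittered hour a few times: the peak is far lower than
    # the synchronised one, but it moves around from hour to hour.
    peaks = [peak_per_second(jittered(clients, seed=s)) for s in range(20)]
    print("jittered peaks over 20 simulated hours:",
          min(peaks), "to", max(peaks))
```

The synchronised peak is always the whole population landing in one
second.  The jittered peak sits well below that but above the
mean (about 2.8 jobs per second for these numbers), and where it lands
changes from hour to hour, which is the "less predictable peaks" side of
the trade.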

There are lots of other factors in the mix here: you might be using offset 
cron jobs to avoid local peak resource utilisation problems on the client 
box.  And small scale systems (like INEX's AS112 box) will behave
differently to larger ones (e.g. Skype with their Windows Update problem -
and with due respect to Skype, thundering herd problems on large scale
system startups are not unpredictable). I'm not sure that your proposed
cure of centralised resource arbitration is necessarily going to be better 
than the disease.  It adds lots of complexity and that is bad in itself.

resource curmudgeon
