I recently read about the SELFMAN project on ReadWriteWeb ("Coming Soon: Internet Apps that Heal Themselves"). This project, started in 2006, aims to standardize and package the infrastructure needed to manage large-scale distributed applications. So far, they've produced some sample and reference applications using components written in Java. I don't know if the bits themselves will find their way to your project any time soon, but if you're working on large-scale applications, you should definitely take a look at their approach.
As far as I can tell, much of the actual software that's been produced thus far deals with managing an application where you control all the pieces. The techniques used by this software are interesting in and of themselves, but I'm really more interested to see whether any of them can be applied more broadly to help federated systems cope with unreliable components they don't control.
Right now, we're in the early days of applied cloud computing, and we're seeing a fairly regular parade of large-scale, highly publicized cloud failures from big-name players like Google, Amazon, Facebook, and Twitter. As we deploy more applications into the cloud and construct these applications to integrate more seamlessly with all the other applications in the cloud, we start to introduce some really insidious dependencies. How can we ensure that our application doesn't crash every time Twitter burps?
At present, this sort of resilience must be custom-crafted in each and every application, which is time-consuming and error-prone. I'd love to see more thought put into standardized approaches like those used by SELFMAN so that enterprise resiliency can become the norm rather than the exception.
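To make "custom-crafted" a little more concrete, here is a minimal sketch of the kind of guard an application typically ends up writing by hand around a remote dependency. To be clear, this is not SELFMAN's approach or code; it's a generic circuit-breaker pattern, and every name in it (CircuitBreaker, fetch_timeline, cached_timeline) is hypothetical and chosen purely for illustration.

```python
import time


class CircuitBreaker:
    """Wrap calls to an unreliable dependency and fall back when it misbehaves.

    Illustrative sketch only: names and thresholds are assumptions.
    """

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds while open
        self.clock = clock                # injectable so the logic can be tested
        self.failures = 0
        self.opened_at = None             # None means closed (calls are allowed)

    def call(self, func, fallback):
        if self.opened_at is not None:
            # Circuit is open: skip the dependency until the cool-down elapses.
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()
            # Cool-down over: allow a trial call, but reopen on a single failure.
            self.opened_at = None
            self.failures = self.max_failures - 1

        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()

        self.failures = 0  # a success resets the failure count
        return result


# Hypothetical dependency that is currently "burping".
def fetch_timeline():
    raise ConnectionError("upstream service is down")


# Hypothetical degraded answer the application can live with.
def cached_timeline():
    return ["(cached) last known posts"]


breaker = CircuitBreaker(max_failures=2, reset_after=60.0)
for _ in range(3):
    # The first two calls try the dependency and fail; the third skips it.
    print(breaker.call(fetch_timeline, cached_timeline))
```

Even this toy version has to make decisions about thresholds, cool-down periods, and what a sensible fallback looks like, and every application that talks to an outside service has to make them again. That repetition is exactly what a standardized approach could take off our hands.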