State vs. Events

I had an interesting experience during an application migration a while back. The application used a NoSQL data store and an event streaming platform, and an unforeseen hiccup meant the event stream couldn't make the migration smoothly -- we'd wind up starting the "after" version of the application with nothing at all in the event stream, though we'd have all the NoSQL data from the prior application.

My initial reaction was panic. There's no way this could stand. But in examining the problem, not only was it going to be genuinely difficult to port the live event stream, it turned out our application didn't really care too much as long as we throttled down traffic before migrating state. That was the moment I came to really understand state vs. events, and how an event streaming application (or service) is fundamentally different from a traditional state-based application.

Retrain your brain

We've been trained to build applications with a focus on state and how we store it. This can make it hard to transition to an architecture of services where things are in motion -- events. There will always be circumstances in which at-rest state is appropriate and necessary -- reporting, for instance. But when analyzing software requirements, look for places where state-bias causes you to think about dynamic systems as if you're looking at static snapshots.

A shopping cart / order scenario is a typical example of this sort of state flattening. Consider the simple order shown here - an order with three line items. By ignoring how this order came to be, we potentially lose a lot of information. It's possible this customer considered other items before settling on the items in this cart, and by considering only the state at the end of checkout, much of that value is lost.

This alternative stream yields the same final shopping cart, but in this case, we can see additional / alternative items that were abandoned in the cart. This view is more aligned to what you'd expect in an event stream like Kafka vs. a state store (SQL or no-SQL).

In this example, it's likely we'd want to understand why items (a) and (b) were removed - perhaps because of availability or undesirable shipping terms. We'd also want to understand whether the new items (2) and (3) were added as replacements or alternatives to (a) and (b). Regardless, the absence of the events in the first place prevents any such awareness, let alone analysis.
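
To make the difference concrete, here's a minimal sketch of that cart as an event stream. The event and projection types are invented for illustration -- they're not from Kafka or any particular event-sourcing framework:

    using System.Collections.Generic;

    // Illustrative types only -- not tied to any particular event store or bus.
    public abstract record CartEvent(string Sku);
    public sealed record ItemAdded(string Sku) : CartEvent(Sku);
    public sealed record ItemRemoved(string Sku) : CartEvent(Sku);

    public static class CartProjection
    {
        // Replaying the stream "flattens" it into the current-state view of the cart.
        public static HashSet<string> Replay(IEnumerable<CartEvent> events)
        {
            var cart = new HashSet<string>();
            foreach (var e in events)
            {
                if (e is ItemAdded) cart.Add(e.Sku);
                if (e is ItemRemoved) cart.Remove(e.Sku);
            }
            return cart;
        }
    }

Replaying a stream like { add item-1, add (a), add (b), remove (a), remove (b), add item-2, add item-3 } yields the same three-line-item order as the flattened state, but the additions and removals of (a) and (b) are still there to examine.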

As a developer, you might relate to an example that hits closer to home. Recall the largest source code base you've worked on -- ideally one with several years of history. Now zip it up -- minus the .git folder -- and remove the hosted repository. That static source snapshot certainly has value, but consider the information you've lost. You're no longer able to see who contributed which changes, or when. Assuming you've tied commits to work items, you've lost that traceability, too. The zip file is a perfect representation of state, but without the events that created that state, you're working in a severely diminished capacity.

Hybrid approaches

Of course, many systems combine state and events, and the focus on one or the other could be visualized on a continuum. On the left, storing only events, you'd have something that looks like a pure event-sourcing design[1]. If and when you need to see current state, the only way to get that is to replay all the events in the stream for the period that interests you. On the right, of course, you have only current state, with no record of how that state came to be.

In practice, most services and almost all applications will be somewhere between the extremes of this continuum. Even a pure event-sourcing design will store state in projections for use in queries and/or snapshots to mark waypoints in state. Even more likely will be a mix of services with relative emphasis differing from service to service:

There's no right answer here, and the best design for many applications will employ a hybrid approach like this to capture dynamic behavior where it's appropriate without losing the speed advantages typically seen in flattened-state data stores. When considering these alternatives, though, I'd encourage you to keep your historical bias in mind and try not to lose event data when it may be important.


  1. For more on event sourcing, see Martin Fowler's excellent intro as well as this introduction on EventStore.com.

Dynamic Validation using FluentValidation

Validation in a C# class is an area where configuration / customization pressures can sneak into an application. Simple scenarios are typically easy to serve with built-in features, but as complexity builds, you may be tempted to start writing hard-coded if-then blocks that contribute to technical debt. In this article, I'll look at one technique to combat that hard-coding.

Simple Validation

Consider an application tracking cars on a dealer's lot. For this example, the car model is super-simple -- just make, model, and VIN:

    public class Vehicle
    {
        public string Make { get; set; }
        public string Model { get; set; }
        public string VIN { get; set; }
    }

Microsoft has a data annotation library that can handle very simple validations just by annotating properties, like this:

    [Required]
    public string Make { get; set; }

When used with the data annotations validation context, this is enough to tell us a model isn't considered valid:

    var v = new Vehicle.Vehicle();
    Assert.IsNotNull(v);
    ValidationContext context = new ValidationContext(v);
    List<ValidationResult> validationResults = new List<ValidationResult>();
    bool valid = Validator.TryValidateObject(v, context, validationResults, true);
    Assert.IsFalse(valid);

This is not, however, especially flexible. Data annotation-based validations become more difficult under circumstances like:

  • Multiple values / properties participate in a single rule.
  • Sometimes a validation failure is considered a problem, and sometimes it's not.
  • You want to customize the validation message (beyond simple resx-based resources).

Introducing FluentValidation

Switching this to use FluentValidation, we now pick up a dedicated class to perform the validation, and the syntax for executing the validation itself isn't much different so far.

    public class VehicleValidator : AbstractValidator<Vehicle>
    {
        public VehicleValidator()
        {
            RuleFor(vehicle => vehicle.Make).NotNull();
        }
    }
...
    // testing
    var v = new Vehicle.Vehicle();
    var validator = new VehicleValidator();
    var result = validator.Validate(v);

By itself, this change doesn't make much difference, but because we've switched to FluentValidation, a few new use cases become a bit easier. Be forewarned -- I'm using some contrived examples here that are super-simplified for clarity.

Contrived Example 1 - search model

Again, this example isn't what you'd use at scale, but let's say you'd like to use the Vehicle model under two different scenarios with different sets of validation rules. When adding a new vehicle to the lot, for instance, you'd want to be sure you're filling in the fields required to properly identify a real car -- make, model, VIN (and presumably a boatload of other fields, too). This looks similar to the example above where we required only Make. The same model could be used as the request in a search method, but validation would be very different - in this case, we don't expect all the fields to be present, but we very well might want to require at least one of them.

    public class SearchVehicleValidator : AbstractValidator<Vehicle>
    {
        public SearchVehicleValidator()
        {
            // Each field is required only when the other two are missing --
            // the net effect is that at least one search criterion must be supplied.
            RuleFor(v => v.Make).NotNull()
              .When(v => string.IsNullOrEmpty(v.Model) && string.IsNullOrEmpty(v.VIN))
              .WithMessage("At least one search criterion is required (1)");

            RuleFor(v => v.Model).NotNull()
              .When(v => string.IsNullOrEmpty(v.Make) && string.IsNullOrEmpty(v.VIN))
              .WithMessage("At least one search criterion is required (2)");

            RuleFor(v => v.VIN).NotNull()
              .When(v => string.IsNullOrEmpty(v.Make) && string.IsNullOrEmpty(v.Model))
              .WithMessage("At least one search criterion is required (3)");
        }
    }

In FluentValidation, this syntax isn't too bad with just three properties, but as your model grows, you'd want a more elegant solution. In any event, we can now validate the same class with different validation rules, depending on how we intend to use it:

    var v = new Vehicle.Vehicle() { Make = "Yugo" };
    var svalidator = new SearchVehicleValidator();
    var result = svalidator.Validate(v);
    Assert.IsTrue(result.IsValid);  // one non-null field here is sufficient to be valid

    var lvalidator = new ListingVehicleValidator();
    result = lvalidator.Validate(v);
    Assert.IsFalse(result.IsValid);  // all fields must now be valid
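
For completeness, the ListingVehicleValidator used in that test isn't shown above; a minimal sketch consistent with the test (all identifying fields required when listing a vehicle) might look like this:

    public class ListingVehicleValidator : AbstractValidator<Vehicle>
    {
        public ListingVehicleValidator()
        {
            // Listing a vehicle on the lot requires all identifying fields.
            RuleFor(vehicle => vehicle.Make).NotNull();
            RuleFor(vehicle => vehicle.Model).NotNull();
            RuleFor(vehicle => vehicle.VIN).NotNull();
        }
    }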

Contrived Example 2 - dynamic severity

Buckle up - we're going to shift into high gear on the contrived scale. For this example, let's say you're deploying this solution to several customers, and they have different rules about handling car purchases -- some expect to see a customer pre-qualified for credit before finalizing a purchase price, and some don't. Again, the example isn't as important as the technique, so please withhold judgment on the former. In addition to extending the Vehicle model to become a VehicleQuote model:

    public class VehicleQuote : Vehicle
    {
        public decimal? PurchasePrice { get; set; }
        public bool? CreditApproved { get; set; }
    }

We're also going to validate slightly differently. Here, note the context we're passing in and the dynamic severity for the credit approved rule:

    public class QuoteVehicleValidator : AbstractValidator<VehicleQuote>
    {
        VehicleContext cx;

        public QuoteVehicleValidator(VehicleContext cx)
        {
            this.cx = cx;

            RuleFor(vehicle => vehicle.Make).NotNull();
            RuleFor(vehicle => vehicle.Model).NotNull();
            RuleFor(vehicle => vehicle.VIN).NotNull();

            // dynamic rule -- severity is supplied by the context rather than hard-coded
            RuleFor(vehicle => vehicle.CreditApproved).Must(c => c == true).WithSeverity(cx.CreditApprovedSeverity);
        }
    }
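
The VehicleContext type isn't shown above, either; a minimal sketch sufficient for this example (the property name is simply the one the validator uses) would be:

    public class VehicleContext
    {
        // Severity to apply when the credit-approval rule fails --
        // supplied per customer / deployment rather than hard-coded.
        public FluentValidation.Severity CreditApprovedSeverity { get; set; }
    }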

This lets us pass in an indicator for whether we consider credit approval to be an error, a warning, or even an info message. The tests evaluate "valid" differently here, since any rule failure at all -- even one with an Info severity -- will cause IsValid to be false.

    var vq = new Vehicle.VehicleQuote() { Make = "Yugo", Model = "Yugo", VIN = "some vin" };
    var cx = new VehicleContext() { CreditApprovedSeverity = FluentValidation.Severity.Error };
    var qvalidator = new QuoteVehicleValidator(cx);
    var result = qvalidator.Validate(vq);
    Assert.IsTrue(result.Errors.Any(e => e.Severity == FluentValidation.Severity.Error));  // non-approved credit is an error here

    cx.CreditApprovedSeverity = FluentValidation.Severity.Info;
    qvalidator = new QuoteVehicleValidator(cx);
    result = qvalidator.Validate(vq);
    Assert.IsFalse(result.Errors.Any(e => e.Severity == FluentValidation.Severity.Error));  // non-approved credit is no longer an error

Note that the same technique can be used for validation messages -- you could drive these from a resx file or make them entirely dynamic if you need to create different experiences under different circumstances.
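
As a sketch of that idea -- the CreditApprovedMessage property below is hypothetical, not part of the sample code -- the message can be pulled from the context inside the validator's constructor in the same way as the severity:

    // Hypothetical: VehicleContext extended with a per-customer message.
    RuleFor(vehicle => vehicle.CreditApproved)
        .Must(c => c == true)
        .WithSeverity(cx.CreditApprovedSeverity)
        .WithMessage(vehicle => cx.CreditApprovedMessage);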

Wrapping it up

The examples here are admittedly light on depth, but they should show some of the ways FluentValidation can create dynamic behavior in validation without incurring the technical debt of custom code or a labyrinth of if-thens. These techniques are also simple and easy to test.

You can find the code for this article at https://github.com/dlambert-personal/dynamic-validation

Customer Lifecycle Architecture

If you've gone through any "intro to cloud development" presentations, you've seen the argument for cloud architecture based on the need for infrastructure to scale up and down. Maybe you run a coffee shop bracing for the onslaught of pumpkin-spice-everything, or maybe you're just in a 9-5 office where activity dries up after hours. In any event, this scale-up / scale-down traffic shaping is well-known to most technologists at this point.

But, there's another type of activity shaping I'd like to explore. In this case, we're not looking at the aggregate effect of thousands of PSL-crazed customers -- we're going to look at the lifecycle of a single customer.

Consider these two types of customers -- very different from one another. The first profile is what you'd expect for a customer with low acquisition cost and low setup activity -- an application like Meta's might have a profile like this, especially when you expect users to keep consuming at a steady rate.

Compare this to a customer with high acquisition cost and high setup activity -- this could be a customer with a long sales lead time or one with a lot of setup work, configuration, training, or the like. Typically, these are customers with a higher per-unit revenue model to support this work. Examples of a profile like this could be sales of high-dollar durable goods, financial instruments like loans, or insurance policies, and so on.

So, What?

Why, exactly, would we care about a profile like this? I believe this sort of model facilitates some important conversations about how a company is allocating software spend relative to customer activity, costs, and revenue, and I also think an understanding of this profile can help us understand the architectural needs of services supporting this lifecycle.

Note that this view of customer activity bears some resemblance to a discipline called Customer Lifecycle Management (CLM). Often associated with Customer Relationship Management (CRM), CLM is typically used to understand the activities associated with acquiring and maintaining a customer across five distinct steps: reach, acquisition, conversion, retention, and loyalty.

Rather than just working to understand what's happening with your customer across the time axis, consider some alternative views. We can map costs and revenue per customer in a view like this, for example. Until recently, the cost to acquire a customer could be known only on an aggregate basis - computed based on total costs across all customers and allocated as an estimate. But as finops practices improve in fidelity and adoption, information like this could become a reality:

FinOps promises to give us the tools to track the investment in software you've purchased, SaaS tools, and of course, custom tools and workflows you've developed. Unless these practices are extremely mature in your organization, you'll likely need to allocate costs to compute these values. Also note that these costs are typically separated on an income statement -- those occurring pre-sale tend to show up as Cost of Goods Sold (COGS), while those after the sale are operating expenses (OpEx). If you intend to allocate COGS per customer, you'll need to spread pre-sale spending across the customers you actually close, since the prospects you don't close never generate any revenue at all -- spend $100k courting twenty prospects and close five, for instance, and each closed customer carries $20k of acquisition cost.

Less commonly considered is the architectural connection to a view like this. You might see a pattern like this if you're writing an insurance policy and then billing for it periodically:

Not only do these periods reflect different software needs, they also reflect different opportunities to learn about your customer. Writing a new insurance policy is critically important to an insurance company -- careful consideration goes into making sure the risk of the policy is worth the premium to be collected. For that reason, an insurance company will invest a lot of time and energy in this process, and much will be learned about the customer here:

On the other hand, the periodic billing cycle should be relatively uneventful for customer and carrier alike, and less useful information is found here -- at least when the process goes smoothly.

I believe the high-activity / high-learning area also suggests a high-touch approach to architecture. Specifically, I think this area is likely where highly-tailored software is appropriate for most enterprises, and it also likely yields opportunities for data capture and process improvement. Pay attention to data gathered here, and take care not to discard it if possible. Considering these activity profiles temporally may also lead us to identify separation among services, so in this case, the origination / underwriting service(s) are very likely different from the services we'd use after that customer is onboarded:

What's next?

I'm early in exploring this idea, but I expect to use it as one signal to influence architectural decisions. I expect that an activity-over-time view of customers will help focus conversations about technologies, custom vs. purchased software, types of services and interfaces, and so on.

At a minimum, I think this idea is an important part of modeling services to match activity levels, as not all customers hit the same peaks at the same time.

When is an event not an event?

Events, event buses, message buses, and the like are ubiquitous in any modern microservice architecture. Used correctly, events support scalability and allow services to evolve independently. Used poorly, though, events create a slow, messy monolith with none of those benefits. Getting this part of your architecture right is crucial.

The big, business-facing events are easiest to visualize and understand - an "order placed" event needs no introduction. This is the sort of event that keeps a microservice architecture flexible and decoupled. The order system, when it raises this event, knows nothing about any consumers that might be listening, nor what they might do with the event.

    sequenceDiagram
        participant o as Order Origination
        participant b as Event Bus
        participant u as Unknown Consumer
        o ->> b : Order Placed
        b ->> u : Watching for Orders

Also note that this abstract event, having no details, also doesn't have a ton of value, nor does it have much potential to create conflicts if it changes. This is the "hello, world" of events, and like "hello, world", it won't be too valuable until it grows up, but it represents an event in the truest sense -- it documents an action that has completed.
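
As a sketch of what that looks like from the producer's side (the types here are illustrative -- no particular bus library is implied), the event is little more than a record and a publish call, with no knowledge of who might be listening:

    using System;
    using System.Threading.Tasks;

    // Illustrative types -- not tied to any specific messaging library.
    public sealed record OrderPlaced(Guid OrderId, DateTimeOffset PlacedAt);

    public interface IEventBus
    {
        Task PublishAsync<TEvent>(TEvent @event);
    }

    public class OrderService
    {
        private readonly IEventBus _bus;
        public OrderService(IEventBus bus) => _bus = bus;

        public async Task PlaceOrderAsync(Guid orderId)
        {
            // ... persist the order ...
            // Announce the completed action; consumers are unknown to this service.
            await _bus.PublishAsync(new OrderPlaced(orderId, DateTimeOffset.UtcNow));
        }
    }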

A more complete event will carry more information, but as you design events, try to keep this simple model in mind so you don't wind up with unintended system-to-system coupling when events become linked. This coupling sneaks into designs and becomes technical debt -- in many cases before you realize you've accrued it!

The coupling of services or components is never as black & white as the "hello, world" order placed event - it's a continuum representing how much each service knows about the other(s) involved in a conversation. The characterizations below are a rough tool to help understand this continuum.

When I think about the characterization of events in terms of coupling, the less coupling you see from domain-to-domain, the better (in general).

Characteristics of low coupling:

  • Events produced with no knowledge of how or where they will be consumed. Note that literally not knowing where an event is used can become a big headache if you need to update / upgrade the event; typically, you'd like to see that responsibility fall to the event transport rather than the producing domain.
  • A single event may be consumed by more than one external domain. In fact, the converse is a sign of high coupling.

Characteristics of high coupling:

  • Events consumed by only one external domain. I consider this a message bus-style event - you're still getting some of the benefits of separating domains and asynchronous processing, but the domains are no longer completely separate -- at least one knows about the other.
  • Events that listen for responses. I consider these to be rpc-style events, and in this case, both ends of the conversation know about the other and depend to some extent on the other system. There are places where this pattern must be used, but take care to apply it sparingly!

If you see highly-coupled events -- especially rpc-style events -- consider why that coupling exists, and whether it's appropriate. Typically, anything you can do to ease that coupling will allow domains to move more freely with respect to one another - one of the major reasons you're building microservices in the first place.
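
To make the contrast concrete, here's a rough sketch building on the OrderPlaced record above. The request/reply types and the RequestAsync method are hypothetical names invented for this illustration, not a real bus API:

    using System;
    using System.Threading.Tasks;

    // Hypothetical request/reply types for the rpc-style case.
    public sealed record CreditCheckRequested(Guid CustomerId);
    public sealed record CreditCheckResult(bool Approved);

    public interface IRequestBus
    {
        Task PublishAsync<TEvent>(TEvent @event);                       // fire-and-forget
        Task<TReply> RequestAsync<TRequest, TReply>(TRequest request);  // rpc-style
    }

    public class CheckoutService
    {
        private readonly IRequestBus _bus;
        public CheckoutService(IRequestBus bus) => _bus = bus;

        // Lower coupling: announce a completed action and move on.
        public Task AnnounceOrderAsync(Guid orderId) =>
            _bus.PublishAsync(new OrderPlaced(orderId, DateTimeOffset.UtcNow));

        // Higher coupling, rpc-style: wait on a specific consumer's answer --
        // both ends now know about, and depend on, the other.
        public async Task<bool> IsCreditApprovedAsync(Guid customerId)
        {
            var result = await _bus.RequestAsync<CreditCheckRequested, CreditCheckResult>(
                new CreditCheckRequested(customerId));
            return result.Approved;
        }
    }

If you find yourself reaching for the second form often, that's usually the coupling smell described in the list above.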


A little more on events: State vs Events

Team Topologies and Conway’s Law

Most of the reading I've done lately has been combinatorial in nature; that is, each book I pick up seems to reference another book I feel compelled to add to my reading list. My latest, Team Topologies, has been no different. This book has a pretty dry title but a really compelling pitch: it aims to help companies finally sort their org charts.

The book is structured with a "choose your own adventure" feel, with an introductory section and a series of deeper dives that can apply more or less to different scenarios. In this way, a reader can skip straight to a section that's likely to apply to a problem they're experiencing. In working through the first couple of chapters, though, I was already experiencing "aha" moments.

A central source of inspiration for the ideas in Team Topologies is Conway's Law, which features prominently in the foundational chapters. Conway's Law suggests that organizations build systems that mimic their communication channels, and Team Topologies authors Matthew Skelton and Manuel Pais believe that the human factors of organizational design and the technical factors of systems design are inexorably intertwined. In digesting this book, I found myself reflecting on scenarios that bear out this premise.

If there's a vulnerability in this book at all, it's that intertwining of organizational and technical design. Strictly speaking, it's not a problem with the book at all; rather, it's an inexorable implication of Conway's Law. To the extent you buy into that relationship, it means that no software organization problem is merely technical, and no organizational problem can be fixed by sliding names around on an org chart. Ostensibly, this may make change seem harder. In reality, I believe it points out that no change you may have believed easy was ever really easy at all.

Author Allan Kelly has done some writing on Conway's Law, as well, and I believe his assertion here carries deep implications:

More than ever I believe that someone who claims to be an Architect needs both technical and social skills, they need to understand people and work within the social framework. They also need a remit that is broader than pure technology -- they need to have a say in organizational structures and personnel issues, i.e. they need to be a manager too.

https://www.allankellyassociates.co.uk/archives/1169/return-to-conways-law/

Perhaps this connection of technology and organization helps explain the tremendous challenges found in digital transformation. I'm reminded of a conference speech I attended several years ago. The speaker was a CTO / founder of a red-hot technical startup, and he was sharing his secrets to organizational agility. "Start from scratch and do only agile things," to paraphrase, but only slightly. Looking around the room at the senior leaders from very large corporations in attendance, I saw almost exclusively despair. While a neat summation of this leader's agile journey, that approach can't help effect change in the slightest for an established enterprise.

Does Team Topologies help this problem? Indirectly, I believe it can. Overwhelmingly, just highlighting the deep connection between people, software, and flow is a huge insight. Lots of other books and presentations kick in to address flow, but the background and earlier research documented by Skelton and Pais in the early chapters of the book helped create a deeper sense of "why" for me. The lightbulb moment may be trite, but it's appropriate for me in this case. I found the book really eye-opening, and it's earned a spot on my short list of books for any business or technical leader.

Agile Enterprise, Part 1

I’ve recently had occasion to see the same challenge pop up in a couple different settings -- namely, bridging the gap between agile development and integration into an enterprise.  In both of the cases I saw, development employing agile practices did a good job of producing software, but the actual release into the enterprise wasn’t addressed in sufficient detail, and problems arose - in one case, resulting in wadding up a really large-scale project after a really large investment in time, energy, and development.

Although the circumstances, scope, and impact of problems like this vary from one project to the next, all the difficulties I’ve seen hinge on one or more of these cross-cutting areas: roadmap, architecture, or testing / TDD / CI.

Of these three, roadmap issues are by far the most difficult and the most dangerous.  Creating and understanding the roadmap for your enterprise requires absolute synchronization and understanding across the business and IT areas of your company, and weaknesses in either area will be exposed.  Simply stated, software can’t fix a broken business, and business will struggle without high-quality software, but business and software working well together can power growth and profit for both.

Roadmap issues, in a nutshell, address the challenge of breaking waterfall delivery of enterprise software into releases small enough to reduce the risk seen in a traditional delivery model.  The web is littered with stories of waterfall projects that run years over their schedule, finally failing without delivering any benefit at all. Agile practice attempts to address this risk with the concepts of Minimum Viable Product (MVP) and Minimum Marketable Feature (MMF), but both are frequently misunderstood, and neither, by themselves, completely address the roadmap threat.

MVP, taken at its definition, is really quite prototype-like.  It’s a platform for learning about additional features and requirements by deploying the leanest possible version of a product.  This tends to be easier to employ in green-field applications -- especially ones with external customers -- because product owners can be more willing to carve out a lean product and “throw it out there” to see what sticks.  Trying to employ this in an enterprise or with a replace / rewrite / upgrade scenario is destined to fail.

Addressing the same problem from a slightly different vector, MMF attempts to define a more robust form of “good enough”, but at a feature-by-feature level.  In this case, MMF describes functionality that would actually work in production to do the job needed for that feature -- a concept much more viable for those typical enterprise scenarios where partial functionality just isn’t enough.  Unfortunately, MMF breaks down when you compound all the functionality for each feature by all the features found in a large-scale enterprise system. Doing so really results in vaulting you right back into the huge waterfall delivery mode, with all its inherent pitfalls.

In backlog grooming and estimation, teams look for cards that are too big to fit into a sprint -- these cards have to be decomposed before they can fit into an agile process.  In the same way, breaking huge projects down by MVP or MMF also must occur with a consideration for how release and adoption will occur, and releasing software that nobody uses doesn’t count!

Architects and developers recognize this sticking point, because it’s the same spot we get into with really large cards.  When we decompose cards, we look for places to split acceptance criteria, which won’t always work for enterprise delivery, but with the help of architecture, it may be possible to create modules that can be delivered independently.  Watch for that topic coming soon.

Armed with all the techniques we can bring to bear to decompose large-scale enterprise software, breaking huge deliveries into an enterprise roadmap will let you and your organization see software delivery as a journey more than as a gigantic event.  It’s this part that’s absolutely critical to have embraced by both business and IT. The same partnership you’re establishing with your Product Owners in agile teams has to scale up to the whole enterprise in order for this to work. The trust you should be building at scrum-team-scale needs to scale up to your entire enterprise.  No pressure, right? Make no mistake -- enterprise roadmap planning must be visible and embraced at the C-level of your enterprise in order to succeed.

Buy-in secured, a successful roadmap will exhibit a couple key traits.  First, the roadmap must support and describe incremental release and adoption of software.  Your architecture may need attention in order to carve out semi-independent modules that can work with one another in a loosely-coupled workflow, and you absolutely need to be able to sequence delivery of these modules in a way that lets software you’ve delivered yield real value to your customers.  The second trait found in a roadmap of any size at all is that it’s nearsighted: the stuff closer to delivery will be much more clearly in-focus than the stuff further out. If you manage your roadmap in an agile spirit, you’ll find that your enterprise roadmap will also change slightly over time -- this should be expected, and it’s a reflection of your enterprise taking on the characteristics of agile development.

Next up, I’ll explore some ways architecture can help break that roadmap into deliverable modules.

 
