Your Enterprise Architect

Monday, January 30, 2012

WIF and Claims Based Security Tokens

Windows Identity Foundation

Security is a deceptively simple problem with just two parts: authenticating and authorizing users. The difficulty is that standards have been lacking, and interoperability is all but impossible without them. There have been quite a few advancements recently, however, that I want to point out and discuss in this article.

Standard Challenges

In a typical security scenario, an application either contains a custom forms-based authentication system (web) that uses a username or email and password combination to authenticate the user, or it relies on the corporate authentication system (intranet) to do the same. The application then queries its user repository to retrieve the information that determines the user's level of access, preferences and anything else that might be needed. The user repository is typically local to the application (a database, for example), or it is the user directory managed by the authentication system (Active Directory) if only minimal information needs to be stored (role information, for example). This information is then used to authorize the user's level of access to the system.

Most applications require more information than just a username and password to customize the user experience. To provide the convenience of pre-filled forms, information such as the personal, business and shipping address is usually needed as well. Business applications typically need role information that is more fine-grained than the roles stored in the directory. These roles are used to customize access and to make sure that the right users have the features they need to do their work efficiently. Querying a repository each time this information is needed can be cumbersome, especially when that directory is not local to the system. The other option is to store specific user information within the application's native repository. The problem there is that the information is typically duplicated and distributed across all the places it is needed, and managing such a distributed system can be quite a challenge.

Claims based identity model

Claims-based security rests on the concept that the identity of a person can be represented by a set of claims about that person. A claim is a bit of information that describes the person in a way that is digitally useful. Claims typically contain the usual user name or email, but they can include much more: roles, phone numbers, zip codes, addresses, anything that applications typically use to customize the user experience. An authenticating system creates a security token containing those claims, and applications can customize their behavior using the content of those claims. Such a system is especially beneficial in a federated environment, where the federated system can construct the claims token within the local network and make that information available to a system external to that network.
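To make the idea of "a set of claims" concrete, here is a minimal sketch. It assumes the System.Security.Claims types that ship with .NET 4.5 and later, which may differ from the WIF assemblies you have installed, and every value in it (user name, role name, the "SampleSts" authentication type) is made-up sample data.

```csharp
using System.Collections.Generic;
using System.Security.Claims;

public static class ClaimsSketch
{
    // Builds an identity from the kinds of claims described above.
    // All values are sample data, not taken from a real directory.
    public static ClaimsIdentity BuildSampleIdentity()
    {
        var claims = new List<Claim>
        {
            new Claim(ClaimTypes.Name, "jdoe"),
            new Claim(ClaimTypes.Email, "jdoe@example.com"),
            new Claim(ClaimTypes.Role, "OrderApprover"),
            new Claim(ClaimTypes.PostalCode, "10001")
        };
        // The second argument is the authentication type; supplying a
        // non-empty value marks the identity as authenticated.
        return new ClaimsIdentity(claims, "SampleSts");
    }

    // The application decides what to show by reading claims instead of
    // querying its own user repository.
    public static bool CanApproveOrders(ClaimsIdentity identity)
    {
        return identity.HasClaim(ClaimTypes.Role, "OrderApprover");
    }

    // Returns the postal code claim, or null when the token did not carry one.
    public static string PostalCodeOf(ClaimsIdentity identity)
    {
        Claim claim = identity.FindFirst(ClaimTypes.PostalCode);
        return claim == null ? null : claim.Value;
    }
}
```

In a real application the identity would come from the validated token rather than being built by hand; the point here is only that roles and profile data travel with the user.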

There are several token standards, such as the Simple Web Token (SWT) and the XML-based Security Assertion Markup Language (SAML), and there is currently a proposal to create a standard JSON-formatted security token called the JSON Web Token (JWT). There were several drivers for creating these standards. Cookies were proving insufficient to contain all of this information because of their size limitations, federation requires a standard way to communicate security information across network boundaries, and web services need a common security language shared by the communicating parties.

So who issues these tokens?
Many systems issue these tokens, ranging from the Kerberos-driven domain controllers in your Windows network to the Facebook token that is generated when you log in to Facebook. Google, Yahoo and Microsoft Live all generate claims-based security tokens. These tokens are digitally signed by the issuing authority, and it's up to the application to determine whether it trusts the issuer.

The System

The systems that issue these tokens are called security token services (STS). They build the token, sign it and then return it to the calling application. These systems follow these steps to generate the token:

1. The user interacts with an application.
2. The application requests a claims-based token from the STS.
3. The STS authenticates the user and queries the directory for claims relating to the user.
4. The STS builds the token and returns it to the application, and the application uses the claims to authorize the user.

The application can be configured to accept tokens from various trusted issuers. If the application trusts the issuer, it can assume that the claims contained in the token are correct and authorize the user's level of access. The benefit is that the responsibility for authenticating users and assembling user information is removed from the application, and the user information is centralized and easier to manage. This does assume that an administrator or administrative system has configured the STS with the claims the application will need. That is not an extra step, because it would have had to be done anyway; the difference is where it is done. The information is centralized and, even in a complex system, easy to manage.
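The trust decision described above can be sketched as follows. TrustedIssuerPolicy is a hypothetical helper, not a WIF type, and it assumes the token's signature has already been verified by the framework, so the Issuer value on each claim can be believed. The issuer URLs used with it are placeholders.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Claims;

// Hypothetical relying-party helper: honors claims only from issuers it trusts.
public sealed class TrustedIssuerPolicy
{
    private readonly HashSet<string> _trustedIssuers;

    public TrustedIssuerPolicy(IEnumerable<string> trustedIssuers)
    {
        _trustedIssuers = new HashSet<string>(trustedIssuers, StringComparer.OrdinalIgnoreCase);
    }

    // Returns true only when a trusted issuer asserted the required role.
    // Role claims from unknown issuers are ignored rather than believed.
    public bool IsInRole(IEnumerable<Claim> claims, string role)
    {
        return claims.Any(c =>
            c.Type == ClaimTypes.Role &&
            c.Value == role &&
            _trustedIssuers.Contains(c.Issuer));
    }
}
```

The application keeps no user table here; the only thing it has to maintain is the list of issuers it is willing to believe.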

By organizing the system in this way, the STS becomes an expert system that knows everything about users, knows how to secure the user information and knows how to secure the transmission of that information.

Implementing all of this is a set of .NET libraries called Windows Identity Foundation (WIF). These libraries make it easy to integrate an STS into your application.
With WIF you'll be able to receive a claims-based token, verify the signature and read the claims it contains. WIF supports tokens created in the SAML 1.1 or SAML 2.0 formats. ADFS 2.0 and ACS are able to issue tokens in those formats, and WIF can work with those issuers as well as with other STSs, as long as they produce tokens using SAML version 1.1 or higher.

STSs communicate using the WS-* standards, and policy is retrieved with an HTTP GET. The claims token is written using the SAML standard mentioned above, and all of that together provides an interoperable way for STSs to communicate with each other. This is especially useful in a federated system, where STSs in different environments and on different platforms need to talk to each other.

In the next few blogs I'll show example solutions implementing WIF in a federated environment as well as using WIF to integrate with Windows Live, and Facebook.

Sunday, December 18, 2011

Building a windows service that never fails

A process that runs as a Windows service is meant to provide a service or do work without stopping. It should always be there to provide that service or do the work when it needs to be done; it should never fail. These services could be running mission-critical processes: executing trades for banks, transactions for e-commerce platforms, insurance claims for insurers, and the list goes on. Almost all companies have these systems, and the impact a failed service can have on a company and its resources can be significant. People either have to scramble to work through a backlog or are called outside office hours to restart a failed system.

These services fail all too often, though, and for really bad reasons. For mature products (that have gone through a thorough QA cycle) the cause is most often the unavailability of another resource. This can be a database that is busy and generating timeout errors or has gone temporarily offline (a reboot or similar), an external web service that is not available, or any number of other resources with availability issues.

In this blog I'm going to show you how to build high availability windows services that can deal with these and many other issues, recover or even fix failures and continue to execute their mission critical task with minimal delay.

I started working on this problem at Merrill Lynch. The director I was working for suggested, at one of the many design meetings I had with him, that I figure out a way to design a service pattern that could withstand temporary database, network and other service outages. That initial design has evolved through many iterations, and this one is, I think, the cleanest and most concise. It can be adapted to work outside of the service framework as well, by incorporating it into an exception management framework, for example.

The Service State Engine

This solution uses the 'State Engine' or 'State' design pattern from the 'Gang of Four' to solve the problem. A service is typically in the 'Running' state, but when exceptions occur that can't be handled, the service changes into its 'Diagnostic' state. In this state the service tries to determine what happened. It may probe the environment to come to a conclusion, but typically a regular expression analysis of the exception message is enough to diagnose the error. The state of the service then changes to the 'Recovery' state. Diagnostic information is provided to the recoverer, and depending on the diagnosis it can take different actions to get the system back to the 'Running' state. In the case of an outage, for example, the recovery state may simply wait until the outage is resolved and then resume work by changing back to the 'Running' state. Often the recoverer will actively take steps to 'fix' a problem. This could be starting an external service that it depends on, or any other action that a user would have to take to get the system back into working condition.

Design & Implementation

This solution is easy to apply to any given service because the RunningState is essentially the service as it would be implemented without the Diagnostic and Recovery states. By separating the Diagnostic and Recovery states into their own classes we give each a single responsibility, which keeps the solution clean and ensures that the states don't muddle each other's logic. Transitions from one state to another are managed through the ServiceController. The service states fire events as necessary, and the controller receives them and assigns a new state to handle each event (shown below).
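As a rough sketch of that separation, each state can derive from a small base class that only knows how to start and how to report an outcome. The shape below is an assumption for illustration: it uses the built-in EventHandler<T> rather than the generic EventArgs wrapper that appears in the controller snippet later on, and AlwaysFailingState exists purely to demonstrate the event.

```csharp
using System;

public enum StateEvent { Errored, Diagnosed, Recovered, Stopped }

public class StateEventArgs : EventArgs
{
    public StateEvent state { get; set; }
    public object stateDetails { get; set; }
}

// Hypothetical base class: a state does its work in Start() and reports the
// outcome through StatusChanged instead of choosing the next state itself.
public abstract class ServiceStateBase
{
    public event EventHandler<StateEventArgs> StatusChanged;

    public abstract void Start();

    protected void SendStatusEvent(StateEventArgs args)
    {
        // Copy to a local so a concurrent unsubscribe cannot null the delegate.
        var handler = StatusChanged;
        if (handler != null) handler(this, args);
    }
}

// Demonstration-only state that immediately reports an error.
public class AlwaysFailingState : ServiceStateBase
{
    public override void Start()
    {
        SendStatusEvent(new StateEventArgs
        {
            state = StateEvent.Errored,
            stateDetails = new InvalidOperationException("simulated failure")
        });
    }
}
```

Because the states never reference each other, a new state can be added without touching the existing ones; only the controller's event handler has to learn about it.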

Services tend to follow a set pattern: a list of work items is retrieved and each item is processed, and when the list is complete a new batch of work items is retrieved and worked on, ad infinitum. Here's an example of what the RunningState's Start() method would look like in a generic sense.

       public void Start()
        {
            try
            {
                Queue workList = getWorkItems();
                while (workList.Count > 0)
                {
                    try
                    {
                        var workItem = workList.Dequeue();
                        doWork(workItem);
                        if (workList.Count == 0)
                            workList = getWorkItems();
                    }
                    catch (Exception ex)
                    {
                        logger.Error("Exception encountered during gateway processing", ex);
                        SendStatusEvent(new StateEventArgs { state = StateEvent.Errored, stateDetails = ex });
                    }
                }
            }
            catch (Exception outerEx)
            {
                logger.Error("Outer exception raised", outerEx);
                SendStatusEvent(new StateEventArgs { state = StateEvent.Errored, stateDetails = outerEx });
            }
        }

This type of solution needs two catch blocks: one inside the loop that works on each work item, and an outer catch-all that catches anything else. The inner catch could also have been incorporated into the 'doWork()' method. The inner catch block ensures that the remaining work items in the list still get processed.
When an exception is caught, it is logged and a status event is fired for the controller to catch.

The code below is a snippet from the service controller, and it is really the heart of the controller. When the states fire events, the controller intercepts them and assigns a new state manager as necessary to deal with the event. Line 5 shows the controller handling an error event: it changes state to the diagnosticState and sets the exception property for the diagnostic to analyze. The responsibility of the diagnostic is, as its name implies, to diagnose the error. It is usually enough to analyze the error message, but it may sometimes be necessary to probe the environment if the error string doesn't provide enough information for a complete diagnosis.

void ServiceStateBase_StatusChanged(object sender, EventArgs<StateEventArgs> e)
{
    switch (e.EventData.state)
    {
        case StateEvent.Errored:
            ChangeState(diagnosticState);
            diagnosticState.SetException((Exception) e.EventData.stateDetails);
            diagnosticState.Start();
            break;
        case StateEvent.Diagnosed:
            ChangeState(recoveryState);
            recoveryState.Diagnosis = (Diagnostic) e.EventData.stateDetails;
            recoveryState.Start();
            break;
        case StateEvent.Recovered:
            ChangeState(runningState);
            runningState.Start();
            break;
        case StateEvent.Stopped:
            ChangeState(runningState);
            // Doesn't call Start() automatically; waits for the user to do so.
            break;
    }
}

The diagnosis of an error is very specific to the environment, dependencies and logic of the solution that has been implemented. Solutions tend to exhibit failure trends that start to show themselves as soon as a product goes into integration testing, and by the time the product goes into production you'll have encountered most of the failure scenarios. As you encounter them, you should update the diagnostic to recognize them and the recoverer to deal with them. The code snippet below deals with the most common errors discussed above. We first deal with the sudden unavailability of external systems. In the sample solution that accompanies this blog, those are the database and the message bus. We can recognize these issues by simply looking at the exception type or analyzing the exception message.

protected void Diagnose(Exception e)
{
    if (MessageBusError(e))
    {
        diagnosis = Diagnostic.MessageBusUnavailable;
    }
    else if (DatabaseError(e))
    {
        diagnosis = Diagnostic.DatabaseError;
    }
    else if (errorsInInterval() < 100)
    {
        diagnosis = Diagnostic.Unknown;
    }
    else
    {
        logger.Fatal("Fatal error occurred", e);
        diagnosis = Diagnostic.FatalForUnknownReasons;
    }
}
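The MessageBusError and DatabaseError helpers are not shown in this post, so here is one way they might look, as a sketch only. The regular expressions are illustrative guesses; the real messages depend on the database driver and message bus client you use, and you should replace them with the text you actually see in your logs.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ErrorClassifiers
{
    // Illustrative patterns only; adjust them to the messages your driver
    // and broker client really produce.
    private static readonly Regex DatabasePattern =
        new Regex(@"timeout expired|cannot open database|network-related", RegexOptions.IgnoreCase);
    private static readonly Regex MessageBusPattern =
        new Regex(@"broker unreachable|connection refused|channel.*closed", RegexOptions.IgnoreCase);

    public static bool DatabaseError(Exception e)
    {
        return MatchesAnywhereInChain(e, DatabasePattern);
    }

    public static bool MessageBusError(Exception e)
    {
        return MatchesAnywhereInChain(e, MessageBusPattern);
    }

    // Walks the InnerException chain so that a low-level error wrapped by a
    // higher-level exception is still recognized.
    private static bool MatchesAnywhereInChain(Exception e, Regex pattern)
    {
        for (Exception current = e; current != null; current = current.InnerException)
        {
            if (pattern.IsMatch(current.Message)) return true;
        }
        return false;
    }
}
```

Checking the exception type first, where the client library exposes a specific one, is more robust than message matching; the regex approach is the fallback for libraries that throw generic exceptions.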

Other exceptions may occur, so I've added a counter and interval check to recognize a sudden surge in exceptions, which could indicate a new failure scenario that is as yet unhandled. If that's the case, the system will report it and diagnose the error as a fatal error.
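The counter behind errorsInInterval() isn't shown here either. A minimal sketch of a sliding-window counter follows; ErrorRateTracker is a hypothetical name, it assumes errors are recorded in chronological order, and it is not thread-safe as written.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sliding-window counter for recent errors.
public sealed class ErrorRateTracker
{
    private readonly TimeSpan _interval;
    private readonly Queue<DateTime> _timestamps = new Queue<DateTime>();

    public ErrorRateTracker(TimeSpan interval)
    {
        _interval = interval;
    }

    // Records one error at the given time (expected in chronological order).
    public void Record(DateTime when)
    {
        _timestamps.Enqueue(when);
    }

    // Counts the errors recorded within the interval ending at 'now',
    // discarding any that have aged out of the window.
    public int ErrorsInInterval(DateTime now)
    {
        while (_timestamps.Count > 0 && now - _timestamps.Peek() > _interval)
        {
            _timestamps.Dequeue();
        }
        return _timestamps.Count;
    }
}
```

Passing the time in as a parameter, rather than reading the clock inside the class, keeps the counter easy to test.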

Recovery from a database failure is very simple: you wait a set amount of time and then test the connectivity. When the database is available again, a "recovered" event is fired and the controller sets the state back to the "runningState". This recovery scenario applies to most external systems that go offline for some reason.
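The wait-and-test loop can be sketched like this. OutageRecovery and its parameters are hypothetical names; the probe is whatever cheap connectivity check suits your resource, and the attempt limit is an addition of this sketch so the loop cannot spin forever.

```csharp
using System;
using System.Threading;

public static class OutageRecovery
{
    // Polls 'isAvailable' until it returns true or 'maxAttempts' is used up.
    // Returns the number of probes it took, or -1 if the resource never came back.
    public static int WaitUntilAvailable(Func<bool> isAvailable, TimeSpan delay, int maxAttempts)
    {
        for (int attempt = 1; attempt <= maxAttempts; attempt++)
        {
            bool up;
            try
            {
                up = isAvailable();
            }
            catch (Exception)
            {
                // A probe that throws is treated as "still down".
                up = false;
            }

            if (up) return attempt;

            // No point sleeping after the final failed attempt.
            if (attempt < maxAttempts) Thread.Sleep(delay);
        }
        return -1;
    }
}
```

A recovery state would fire its "recovered" event when this returns a positive number, and escalate (for example by reporting a fatal diagnosis) when it returns -1.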

In real-life scenarios, I've had to deal with external services that would periodically fail but that I could programmatically restart from within the service I was building. In those instances the recoverer would execute the restart logic, and a failure that previously would have killed the service was recognized and fixed within milliseconds of occurring. In fact this previously common failure scenario was all but considered fixed: the issue kept recurring, but it no longer had an impact on dependent services. By far the most common recovery scenario, though, is simply waiting for an external system to come back online.

The beauty of this system is that it works really well. In fact it works so well that it's sometimes easier to recover from a defect in another system than it is to fix that system. That is something you have to keep in mind: don't just treat the symptoms, still go after the root cause. The benefit of having a system that can solve problems instead of just quitting, however, is a much more stable environment. In an enterprise of dependent systems failures occur often, and it's nice not to be called out of bed to restart a service and try to figure out what happened. Rebooting a machine or restarting a database no longer requires the coordinated effort of an operations team spread out over many divisions when the services that depend on it can deal with the temporary outage.

The entire solution can be found on this blog's companion website, the source code repository at CodePlex. The LogService is built using the Service State Engine described in this blog.

Happy Coding.

Saturday, November 12, 2011

Enterprise solution for Logging

This blog and its companion blog are about building example solutions for a distributed architecture. The whole idea behind a distributed architecture is to loosely couple expert systems together using a messaging infrastructure. As examples for expert systems we use Notifications and Logging because almost all projects include them, and as such they are great candidates for being transformed into expert systems. This blog focuses on Logging.

Almost all development projects include a log solution. Projects typically choose log4j, log4net or a custom solution. For large companies these log solutions can cause problems. The benefit of logs is that they contain information about the running of an application, the work it has done and the problems it has encountered along the way. This information is valuable to development teams for learning how the application behaved and what needs to be done to improve it. This great benefit can also become a liability: in the wrong hands the information can be used to discover the application's vulnerabilities. Log systems also typically generate volumes of content. In an enterprise with a great many applications under management, this volume of information can be excessive and very expensive to store, organize and manage. Just making sure that developers configure their applications according to the corporation's standards requires many meetings and a lot of coordination. Disk capacity for logs has to be ensured to avoid failure. Ultimately the typical log solution tightly couples the system to a component that is not mission critical.

In conclusion, logging needs to be a managed process that is standardized, secured, monitored and decoupled from the primary application, leaving the application to do what it needs to do without having to worry about something that is not central to its task. This way each environment can be set up according to the needs of that environment. A development environment will save all log events, and successive environments will log less, until the product reaches production, at which point only the most important messages are logged. Having a centralized logging system ensures that all environments behave in the same way. There's no need for coordination: app developers are given components to integrate with and are not allowed to write to disk. An expert system is required. As this is a series on distributed expert systems, let's look at a solution that uses an advanced messaging system at its core.

As you can see from the image, the solution will end up including a notification system, but that will be described in another blog.

Applying the pattern to Logging

We've discussed advanced messaging before, as well as the ever so reliable product RabbitMQ, which is built on Erlang, the distributed language that originated in telecom. The message bus solution can be clustered and therefore scales well, and it is very fast and reliable. It is also available as a cloud solution from VMware. The advantage of using a message bus between an application and an expert system is that it completely decouples the two. There's one universal solution to logging, with one standard that is set by the corporation, and it can include any number of components to ensure that the content is secured and far away from 'prying eyes'.

Putting such a system together is very easy. All it requires is a very simple facade that hands off the logging functionality to the message bus, to be picked up by the 'expert' logging system. That expert system is at the very least as complicated as the original solution (which was implemented umpteen times across all applications), but now that it is centralized we can pay a little more attention to making sure the logs are secured, both in terms of content and in terms of access.

Implementation

Before I start to code a solution, I usually try to define my namespaces by looking at my layer diagram. Logging is a utility, and there will be many other utilities: notifications, exception management, session management, caching, WMI instance logging, and the list goes on. Most of these utilities are candidates for standardization, even going so far as to create dedicated expert systems for them. So a utilities namespace makes sense at the root level. The message bus is part of the infrastructure, much like a database server or the web services that we may consume, so the infrastructure namespace will also exist off the root. The expert systems that provide these types of plumbing services I have aggregated into a 'Monitor' namespace. I'm not sure that's the right name, but it will do for now.

The layer diagram looks like this.

The logging interface: for the purposes of this example we define an extremely simple interface.

namespace YEA.Utilities.Logging
{
    public interface ILog
    {
        void write(string message, LogType level);
    }
}

The Interface is implemented by a Logger class that implements the write method as follows.

        public void write(string message, LogType level)
        {
            var sc = new LogToMessageConverter(level);
            var publisher = _gf.GetPublisher(_publisherName, sc.ConvertObjectToMessage);
            publisher.Publish(message);
        }

The serialization of the log message is implemented by the following:

public static class LogPropertiesStandards
{
    public const string LogLevelPropertyName = "LogLevel";
}
public class LogToMessageConverter : YEA.Infrastructure.Gateway.Converter.StringToMessageConverter
{
    public LogType LogLevel { get; set; }
    public LogToMessageConverter(LogType level)
    {
        LogLevel = level;
    }
    public override Message ConvertObjectToMessage(RabbitMQ.Client.IModel channel, object packetToSend)
    {
        Message message = base.ConvertObjectToMessage(channel, packetToSend);
        message.Properties.Headers.Add(LogPropertiesStandards.LogLevelPropertyName, (int)LogLevel);
        message.RoutingKey = UtilitiesHelper.getMyProgramFullName();
        return message;
    }
}

The converter derives from the simple string-to-message converter (line 5), and we get the message by calling the base class conversion method (line 14). All that remains is to set a few properties and the routing key. The routing key is essentially the application namespace of the calling executable. The log directories on the receiving side replicate the application namespace. If the namespace follows industry standards, the log files of each department end up in the same subtree, which simplifies securing and granting access to that subtree.
namespace: Company.Organization.Department.BusinessUnit.ApplicationGroup.Application will result in
directory: <drive>:Company/Organization/Department/BusinessUnit/ApplicationGroup/Application.Log

The expert logging system uses log4net with dynamic configuration to save log entries to the directories shown above. Here's a code extract of the important bits.

using log4net;
using log4net.Repository.Hierarchy;
using log4net.Core;
using log4net.Appender;
using log4net.Layout;
using System;
using System.IO;
using System.Configuration;
using System.Text;

namespace YEA.Monitor.LogManager
{
    public static class Logger
    {
        private const string LOG_PATTERN = "%d [%t] %-5p %m%n";
        private const string ConfigDriveKey = "LogDrive";
        private static char Drive;

        static Logger()
        {
            Drive = GetDriveLetter();
            log4net.LogManager.GetRepository().Configured = true;
        }
        private static char GetDriveLetter()
        {
            var driveLetter = ConfigurationManager.AppSettings[ConfigDriveKey];
            if (driveLetter == null)
                throw new ConfigurationErrorsException(string.Format("Configuration key: {0} is expected", ConfigDriveKey));
            return Convert.ToChar(driveLetter);
        }

        static void CreateDir(string name)
        {
            Directory.CreateDirectory(name);
        }
        private static string ConvertToPath(string name)
        {
            if( string.IsNullOrWhiteSpace(name) )
                throw new ArgumentException("A valid name has to be supplied to the logging application");
            var names = name.Split('.');
            if (names.Length == 1)
                // Return only the directory; CreateAppender appends the file name.
                return Drive + @":\InCorrectNamespaceDump";

            var builder = new StringBuilder(Drive + @":");
            for (int i = 0; i < names.Length - 1; i++)
            {
                builder.Append(@"\");
                builder.Append(names[i]);
            }
            
            return builder.ToString();
        }
        static IAppender CreateAppender(string repositoryName)
        {
            PatternLayout patternLayout = new PatternLayout();

            patternLayout.ConversionPattern = LOG_PATTERN;
            patternLayout.ActivateOptions();

            var path = ConvertToPath(repositoryName);
            CreateDir(path);
            var builder = new StringBuilder(path);
            builder.Append(@"\");
            // + 1 skips the dot itself; a name without dots is used whole.
            builder.Append(repositoryName.Substring(repositoryName.LastIndexOf('.') + 1));
            builder.Append(".log");
            var fileName = builder.ToString();

            RollingFileAppender roller = new RollingFileAppender();
            roller.Name = repositoryName;
            roller.Layout = patternLayout;
            roller.AppendToFile = true;
            roller.RollingStyle = RollingFileAppender.RollingMode.Size;
            roller.MaxSizeRollBackups = 4;
            roller.MaximumFileSize = "100KB";
            roller.StaticLogFileName = true;
            roller.File = fileName ;
            roller.ActivateOptions();
            return roller;
        }

        public static ILog Get(string name)
        {
            //var h = CreateHierarchy(name);
            var log = log4net.LogManager.GetLogger(name);
            var l = (log4net.Repository.Hierarchy.Logger)log.Logger;

            if (l.GetAppender(name) == null)
            {
                l.RemoveAllAppenders();
                l.AddAppender(CreateAppender(name));
                l.Level = Level.All;
            }
            
            return log;
        }
    }
}

The interesting bit in this implementation of log4net is that it has to dynamically create an appender when a new application (represented by its namespace) starts sending log messages. The CreateAppender method does that, and the Get method assigns the appender if the Logger it has retrieved doesn't have one.

So what happens in this scenario if the log directory accidentally gets full? The 'expert system' will likely fail. But because it is loosely coupled to the application that sent the log request, thanks to the message bus that sits between them, the application isn't affected. In fact it isn't even aware of the failure. The message bus will continue to collect log requests, queueing them up until the 'expert system' has resolved the issue. At that point all queued log requests are processed, and it's as if the failure never happened. The message bus gives the system the buffer (depending on its disk capacity) it needs to fix the inevitable temporary outages. The other advantage is that maintenance windows can be implemented without forcing all dependent applications to stop working. All of this adds up to a more stable system that can tolerate failure without causing a chain reaction of failures downstream, potentially affecting mission-critical applications.
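The buffering behavior can be illustrated with a toy model. BufferedLogPipeline below is an in-memory stand-in written for this illustration; it is not RabbitMQ, it is not durable, and it only mimics the way a queue lets the publisher carry on while the consumer is down.

```csharp
using System;
using System.Collections.Generic;

// In-memory stand-in for the message bus plus the expert logging system.
public sealed class BufferedLogPipeline
{
    private readonly Queue<string> _queue = new Queue<string>();
    private readonly List<string> _written = new List<string>();

    // Simulates whether the expert system is currently able to write logs.
    public bool ConsumerAvailable { get; set; }

    public int Pending { get { return _queue.Count; } }
    public IReadOnlyList<string> Written { get { return _written; } }

    // Application side: always succeeds, whether or not the consumer is up.
    public void Publish(string message)
    {
        _queue.Enqueue(message);
    }

    // Expert-system side: works through the backlog in order once available.
    // Returns the number of messages processed in this call.
    public int Drain()
    {
        if (!ConsumerAvailable) return 0;

        int processed = 0;
        while (_queue.Count > 0)
        {
            _written.Add(_queue.Dequeue());
            processed++;
        }
        return processed;
    }
}
```

Publishing three messages while ConsumerAvailable is false leaves three pending and none written; setting it to true and calling Drain() writes all three in their original order, which is the "as if the failure never happened" behavior described above.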

The added advantage of decoupling the logging system from the applications that use it is that there's now no direct relationship between the location of the log files and the location of the applications that generate them. The two could be located on completely separate networks or domains, with one not giving a clue to the location of the other. That makes it that much easier to prevent unwanted eyes from gaining access to an application's vulnerabilities.

Here is where you can find the entire source code solution discussed in this blog