Showing posts with label risk management. Show all posts
Showing posts with label risk management. Show all posts

Thursday, June 26, 2008

SSIS: Unit Testing

I've spent the past couple days putting together unit tests for SSIS packages. It's not as easy to do as it is to write unit & integration tests for, say, typical C# projects.

SSIS Data flows can be really complex. Worse, you really can't execute portions of a single data flow separately and get meaninful results.

Further, one of the key features of SSIS is the fact that the built-in data flow toolbox items can be equated to framework functionality. There's not so much value in unit testing the framework.

Excuses come easy, but really, unit testing in SSIS is not impossible...

So meaningful unit testing of SSIS packages really comes down to testing of Executables in a control flow, and particularly executables with a high degree of programability. The two most significant control flow executable types are Script Task executables and Data Flow executables.

Ultimately, the solution to SSIS unit testing becomes package execution automation.

There are a certain number of things you have to do before you can start writing C# to test your scripts and data flows, though. I'll go through my experience with it, so far.

In order to automate SSIS package execution for unit testing, you must have Visual Studio 2005 (or greater) with the language of your choice installed (I chose C#).

Interestingly, while you can develop and debug SSIS in the Business Intelligence Development System (BIDS, a subset of Visual Studio), you cannot execute SSIS packages from C# without SQL Server 2005 Developer or Enterprise edition installed ("go Microsoft!").

Another important caveat... you CAN have your unit test project in the same solution as your SSIS project. Due to over-excessive design time validation of SSIS packages, you can't effectively execute the SSIS packages from your unit test code if you have the SSIS project loaded at the same time. I've found that the only way I can safely run my unit tests is to "Unload Project" on the SSIS project before attempting to execute the unit test host app. Even then, Visual Studio occassionally holds locks on files that force me to close and re-open Visual Studio in order to release them.

Anyway, I chose to use a console application as the host app. There's some info out there on the 'net about how to configure a .config file borrowing from dtexec.exe.config, the SSIS command line utility, but I didn't see anything special in there that I had to include.

The only reference you need to add to your project is a ref to Microsoft.SqlServer.ManagedDTS. The core namespace you'll need is

using Microsoft.SqlServer.Dts.Runtime;

In my first case, most of my unit testing is variations on a single input file. The package validates the input and produces three outputs: a table that contains source records which have passed validation, a flat output file that contains source records that failed validation, and a target table that contains transformed results.

What I ended up doing was creating a very small framework that allowed me to declare a test and some metadata about it. The metadata associates a group of resources that include a test input, and the three baseline outputs by a common URN. Once I have my input and baselines established, I can circumvent downloading the "real" source file, inject my test source into the process, and compare the results with my baselines.

Here's an example Unit test of a Validation executable within my SSIS package:

[TestInfo(Name = "Unit: Validate Source, duplicated line in source", TestURN = "Dupes")]
public void ValidationUnitDupeLineTest()
{
using (Package thePackage = _dtsApp.LoadPackage(packageFilePath, this))
{
thePackage.DelayValidation = true;
DisableAllExecutables(thePackage);
EnableValidationExecutable(thePackage);
InjectBaselineSource(GetBaselineResource("Stage_1_Source_" + TestURN), thePackage.Variables["SourceFilePath"]);
thePackage.Execute(null, null, this, null, null);
string errorFilePath = thePackage.Variables["ErrorLogFilePath"].Value as string;
//throw new AbortTestingException();
AssertPackageExecutionResult(thePackage, DTSExecResult.Failure);
AssertBaselineAdjustSource(TestURN);
AssertBaselineFile(GetBaselineResourceString("Baseline_Stage1_" + TestURN), errorFilePath);
}
}

Here's the code that does some of the SSIS Package manipulation referenced above:


#region Utilities
protected virtual void DisableAllExecutables(Package thePackage)
{
Sequence aContainer = thePackage.Executables["Adjustments, Stage 1"] as Sequence;
(aContainer.Executables["Download Source From SharePoint"] as TaskHost).Disable = true;
(aContainer.Executables["Prep Target Tables"] as TaskHost).Disable = true;
(aContainer.Executables["Validate Source Data"] as TaskHost).Disable = true;
(aContainer.Executables["Process Source Data"] as TaskHost).Disable = true;
(aContainer.Executables["Source Validation Failure Sequence"] as Sequence).Disable = true;
(aContainer.Executables["Execute Report Subscription"] as TaskHost).Disable = true;
(thePackage.Executables["Package Success Sequence"] as Sequence).Disable = true;
(thePackage.Executables["Package Failure Sequence"] as Sequence).Disable = true;
}


protected virtual void DisableDownloadExecutable(Package thePackage)
{
Sequence aContainer = thePackage.Executables["Adjustments, Stage 1"] as Sequence;
TaskHost dLScriptTask = aContainer.Executables["Download Source From SharePoint"] as TaskHost;
dLScriptTask.Disable = true;
}


protected virtual void EnableValidationExecutable(Package thePackage)
{
Sequence aContainer = thePackage.Executables["Adjustments, Stage 1"] as Sequence;
TaskHost validationFlow = aContainer.Executables["Validate Source Data"] as TaskHost;
validationFlow.Disable = false;
}

protected virtual void EnableValidationExecutable(Package thePackage)
{
Sequence aContainer = thePackage.Executables["Adjustments, Stage 1"] as Sequence;
TaskHost validationFlow = aContainer.Executables["Validate Source Data"] as TaskHost;
validationFlow.Disable = false;
}



Another really handy thing to be aware of...

IDTSEvents

I highly recommend you implement this interface and pass it into your packages. Of course, in each event handler in the interface, implement code to send reasonable information to an output stream. Notice the call to thePackage.Execute, way up in the first code snippet... the class that contains that method implements that interface, so I can manipulate (when necessary) how to handle certain events.

Interestingly, I haven't needed to do anything fancy with that so far, but I can imagine that functionality being very important in future unit tests that I write.

Here's a visual on all the resources... the image shows SSMS over VS, with both database tables and project resources with common URNs to relate them.



I won't get into the details of the framework functionality, but I found it useful to be able to do things like set a flag to rebuild baseline resources from current outputs, and such.

I modeled some of my framework (very loosely) functionality on the Visual Studio Team System Edition for Testers, which we used on the TWM ISIS project.

Another interesting lesson learned: I can see that the folks who built SSIS were not avid unit testers themselves. SSIS Executables have a "Validate()" method. I encountered lots of problems when I tried to use it. Hangs, intermittent errors, all that stuff that testing should have ironed out.

Saturday, May 17, 2008

Artless Programming

So maybe I am strange... I actually have printed snips of source code and UML diagrams and hung them on my office wall because I found them inspirational.

Reminds me of a quote from The Matrix movies...
Cypher [to Neo]: "I don't even see the code. All I see is blonde, brunette, red-head." :)

It's not quite like that, but you get the point. There's gotta be a back-story behind the witty writing. I suspect it has something to do with a programmer appreciating particularly elegant solutions.

One of the hard parts about knowing that programming is an artful craft is being forced to write artless code. It happens all the time. Risks get in the way... a risk of going over budget, blowing the schedule, adding complexity, breaking something else.

It all builds up. The reality is, as much as we software implementers really want application development to be an art, our business sponsors really want it to be a defined process.

The good news for programmers is that every application is a custom application.

It really sucks when you're surgically injecting a single new business rule into an existing, ancient system.

This is the case with one of my current clients. At every corner, there's a constraint limiting me. One false move, and whole subsystems could fail... I have such limited visibility into those subsystems, I won't know until after I deploy to their QA systems and let them discover it. If I ask for more visibility, we risk scope creep. The risks pile up, force my hand, and I end up pushed into a very tightly confined implementation. The end result is awkward, at best. It's arguably even more unmaintainable.

These are the types of projects that remind me to appreciate those snips of inspirational code.

Don't get me wrong. I'm happy there's a fitting solution within scope at all. I'm very happy that the client's happy... the project's under budget and ahead of schedule.

The "fun" in this case, has been facing the Class 5 rapids, and finding that one navigable path to a solution.

See also:
politechnosis: Art & Science

Sunday, May 4, 2008

Multiprocessing: How 'bout that Free Lunch?

I remember reading an article, a few years back...

The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software

Its tagline: "The biggest sea change in software development since the OO revolution is knocking at the door, and its name is Concurrency."

Mr. Sutter's article suggests that because CPUs are now forced to improve performance through multi-core architectures, applications will need to typically employ multi-threading to gain performance improvements on newer hardware. He made a great argument. I remember getting excited enough to bring up the idea to my team at the time.

There are a number of reasons why the tag line and most of its supporting arguments appeared to fail, and in retrospect, could have been predicted.

So in today's age of multi-core processing, where application performance gains necessarily come from improved hardware throughput, why does it still feel like we're getting a free lunch?

To some extent, Herb was right. I mean, really, a lot of applications, by themselves, are not getting as much out of their host hardware as they could.

Before and since this article, I've written multi-threaded application code for several purposes. Each time, the threading was in UI code. The most common reason for it: to monitor extra-process activities without blocking the UI message pump. Yes, that's right... In my experience, the most common reason for multi-threading is, essentially, to allow the UI message pump to keep pumping while waiting for… something else.

But many applications really have experienced significant performance improvements in multi-processor / multi-core systems, and no additional application code was written, changed, or even re-compiled to make that happen.

How?


  • Reduced inter-process contention for processor time
  • Client-server architectures (even when co-hosted, due to the above)
  • Multi-threaded software frameworks
  • Improved supporting hardware frameworks
Today's computers are typically doing more, all the time. The OS itself has a lot of overhead, especially Windows-based systems. New Vista systems rely heavily on multi-processing to get performance for the glitzy new GUI features.

The key is multi-processing, though, rather than multi-threading. Given that CPU time is a resource that must be shared, having more CPUs means less scheduling collision, less single-CPU context switching.

Many architectures are already inherent multi-processors. A client-server or n-tier system is generally already running on a minimum of two separate processes. In a typical web architecture, with an enterprise-grade DBMS, not only do you have built-in “free” multi-processing, but you also have at least some built-in, “free” multi-threading.

Something else that developers don’t seem to have noticed much is that some frameworks are inherently multi-threaded. For example the Microsoft Windows Presentation Foundation, a general GUI framework, does a lot of its rendering on separate threads. By simply building a GUI in WPF, your client application can start to take advantage of the additional CPUs, and the program author might not even be aware of it. Learning a framework like WPF isn’t exactly free, but typically, you’re not using that framework for the multi-threading features. Multi-threading, in that case, is a nice “cheap” benefit.

When it comes down to it, though, the biggest bottlenecks in hardware are not the processor, anyway. The front-side bus is the front-line to the CPU, and it typically can’t keep a single CPU’s working set fresh. Give it a team of CPUs to feed, and things get pretty hopeless pretty quick. (HyperTransport and QuickPath will change this, but only to the extent of pushing the bottle necks a little further away from the processors.)

So to re-cap, to date, the reason we haven’t seen a sea change in application software development is because we’re already leveraging multiple processors in many ways other than multi-threading. Further, multi-threading options have been largely abstracted away from application developers via mechanisms like application hosting, database management, and frameworks.

With things like HyperTransport (AMD’s baby) and QuickPath (Intel’s), will application developers really have to start worrying about intra-process concurrency?

I throw this one back to the Great Commandment… risk management. The best way to manage the risk of intra-process concurrency (threading) is to simply avoid it as much as possible. Continuing to use the above mentioned techniques, we let the 800-lb gorillas do the heavy lifting. We avoid struggling with race conditions and deadlocks.

When concurrent processing must be done, interestingly, the best way to branch off a thread is to treat it as if it were a separate process. Even the .NET Framework 2.0 has some nice threading mechanisms that make this easy. If there are low communications needs, consider actually forking a new process, rather than multi-threading.

In conclusion, the lunch may not always be free, but a good engineer should look for it, anyway. Concurrency is, and will always be an issue, but multi-core processors were not the event that sparked that evolution.