Cristian Satnic | OdeToData

Deploying SQL Server schema changes using Flyway migration scripts

February 28, 2020February 28, 2020 by Cristian Satnic

I was recently involved in a project using PostgreSQL on Linux where my goal was to “sprinkle some devops” on the database development process by getting the database schema in source control and automating the schema change deployments. For such a task, in the Microsoft SQL Server ecosystem, I would usually start with SQL Server Data Tools (SSDT). It is a very nice addon included for free with the various tools surrounding SQL Server and it supports my preferred database schema deployments approach – state based deployments. However, for this particular project I was dealing with PostgreSQL and since SSDT does not work with it I had to do some research to see what my options were.

Just a quick aside here – when it comes to deploying schema changes for databases with a strongly enforced schema (this usually means relational database systems) there are two main approaches: state based (the one supported by SSDT above) and migrations based. The focus of my current article is not to define these or to present their pros and cons but I want to include here some excellent reference articles that cover those points very well. It’s important to understand these two approaches because that will make it much easier to relate to the tools you have available to follow
your preferred approach.

So back to my search – I wanted to see what tools I could find for schema deployment automation with PostgreSQL. Since SSDT is a free tool on the SQL Server side I was looking for free / open source options that I could use. I wasn’t able to find any solid open source tools that support the state based approach that SSDT follows and I wasn’t too surpised about it. In the state based approach, the tool used has to do the heavy work: here’s my schema version A and I’d like to get to schema version B … please generate the SQL scripts for me that will allow me to get there. Such a tool would be fairly complex to write and if such open source options would exist they would probably be focused on a particular database engine (since the generated SQL would be engine specific).

Most of the articles I looked at mentioned migration based tools for managing PostgreSQL schema deployments – with two such tools listed in pretty much every article I looked at:

https://www.liquibase.org/ – Liquibase
- https://github.com/liquibase/liquibase
https://flywaydb.org/ – Flyway
- https://github.com/flyway/flyway

Some quick points of comparison between the two:

both are open source tools with some commercial extensions for advanced features
both are backed by commercial entities (Liquibase by Datical and Flyway by Redgate)
both are command-line tools that are Java-based
Flyway supports migration scripts in plain SQL format while Liquibase supports additional formats such as XML, YAML and JSON (this is useful if you need to migrate schema changes between database systems that don’t use the same SQL dialect)

In the end, for my PostgreSQL needs, I decided to work with Flyway. Why? Here are some quick reasons:

the GitHub project for Flyway is liked by more than twice the number of followers that Liquibase has (this has to mean something, right?)
Flyway appeared to me to be the less complex of the two. I didn’t need the advanced features of Liquibase and I was ok with writing migration scripts in SQL (which is what Flyway supports).
Flyway recently received commercial support from Redgate. I’m very familiar with Redgate’s tools for SQL Server and the fact that they got behind this particular open source project is a good
sign in my book.

After I got Flyway working on PostgreSQL on Linux I started thinking about how Flyway might work with SQL Server on Windows. Why would this matter and why is it a good thing to know?

I said above that for SQL Server I prefer state based migrations using SSDT – but not everybody does. There are many DBAs working with SQL Server who don’t want to trust a tool (SSDT) to write the migration scripts so for them it’s more natural to use a migrations approach to schema deployments.
It’s good to be familiar with a database deployment tool that can be used across multiple database engines and operating systems.
It forced me to understand the deployment workflow needed when migrations are used. It’s always good to see multiple points of view and recognize their pros and cons.

So – how do we use Flyway migration scripts with SQL Server on Windows? Let’s proceed.

Go to https://flywaydb.org/download/ and download Flyway for Windows. It comes as a zip file so extract the archive and simply add the new flyway-6.2.4 directory to the PATH to make the flyway command available from anywhere on your system. Flyway is distributed with its own JAVA runtime environment (JRE) so you should be able to just type flyway in a command prompt to confirm you’re ready to use it. By default, when called without any parameters, flyway will give you a description of its execution options.

Let’s look at the Flyway directory structure.

By default, there are 2 folders you want to pay attention to initially. conf is the folder where the Flyway configuration file exists by default and sql is the folder where Flyway will look for SQL migration scripts it needs to execute. Both of these can be modified via configuration parameters but for now we’ll just work with the defaults.

Let’s look at flyway.conf and see what options we need to modify.

flyway.url – the database location (in JDBC format)
flyway.user – user info to use for authentication
flyway.password – the user’s password
flyway.locations – file system locations for the SQL migration files (if you don’t want to use the default /sql folder)

The options above (and many others) are very nicely documented in the configuration file so it should give you a good idea of what’s available and how to configure them.

One word about flyway.url: that config option contains the database name to use for the migration scripts so the database has to exist already. Since creating the database is typically a one-time operation it should be handled outside of the process that delivers the Flyway migrations. In my case, I’m using the following value to point to a local SQL Server database:

flyway.url=jdbc:jtds:sqlserver://localhost:1433/FlywayDemo

Note: the above format is not exactly the one suggested in the config file for SQL Server. I couldn’t get that one to work so I searched for alternate versions and found the one above (thank you StackOverflow).

With the basic configuration options out of the way let’s look now at the Flyway commands that we can work with. There aren’t that many (I did mention that Flyway is simple to use, right?):

migrate – Migrates the database
clean – Drops all objects in the configured schemas
info – Prints the information about applied, current and pending migrations
validate – Validates the applied migrations against the ones on the classpath
undo – [pro] Undoes the most recently applied versioned migration
baseline – Baselines an existing database at the baselineVersion
repair – Repairs the schema history table

Even without knowing much about Flyway or SQL migrations you can probably deduce (by everything you’ve seen above) how Flyway actually works. You, the developer, provide SQL scripts that modify the database schema based on your needs. The only requirement is that you have to name the scripts in a certain way (with version numbers) and place them in the folders that Flyway looks at. When you ask Flyway to migrate a target database schema it will look at the migration scripts it has available, it will figure out the schema version for the target database (using its own Flyway version history table) and it will execute the scripts that are not yet on the target. That’s pretty much all there is to it.

Let’s see what happens when we execute flyway info without any existing migration scripts.

Flyway will attempt to connect to the target and show information about the target’s schema version compared to any migration scripts it finds. In our case we don’t yet have any migration scripts to execute.

Let’s create a script to add a table. By default (and this can be configured – of course) Flyway expects migration scripts to follow a certain naming pattern in order to be picked up.

I created a SQL migration script file called V1__add_new_table.sql with the following contents:

Here’s what happens when we call flyway info (to see if it detects the new migration file), flyway migrate (to apply the migration), and flyway info again to see the result.

What happened on the SQL Server side?

We see the new table we just created via the migration along with Flyway’s own table for keeping track of schema versions – flyway_schema_history. Let’s see what’s inside that table:

This is pretty much the same information returned by the flyway info command. We looked here at how to get started with Flyway and how to create a versioned migration (those that start with V by default). Versioned migrations, once applied, should never be modified again. Flyway also has the concept of repeatable migrations (those files start with R) – these are migrations that will be executed in the future as long as Flyway detects that the migration file has been modified. Repeatable migrations are useful for scripts that can be easily re-executed (such as those that recreate procedures / views / functions or reinsert bulk data).

One tip that I found useful – how do you execute different migration scripts in different environments? Let’s say that in development you want certain sample data to be present but you don’t want the same data to be found in production. The trick to do that is to put common migration scripts in a common folder (for example) and then have different per-environment folders. When you execute migrations against development you’ll modify the flyway.locations parameter to run scripts from common and the development environment folders and so on for the other environments.

How do you generate the migration scripts to run? Flyway will not help there. It’s up to you to either create them manually or rely on some other tool that can produce the schema diff scripts for you. Flyway will simply execute the scripts it finds in the SQL folders.

How do you use Flyway in a CI/CD pipeline? That depends on the process and tool you use for CI/CD. You could install the Flyway executable directly with your CI/CD environment and use it that way, or, the way I used it with GitLab, you could rely on the fact that Flyway is distributed as a Docker container that you can easily pull in, execute the commands you need, and then throw it away.

Happy SQL schema migrations!

How to parse HL7 2.x messages stored in SQL Server using Python

February 24, 2019 by Cristian Satnic

I’ve recently had a need to parse HL7 2.x messages stored in SQL Server. If you don’t know what HL7 2.x healthcare messages look like then here’s a quick sample:

MSH|^~\&|HIS|MedCenter|LIS|MedCenter|20060307110114||ORM^O01|MSGID20060307110114|P|2.3
PID|||12001||Jones^John^^^Mr.||19670824|M|||123 West St.^^Denver^CO^80020^USA|||||||
PV1||O|OP^PAREG^||||2342^Jones^Bob|||OP|||||||||2|||||||||||||||||||||||||20060307110111|
ORC|NW|20060307110114
OBR|1|20060307110114||003038^Urinalysis^L|||20060307110114

Each line in the HL7 message is called a segment and then each segment is split into individual fields by | (pipe) characters (typically). HL7 fields have well-defined names and meanings … for example in the example above PID-3 (the 3rd field in the PID segment where the identifier ‘PID’ is not counted) is 12001 and that represents the patient identifier.

For this particular project I’m working on we have HL7 messages stored in a SQL Server 2016 database table where each row in the table contains the raw HL7 2.x message in a particular column. I need to be able to intelligently filter over this HL7 data by looking at values in particular HL7 fields (as shown above). Since this HL7 data is stored in a varchar(MAX) column I could certainly attempt to play games using LIKE comparisons in SQL but that would not get me very far. SQL simply does not understand the complex structure of HL7 and I have no native SQL Server functions at my disposal that I could quickly use to parse this data and filter it.

I know I must get help from somewhere else. I’ve recently been experimenting with Python and with some quick Google searches I was convinced that Python had some interesting packages that could help with HL7 2.x parsing. Now – keep in mind that this is SQL Server 2016 so SQL Server Machine Learning Services is not available for this version … only for SQL Server 2017 and higher (this is the capability that would enable native execution of R and Python scripts directly in SQL Server T-SQL code). If I can’t run Python code natively in this version of SQL Server then I must execute this Python code from the outside.

I’ve also been looking at Jupyter Notebook recently – an awesome environment for interactive exploration of data sets using languages such as Python – so this was a good excuse to try to bring together all these technologies to enable me to look at HL7 data.

This is not a blog post about Jupyter but if you’d like to take a look at it the easiest way to do it is by installing the Anaconda Distribution which will bring together Jupyter, Python and a ton of other packages and frameworks that are super-useful for data analysis, data science and machine learning.

Once you have Jupyter installed this is the process we’ll follow to deal with HL7 data in SQL Server:

Connect in Python to SQL Server

import pandas as pd
import pyodbc

sql_conn = pyodbc.connect('DRIVER={ODBC Driver 13 for SQL Server};SERVER=localhost;DATABASE=Training;Trusted_Connection=yes')

query = "SELECT * FROM HL7Messages"

Bring data from the SQL Server table into a pandas data frame (pandas is a well known package in Python that makes data exploration really easy)

df = pd.read_sql(query, sql_conn)

Combine the pandas data frame functionality with HL7 parsing in Python in order to filter HL7 messages as needed

import hl7

for index, row in df.iterrows():
    h = hl7.parse(row['HL7Message'])
    print("Message type: {}, Patient ID: {}".format(h.segment('MSH')[9], h.segment('PID')[3]))

… which produces the following output using the sample data I was looking at:

Message type: ORM^O01, Patient ID: 12001
Message type: ORU^R01, Patient ID: 999999999

The HL7 parsing functionality in Python is provided by this package:

https://python-hl7.readthedocs.io/en/latest/

If you’d like to follow along you can look at my GitHub repo below where I’ve provided some sample HL7 2.x messages, a SQL script to create a table with that data and Jupyter notebook files to directly execute the relevant Python code:

https://github.com/csatnic/PythonHL7Parsing

Strategy and tips for performance troubleshooting in SQL Server

December 31, 2016 by Cristian Satnic

There was probably a time in its early days when SQL Server could be considered a simple database engine – we’d run queries with SQL and we were probably pleasantly surprised when stuff just worked. Much has changed since then. Over time SQL Server has evolved into a complex relational database management system (RDBMS). If there is some task that involves data manipulation then SQL Server most likely has a component to handle that – data queries with SQL, multi-dimensional analysis, statistical analysis with R, reporting, integration with other data sources, ETL and so on.

On one hand it is amazing that given all this complexity there are plenty of workloads that work just well enough on their own (no full-time DBA needed) if enough hardware resources are made available.

On the other hand, there are certainly times when performance troubleshooting is required – either because SQL Server’s response got worse over time or because it’s just not at the level where it needs to be to keep users happy.

With all these system components fighting for limited hardware and infrastructure resources how is a DBA supposed to troubleshoot performance problems? When angry users are on the phone or when your boss is looking over your shoulder demanding quick results – that is hardly the time to come up to speed on the finer points of waits statistics analysis.

Below is a general strategy with some specific tips and tools to guide you on the road to a better SQL Server performance.

1. Understand the environment and see how SQL Server is configured (the sanity check)

This first step is very important especially when you have to troubleshoot an environment you’re not familiar with (for example when you’re a performance consultant) or when you work at a company where many people can make changes to the SQL Server environment and you’re not really sure if what was true last month about the configuration is still true today. It doesn’t make any sense to spend time looking for bad queries only to discover eventually that somebody made changes to the SQL Server memory configuration, confused megabytes with kilobytes and now SQL Server is running with much less memory available for its own use.

Here are some scripts to help in this discovery phase:

Brent Ozar and company have an awesome set of scripts in their “First Responder Kit” that’s now available on GitHub:

https://github.com/BrentOzarULTD/SQL-Server-First-Responder-Kit

The main one that you want to get familiar with is sp_Blitz but certainly look over the other ones as well. sp_Blitz performs various checks in configuration settings and it will alert you when it discovers anything that’s not considered a best practice.

The other set of very useful diagnostic scripts are those provided by Glenn Berry:

http://www.sqlskills.com/blogs/glenn/category/dmv-queries/

He usually updates them every month and targets specific features in the various SQL Server versions available.

If all looks good with the environment and configuration then it’s time to move to the next step.

2. What is SQL Server doing right now?

There are probably many commercial tools that offer all sorts of monitoring capabilities but the one free tool that every SQL Server DBA should be familiar with is sp_WhoIsActive (this is my ‘don’t leave SQL Server without it’ tool):

http://whoisactive.com/

It is an awesome stored procedure by Adam Machanic that shows all sorts of relevant data about what exactly is going on at the transaction level when the stored procedure is executed. It has tons of flags and options – all meant to customize the output in order to facilitate a performance troubleshooting session.

Adam wrote a series of blog posts explaining all its various features – it’s well worth reading these in advance so you’re familiar with them before you need to use them under pressure:

http://sqlblog.com/blogs/adam_machanic/archive/tags/month+of+monitoring/default.aspx

It’s possible that when you’re looking strictly at what’s going on in the present you may not be able to determine why SQL Server is hurting. In that case you’ll probably need to go to the next step:

3. What is SQL Server waiting on during a certain time interval? (wait statistics analysis)

SQL Server is a complex system but it’s also a pretty good patient – it keeps detailed track of many vital statistics but you need to know where to look for that data. SQL queries, until they finish their execution, go through many different kinds of waits: waiting for CPU resources, waiting for data to be moved to memory from disk, waiting for data that other queries have locks on and so on. An analysis of all these waits over time can offer critical insight into areas where SQL Server is spending time waiting instead of getting those queries processed. Below are some resources that should help you with wait stats analysis:

http://www.sqlskills.com/blogs/paul/wait-statistics-or-please-tell-me-where-it-hurts/ – Wait statistics, or please tell me where it hurts (this is the classic Paul Randal article on the topic of wait stats)

Be sure to check out his other articles on this topic:

http://www.sqlskills.com/blogs/paul/category/wait-stats/

Paul Randal recently created an online waits library documenting the various wait types seen in SQL Server:

https://www.sqlskills.com/help/waits/

Brent Ozar also has a very good article & script on wait stats as part of the “First Responder Kit”:

https://www.brentozar.com/responder/triage-wait-stats-in-sql-server/

4. Query Store

If you’re fortunate enough to work with SQL Server 2016 or Azure SQL Database then you certainly need to become familiar with the Query Store:

https://msdn.microsoft.com/en-us/library/dn817826.aspx – Monitoring Performance By Using the Query Store

This is SQL Server’s new black box. It’s what will enable you to understand query performance problems in the recent past when you were not directly monitoring SQL Server. And certainly make sure Query Store is enabled and recording – otherwise it won’t do you much good 😉

Additional resources

https://www.youtube.com/watch?v=rf17jQRcfjI – The Server is Down What will you do

https://www.red-gate.com/library/troubleshooting-sql-server-a-guide-for-accidental-dbas – Troubleshooting SQL Server: A Guide for Accidental DBAs (free ebook)

https://www.red-gate.com/library/performance-tuning-with-sql-server-dynamic-management-views – Performance Tuning With SQL Server Dynamic Management Views (free ebook)

It’s time to seriously consider Identity as a Service (IDaaS) solutions (such as Azure AD B2C) for user authentication

October 10, 2016 by Cristian Satnic

If you watched the news in 2016 alone it would be pretty clear that many organizations are doing a very poor job in protecting the user credentials that are in their care. Even more, when organizations somehow lose credentials for hundreds of millions of users (looking at you Yahoo) and users are not fully up-in-arms about it we know we’re in a bad place – the place where users are so used to hearing bad news related to security or privacy that they almost don’t care anymore.

We can do better than this. The goal of this post is to present arguments on why authentication mechanisms in many organizations are failing and show why Identity as a Service (IDaaS) solutions such as Azure Active Directory B2C (business-to-consumer) are the future.

According to a Gartner report from June 2016 (see link below in the resources section) by 2020, 40% of identity and access management (IAM) purchases will use the identity and access management as a service (IDaaS) delivery model — up from less than 20% in 2016. If you’re planning to build any new consumer facing applications in the near future or if the security of your existing application credentials is keeping you up at night then you should seriously start to look at IDaaS solutions.

Challenges with traditional consumer identity and access management systems

Security & privacy risks

The username/password list is a target for attackers – attackers look for easy targets knowing consumers often reuse login credentials across accounts (the credentials list is often more important than the data it protects)
Developers often don’t really understand well how to properly secure passwords with custom solutions
Attacks are getting more sophisticated / threats are constantly evolving

High TCO (total cost of ownership)

Development time & costs – lots of code to write for identity management functions (sign-up/sign-in, email verification, password resets, MFA, user experience UI/UX)
Software licensing, maintenance and upgrade costs (when using off-the-shelf software for identity management features)
Identity management functionality is a moving target – for example in the Microsoft .NET world built-in identity management approaches change every 1-2 years (ASP.NET Membership, ASP.NET Identity) leaving behind fragmented applications that are hard to maintain

Scalability and availability challenges

Consumer traffic is highly seasonal
Organizations are forced to provision for peak capacity
With millions of users this can be very costly

What is Identity as a Service (IDaaS) and where does it fit in?

Cloud-based service that provides a set of identity and access management functions
An authentication infrastructure that is built, hosted and managed by a third-party service provider
This is in contrast to traditional identity and access management (IAM) solutions that are typically completely on-premises and delivered via bundled software and/or hardware means
According to Gartner, IDaaS functionality includes:

Identity governance and administration (“IGA”) — this includes the ability to provision identities held by the service to target applications
Access — this includes user authentication, single sign-on (SSO), and authorization enforcement
Intelligence — this includes logging events and providing reporting that can answer questions such as “who accessed what, and when?”

IDaaS sounds interesting … why consider Azure Active Directory B2C?

Yes – there are quite a few IDaaS solution providers out there – so why am I advocating for Azure AD B2C? To answer that I’ll just point to a recent June 2016 study (see link below) where Gartner analyzed the IDaaS space:

Gartner 2016 Magic Quadrant for Identity and Access Management as a Service

Microsoft with its IDaaS offerings is currently in the leader quadrant. The success of its IDaaS solution (Azure Active Directory) is very closely tied to the success and growth of Microsoft Azure – its cloud solution – and by all indications it has a strong future ahead of it.

What exactly is Azure Active Directory B2C?

Cloud identity service with support for social accounts and app-specific (local) accounts
For enterprises and ISVs building consumer facing web, mobile & native apps
Builds on Azure Active Directory – a global identity service serving hundreds of millions of users and billions of sign-ins per day (same directory system used by Microsoft online properties – Office 365, XBox Live and so on)
Worldwide, highly-available, geo-redundant service – globally distributed directory across all of Microsoft Azure’s datacenters

How is it better than a custom authentication solution?

Easy to integrate consumer self-service capabilities (sign-up, password resets)
Site owner controls user experience (custom html & css for sign-in/sign-up)
Enterprise-grade security with continuously evolving anomalous activity, anti-fraud and account compromise detection systems (offload security to the real domain experts)
Benefits of security-at-scale – uses machine learning to watch billions of authentications per day across the entire Azure AD ecosystem and detect unusual behavior
Superior economics compared to on-premises – pay-as-you-go pricing + free tier
Based on open protocols and open standards – OAuth 2.0, OpenID Connect
Uses open source libraries for .NET, Node.js, iOS, Android and others / REST-based Graph API for management
Better and faster development experience for authentication / easy to integrate with existing sites wherever they’re running from (not just those in Azure)
Ability to easily integrate social logins if needed (Facebook, Google and such)
Support for MFA (multi-factor authentication)
Authentication database is separate from the application data / easier to enable SSO (single sign on) later across other apps in the enterprise (unified view of the consumer across apps)

Sounds interesting – tell me more: who is using Azure AD B2C?

Azure AD B2C is a natural choice for consumer facing apps hosted on Azure (but certainly not only for those)
Some stats (from Microsoft presentations as recent as September 2016) on Microsoft Azure AD (the technology that B2C is built on)

90% of Fortune 500 companies use Microsoft Cloud
More than 10 million Azure AD directories
More than 750 million user accounts in Azure AD
More than 110K third-party applications that use Azure AD each month
More than 1.3 billion authentications every day with Azure AD

The state of Indiana used Azure AD B2C for user authentication in order to integrate various features into a single citizen portal
- CASE STUDY! Creating a Citizen Identity for Indiana State with Azure B2C – if B2C can pass state regulations and requirements for security, availability and such that says something
Real Madrid (one of the most popular soccer clubs on the planet) uses Azure AD B2C to offer authentication services for their mobile app used by more than 450 million fans (Microsoft case study)

What about security for all this authentication data in the cloud?

To answer the topic of security for authentication data when IDaaS solutions are used I’ll just include a quote from the Gartner report I mentioned above – I totally agree with their assessment:

No security is perfect. Ultimately, prospective customers must decide whether vendors’ stated control sets are sufficient for their needs. IDaaS vendors give significant attention to ensure the security of their platforms. Based on the number of enterprise security breaches that have been made public, and the lack of any such breaches for IDaaS providers, Gartner believes that IDaaS vendors are more likely to provide better security for IAM services than their customers could provide for themselves.

Additional Resources

Modernize your app’s consumer identity management with Azure AD B2C – Azure AD B2C presentation from Microsoft Ignite 2016 (presentation slides)
Gartner 2016 Magic Quadrant for Identity and Access Management as a Service
7 laws of identity whitepaper – a classic resource from Kim Cameron, Architect of Identity at Microsoft