How to parse HL7 2.x messages stored in SQL Server using Python

jupyter

I’ve recently had a need to parse HL7 2.x messages stored in SQL Server. If you don’t know what HL7 2.x healthcare messages look like then here’s a quick sample:

Each line in the HL7 message is called a segment and then each segment is split into individual fields by | (pipe) characters (typically). HL7 fields have well-defined names and meanings … for example in the example above PID-3 (the 3rd field in the PID segment where the identifier ‘PID’ is not counted) is 12001 and that represents the patient identifier.

For this particular project I’m working on we have HL7 messages stored in a SQL Server 2016 database table where each row in the table contains the raw HL7 2.x message in a particular column. I need to be able to intelligently filter over this HL7 data by looking at values in particular HL7 fields (as shown above). Since this HL7 data is stored in a varchar(MAX) column I could certainly attempt to play games using LIKE comparisons in SQL but that would not get me very far. SQL simply does not understand the complex structure of HL7 and I have no native SQL Server functions at my disposal that I could quickly use to parse this data and filter it.

I know I must get help from somewhere else. I’ve recently been experimenting with Python and with some quick Google searches I was convinced that Python had some interesting packages that could help with HL7 2.x parsing. Now – keep in mind that this is SQL Server 2016 so SQL Server Machine Learning Services is not available for this version … only for SQL Server 2017 and higher (this is the capability that would enable native execution of R and Python scripts directly in SQL Server T-SQL code). If I can’t run Python code natively in this version of SQL Server then I must execute this Python code from the outside.

I’ve also been looking at Jupyter Notebook recently – an awesome environment for interactive exploration of data sets using languages such as Python – so this was a good excuse to try to bring together all these technologies to enable me to look at HL7 data.

This is not a blog post about Jupyter but if you’d like to take a look at it the easiest way to do it is by installing the Anaconda Distribution which will bring together Jupyter, Python and a ton of other packages and frameworks that are super-useful for data analysis, data science and machine learning.

Once you have Jupyter installed this is the process we’ll follow to deal with HL7 data in SQL Server:

  • Connect in Python to SQL Server

 

 

  • Bring data from the SQL Server table into a pandas data frame (pandas is a well known package in Python that makes data exploration really easy)

  •  Combine the pandas data frame functionality with HL7 parsing in Python in order to filter HL7 messages as needed

… which produces the following output using the sample data I was looking at:

The HL7 parsing functionality in Python is provided by this package:

https://python-hl7.readthedocs.io/en/latest/

If you’d like to follow along you can look at my GitHub repo below where I’ve provided some sample HL7 2.x messages, a SQL script to create a table with that data and Jupyter notebook files to directly execute the relevant Python code:

https://github.com/csatnic/PythonHL7Parsing

 

 

 

Strategy and tips for performance troubleshooting in SQL Server

There was probably a time in its early days when SQL Server could be considered a simple database engine – we’d run queries with SQL and we were probably pleasantly surprised when stuff just worked. Much has changed since then. Over time SQL Server has evolved into a complex relational database management system (RDBMS). If there is some task that involves data manipulation then SQL Server most likely has a component to handle that – data queries with SQL, multi-dimensional analysis, statistical analysis with R, reporting, integration with other data sources, ETL and so on.

On one hand it is amazing that given all this complexity there are plenty of workloads that work just well enough on their own (no full-time DBA needed) if enough hardware resources are made available.

On the other hand, there are certainly times when performance troubleshooting is required – either because SQL Server’s response got worse over time or because it’s just not at the level where it needs to be to keep users happy.

With all these system components fighting for limited hardware and infrastructure resources how is a DBA supposed to troubleshoot performance problems? When angry users are on the phone or when your boss is looking over your shoulder demanding quick results – that is hardly the time to come up to speed on the finer points of waits statistics analysis.

Below is a general strategy with some specific tips and tools to guide you on the road to a better SQL Server performance.

1. Understand the environment and see how SQL Server is configured (the sanity check)

This first step is very important especially when you have to troubleshoot an environment you’re not familiar with (for example when you’re a performance consultant) or when you work at a company where many people can make changes to the SQL Server environment and you’re not really sure if what was true last month about the configuration is still true today. It doesn’t make any sense to spend time looking for bad queries only to discover eventually that somebody made changes to the SQL Server memory configuration, confused megabytes with kilobytes and now SQL Server is running with much less memory available for its own use.

Here are some scripts to help in this discovery phase:

Brent Ozar and company have an awesome set of scripts in their “First Responder Kit” that’s now available on GitHub:

https://github.com/BrentOzarULTD/SQL-Server-First-Responder-Kit

The main one that you want to get familiar with is sp_Blitz but certainly look over the other ones as well. sp_Blitz performs various checks in configuration settings and it will alert you when it discovers anything that’s not considered a best practice.

The other set of very useful diagnostic scripts are those provided by Glenn Berry:

http://www.sqlskills.com/blogs/glenn/category/dmv-queries/

He usually updates them every month and targets specific features in the various SQL Server versions available.

If all looks good with the environment and configuration then it’s time to move to the next step.

2. What is SQL Server doing right now?

There are probably many commercial tools that offer all sorts of monitoring capabilities but the one free tool that every SQL Server DBA should be familiar with is sp_WhoIsActive (this is my ‘don’t leave SQL Server without it’ tool):

http://whoisactive.com/

It is an awesome stored procedure by Adam Machanic that shows all sorts of relevant data about what exactly is going on at the transaction level when the stored procedure is executed. It has tons of flags and options – all meant to customize the output in order to facilitate a performance troubleshooting session.

Adam wrote a series of blog posts explaining all its various features – it’s well worth reading these in advance so you’re familiar with them before you need to use them under pressure:

http://sqlblog.com/blogs/adam_machanic/archive/tags/month+of+monitoring/default.aspx

It’s possible that when you’re looking strictly at what’s going on in the present you may not be able to determine why SQL Server is hurting. In that case you’ll probably need to go to the next step:

3. What is SQL Server waiting on during a certain time interval? (wait statistics analysis)

SQL Server is a complex system but it’s also a pretty good patient – it keeps detailed track of many vital statistics but you need to know where to look for that data. SQL queries, until they finish their execution, go through many different kinds of waits: waiting for CPU resources, waiting for data to be moved to memory from disk, waiting for data that other queries have locks on and so on. An analysis of all these waits over time can offer critical insight into areas where SQL Server is spending time waiting instead of getting those queries processed. Below are some resources that should help you with wait stats analysis:

http://www.sqlskills.com/blogs/paul/wait-statistics-or-please-tell-me-where-it-hurts/ – Wait statistics, or please tell me where it hurts (this is the classic Paul Randal article on the topic of wait stats)

Be sure to check out his other articles on this topic:

http://www.sqlskills.com/blogs/paul/category/wait-stats/

Paul Randal recently created an online waits library documenting the various wait types seen in SQL Server:

https://www.sqlskills.com/help/waits/

Brent Ozar also has a very good article & script on wait stats as part of the “First Responder Kit”:

https://www.brentozar.com/responder/triage-wait-stats-in-sql-server/

4. Query Store

If you’re fortunate enough to work with SQL Server 2016 or Azure SQL Database then you certainly need to become familiar with the Query Store:

https://msdn.microsoft.com/en-us/library/dn817826.aspx – Monitoring Performance By Using the Query Store

This is SQL Server’s new black box. It’s what will enable you to understand query performance problems in the recent past when you were not directly monitoring SQL Server. And certainly make sure Query Store is enabled and recording – otherwise it won’t do you much good ;)

Additional resources

https://www.youtube.com/watch?v=rf17jQRcfjI – The Server is Down What will you do

https://www.red-gate.com/library/troubleshooting-sql-server-a-guide-for-accidental-dbas – Troubleshooting SQL Server: A Guide for Accidental DBAs (free ebook)

https://www.red-gate.com/library/performance-tuning-with-sql-server-dynamic-management-views – Performance Tuning With SQL Server Dynamic Management Views (free ebook)

It’s time to seriously consider Identity as a Service (IDaaS) solutions (such as Azure AD B2C) for user authentication

If you watched the news in 2016 alone it would be pretty clear that many organizations are doing a very poor job in protecting the user credentials that are in their care. Even more, when organizations somehow lose credentials for hundreds of millions of users (looking at you Yahoo) and users are not fully up-in-arms about it we know we’re in a bad place – the place where users are so used to hearing bad news related to security or privacy that they almost don’t care anymore.

We can do better than this. The goal of this post is to present arguments on why authentication mechanisms in many organizations are failing and show why Identity as a Service (IDaaS) solutions such as Azure Active Directory B2C (business-to-consumer) are the future.

b2c-auth-options

According to a Gartner report from June 2016 (see link below in the resources section) by 2020, 40% of identity and access management (IAM) purchases will use the identity and access management as a service (IDaaS) delivery model — up from less than 20% in 2016. If you’re planning to build any new consumer facing applications in the near future or if the security of your existing application credentials is keeping you up at night then you should seriously start to look at IDaaS solutions.

Challenges with traditional consumer identity and access management systems

Security & privacy risks

  • The username/password list is a target for attackers – attackers look for easy targets knowing consumers often reuse login credentials across accounts (the credentials list is often more important than the data it protects)
  • Developers often don’t really understand well how to properly secure passwords with custom solutions
  • Attacks are getting more sophisticated / threats are constantly evolving

High TCO (total cost of ownership)

  • Development time & costs – lots of code to write for identity management functions (sign-up/sign-in, email verification, password resets, MFA, user experience UI/UX)
  • Software licensing, maintenance and upgrade costs (when using off-the-shelf software for identity management features)
  • Identity management functionality is a moving target – for example in the Microsoft .NET world built-in identity management approaches change every 1-2 years (ASP.NET Membership, ASP.NET Identity) leaving behind fragmented applications that are hard to maintain

Scalability and availability challenges

  • Consumer traffic is highly seasonal
  • Organizations are forced to provision for peak capacity
  • With millions of users this can be very costly

What is Identity as a Service (IDaaS) and where does it fit in?

  • Cloud-based service that provides a set of identity and access management functions
  • An authentication infrastructure that is built, hosted and managed by a third-party service provider
  • This is in contrast to traditional identity and access management (IAM) solutions that are typically completely on-premises and delivered via bundled software and/or hardware means
  • According to Gartner, IDaaS functionality includes:
    • Identity governance and administration (“IGA”) — this includes the ability to provision identities held by the service to target applications
    • Access — this includes user authentication, single sign-on (SSO), and authorization enforcement
    • Intelligence — this includes logging events and providing reporting that can answer questions such as “who accessed what, and when?”

IDaaS sounds interesting … why consider Azure Active Directory B2C?

Yes – there are quite a few IDaaS solution providers out there – so why am I advocating for Azure AD B2C? To answer that I’ll just point to a recent June 2016 study (see link below) where Gartner analyzed the IDaaS space:

Gartner 2016 Magic Quadrant for Identity and Access Management as a Service

Microsoft with its IDaaS offerings is currently in the leader quadrant. The success of its IDaaS solution (Azure Active Directory) is very closely tied to the success and growth of Microsoft Azure – its cloud solution – and by all indications it has a strong future ahead of it.

What exactly is Azure Active Directory B2C?

  • Cloud identity service with support for social accounts and app-specific (local) accounts
  • For enterprises and ISVs building consumer facing web, mobile & native apps
  • Builds on Azure Active Directory – a global identity service serving hundreds of millions of users and billions of sign-ins per day (same directory system used by Microsoft online properties – Office 365, XBox Live and so on)
  • Worldwide, highly-available, geo-redundant service – globally distributed directory across all of Microsoft Azure’s datacenters

How is it better than a custom authentication solution?

  • Easy to integrate consumer self-service capabilities (sign-up, password resets)
  • Site owner controls user experience (custom html & css for sign-in/sign-up)
  • Enterprise-grade security with continuously evolving anomalous activity, anti-fraud and account compromise detection systems (offload security to the real domain experts)
  • Benefits of security-at-scale – uses machine learning to watch billions of authentications per day across the entire Azure AD ecosystem and detect unusual behavior
  • Superior economics compared to on-premises – pay-as-you-go pricing + free tier
  • Based on open protocols and open standards – OAuth 2.0, OpenID Connect
  • Uses open source libraries for .NET, Node.js, iOS, Android and others / REST-based Graph API for management
  • Better and faster development experience for authentication / easy to integrate with existing sites wherever they’re running from (not just those in Azure)
  • Ability to easily integrate social logins if needed (Facebook, Google and such)
  • Support for MFA (multi-factor authentication)
  • Authentication database is separate from the application data / easier to enable SSO (single sign on) later across other apps in the enterprise (unified view of the consumer across apps)

Sounds interesting – tell me more: who is using Azure AD B2C?

  • Azure AD B2C is a natural choice for consumer facing apps hosted on Azure (but certainly not only for those)
  • Some stats (from Microsoft presentations as recent as September 2016) on Microsoft Azure AD (the technology that B2C is built on)
    • 90% of Fortune 500 companies use Microsoft Cloud
    • More than 10 million Azure AD directories
    • More than 750 million user accounts in Azure AD
    • More than 110K third-party applications that use Azure AD each month
    • More than 1.3 billion authentications every day with Azure AD
  • The state of Indiana used Azure AD B2C for user authentication in order to integrate various features into a single citizen portal
  • Real Madrid (one of the most popular soccer clubs on the planet) uses Azure AD B2C to offer authentication services for their mobile app used by more than 450 million fans (Microsoft case study)

What about security for all this authentication data in the cloud?

To answer the topic of security for authentication data when IDaaS solutions are used I’ll just include a quote from the Gartner report I mentioned above – I totally agree with their assessment:

No security is perfect. Ultimately, prospective customers must decide whether vendors’ stated control sets are sufficient for their needs. IDaaS vendors give significant attention to ensure the security of their platforms. Based on the number of enterprise security breaches that have been made public, and the lack of any such breaches for IDaaS providers, Gartner believes that IDaaS vendors are more likely to provide better security for IAM services than their customers could provide for themselves.

Additional Resources

Amazon, microservices and the birth of AWS cloud computing

I started doing some research on microservices and came across this really interesting video from about 5 years ago where Werner Vogels, Amazon’s CTO, talks about how (and why) Amazon switched to a microservices architecture. It’s a really interesting presentation that explains the challenges that amazon.com was facing in its early years and how internal solutions to those early problems were the basis for AWS cloud computing later.

Werner Vogels – Amazon and the Lean Cloud

It’s a relatively short presentation – about 30 minutes – but it’s full of interesting details about those ‘early days’ of cloud computing. Here are some highlights:

  • In the early 2000’s Amazon’s main e-commerce site – amazon.com – was facing some technical challenges. Its architecture at that time was typical of the web applications we still build today – a single monolith application code base, a common technology stack in all web areas, with massive relational databases on the backend. What were some of the problems they were having in those early days? Code compiles and deployments were taking too long. The backend databases were massive and hard to manage. Bottlenecks existed everywhere – it was getting harder and harder to make progress, release new features and keep up with growth.
  • Amazon’s technical architects analyzed the problem and realized that the path they were on would not take them far in the future. The decision was made to move towards a microservices architecture (they didn’t call it that back then but that’s what they were basically building). The idea with microservices was that every little feature and capability for the retail site would be provided by a mini-service that would interact with other services through well-defined interfaces. This is the path that amazon.com went on for the next few years. According to Werner the current homepage for amazon.com is put together by a few hundred such microservices.
  • It’s hard to believe that such an architecture could actually work at the scale that amazon.com needed – it sounds like the perfect recipe for chaos. Specific changes were needed to how Amazon’s internal teams worked in order to make it work. The idea of “two-pizza teams” was at the core – a team supporting a particular microservice should not be bigger than the number of developers who could eat two pizzas. This usually meant no more than 10 technical folks to such a team – a perfect number for a team that could do work without needing complex meetings to bring everybody up-to-date on progress. Teams chose the technology stack they would use for a particular microservice. Another critical concept was the idea of “you build it, you run it”. These small teams were in charge with development and operations for their service (they were doing devops before it was actually cool). Amazon now had hundreds of such teams working on the amazon.com site.
  • Things were going well initially but they realized after a while that the rate of progress and productivity was slowing down. A more careful analysis of the situation showed that these teams were now spending close to 70% of their time doing operations work – making sure that their services would be operational according to the standards for high availability required for amazon.com. Engineers were solving the same problems over and over on their own because they had no common internal infrastructure resources they could use.
  • This is when the idea of infrastructure on demand started to come up – the beginning of the AWS cloud operations. First, object storage (S3) … then compute (EC2) and on they went from there. Somehow along the way these internal elastic ‘cloud’ capabilities were exposed to external customers and the rest is history.

It’s indeed a fascinating inside look at how the AWS cloud was born. If you’ve wondered how come Amazon, an online book retailer, ended up being a cloud computing powerhouse then this video will give you some of the answers.