Python

I’ve recently had a need to parse HL7 2.x messages stored in SQL Server. If you don’t know what HL7 2.x healthcare messages look like then here’s a quick sample:

MSH|^~\&|HIS|MedCenter|LIS|MedCenter|20060307110114||ORM^O01|MSGID20060307110114|P|2.3
PID|||12001||Jones^John^^^Mr.||19670824|M|||123 West St.^^Denver^CO^80020^USA|||||||
PV1||O|OP^PAREG^||||2342^Jones^Bob|||OP|||||||||2|||||||||||||||||||||||||20060307110111|
ORC|NW|20060307110114
OBR|1|20060307110114||003038^Urinalysis^L|||20060307110114

Each line in the HL7 message is called a segment and then each segment is split into individual fields by | (pipe) characters (typically). HL7 fields have well-defined names and meanings … for example in the example above PID-3 (the 3rd field in the PID segment where the identifier ‘PID’ is not counted) is 12001 and that represents the patient identifier.

For this particular project I’m working on we have HL7 messages stored in a SQL Server 2016 database table where each row in the table contains the raw HL7 2.x message in a particular column. I need to be able to intelligently filter over this HL7 data by looking at values in particular HL7 fields (as shown above). Since this HL7 data is stored in a varchar(MAX) column I could certainly attempt to play games using LIKE comparisons in SQL but that would not get me very far. SQL simply does not understand the complex structure of HL7 and I have no native SQL Server functions at my disposal that I could quickly use to parse this data and filter it.

I know I must get help from somewhere else. I’ve recently been experimenting with Python and with some quick Google searches I was convinced that Python had some interesting packages that could help with HL7 2.x parsing. Now – keep in mind that this is SQL Server 2016 so SQL Server Machine Learning Services is not available for this version … only for SQL Server 2017 and higher (this is the capability that would enable native execution of R and Python scripts directly in SQL Server T-SQL code). If I can’t run Python code natively in this version of SQL Server then I must execute this Python code from the outside.

I’ve also been looking at Jupyter Notebook recently – an awesome environment for interactive exploration of data sets using languages such as Python – so this was a good excuse to try to bring together all these technologies to enable me to look at HL7 data.

This is not a blog post about Jupyter but if you’d like to take a look at it the easiest way to do it is by installing the Anaconda Distribution which will bring together Jupyter, Python and a ton of other packages and frameworks that are super-useful for data analysis, data science and machine learning.

Once you have Jupyter installed this is the process we’ll follow to deal with HL7 data in SQL Server:

Connect in Python to SQL Server

import pandas as pd
import pyodbc

sql_conn = pyodbc.connect('DRIVER={ODBC Driver 13 for SQL Server};SERVER=localhost;DATABASE=Training;Trusted_Connection=yes')

query = "SELECT * FROM HL7Messages"

Bring data from the SQL Server table into a pandas data frame (pandas is a well known package in Python that makes data exploration really easy)

df = pd.read_sql(query, sql_conn)

Combine the pandas data frame functionality with HL7 parsing in Python in order to filter HL7 messages as needed

import hl7

for index, row in df.iterrows():
    h = hl7.parse(row['HL7Message'])
    print("Message type: {}, Patient ID: {}".format(h.segment('MSH')[9], h.segment('PID')[3]))

… which produces the following output using the sample data I was looking at:

Message type: ORM^O01, Patient ID: 12001
Message type: ORU^R01, Patient ID: 999999999

The HL7 parsing functionality in Python is provided by this package:

https://python-hl7.readthedocs.io/en/latest/

If you’d like to follow along you can look at my GitHub repo below where I’ve provided some sample HL7 2.x messages, a SQL script to create a table with that data and Jupyter notebook files to directly execute the relevant Python code:

https://github.com/csatnic/PythonHL7Parsing

Over the last few days I was working through some AWS security best practices as outlined in a video presentation from AWS re:Invent 2013 called “Intrusion Detection in the Cloud (SEC402)” (video link – PDF presentation link).

One of the interesting ideas that the authors were promoting was this particular Python script (script link – JSON template for IAM policy) which would use the AWS Python SDK to obtain textual descriptions of various critical security settings that would be part of an AWS account (such as IAM users, groups and policies, S3 bucket policies, EC2 security groups and so on). These settings could all be exported and saved in a text file, the script could be scheduled to run at various intervals and alerts could be raised each time differences in the file would be detected.

I’m currently in the middle of setting up the AWS environment for a client project so I liked the idea of having a script such as the one mentioned above which I could use to script various critical settings and monitor how they change over time. Besides … Python is not really a language that I use often (though I do plan to become more familiar with it) so I welcomed the chance to try something new – how hard could it really be? 😉

I initially proceeded to download and install the latest version of Python – 3.4.2. It didn’t take too long to realize that the script authors actually wrote it for the Python 2.x branch. I was getting all sorts of errors trying to run it under 3.4.2 – as soon as I managed to get one of them fixed, another one came up. I didn’t really care much which version of Python I’d use so since 3.4.2 was giving me enough trouble I decided to switch to the 2.x branch and installed 2.7.9.

Success … or so it seemed. The script finally started running until half-way through its execution it died with this strange looking error while trying to connect to some AWS API endpoint:

ERROR: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:581)

I was pretty sure that Python was not happy about some discrepancy it found in a HTTPS certificate that it received while trying to make a HTTPS call to AWS. The problem was that the HTTPS site it was reaching out to was controlled by AWS and used deep in the AWS Python SDK code – not a resource for which I could do anything about the configuration of its HTTPS certificate.

So what exactly was Python complaining about? After some research I came across this link:

https://www.python.org/dev/peps/pep-0476/

It appears that a fairly recent change was made in the 2.7.x branch in regards to the default behavior for certificate verification. I ended up using the following monkey patch to globally disable verification for all HTTPS calls made by the script:

import ssl

ssl._create_default_https_context = ssl._create_unverified_context

The script finally ran successfully to the end and I had my scripted security settings. All in all it was an interesting first adventure in the land of Python.

How to parse HL7 2.x messages stored in SQL Server using Python

HTTPS certificate verification problem in Python 2.7.9