Overview

We recently worked with a client on their cloud migration, exploring options to migrate data from Microsoft SQL Server to PostgreSQL.

Data migration for existing applications is one of the key aspects to consider as part of a cloud migration. Since we want to minimize outages and the risk of data loss or corruption, it’s best to reduce manual activity (and therefore human error) by automating the data migration process as much as possible. Once the process is automated, it can easily be applied to different database instances across the organization.

Key Considerations

  • In this post we’ll explore the database migration steps using the open-source sqlserver2pgsql tool. The open-source pgloader could be another option; unfortunately, it seems to have an issue connecting to Azure SQL, where the client’s Microsoft SQL Server instances were running, so we did not get a chance to evaluate it for this use case.

  • There are also other tools on the market that can migrate the data but not the schema; they would still rely on a tool like sqlserver2pgsql for that part. Please note that sqlserver2pgsql does not migrate any stored procedures.

  • Also, sqlserver2pgsql may be great for one-off database migrations, but it’s not suitable for use cases that require continuous data synchronization between the source and target databases for an extended period of time. For those scenarios, consider using tools like Debezium, Kafka Connectors, etc.

  • When migrating a database, make sure to also consider the source code changes needed to support the migration. While ORM tools such as Hibernate for Java may make it a smooth transition, in some cases the database migration may require a substantial amount of code rewrite and testing effort.

With the goal of automating the process as much as possible, this post will describe the steps needed to create a Docker image to run the data migration using sqlserver2pgsql.

First, we’ll take a look at the different tools and services used in the database migration process in this post and then go into the weeds of the actual migration. You’ll also find a link to the source code at the end of this post.

Table of Contents

Overview
Key Considerations
Table of Contents
Prerequisites
What is Microsoft SQL Server / Azure SQL?
What is PostgreSQL?
What is sqlserver2pgsql?
What is Kettle?
What is Docker?
Overview of the Migration Process
Building the Docker image
The Dockerfile
Migration script
Create container image
Export schema using MS SQL Server Management Studio
Download certs for PostgreSQL
Running the database migration
Verifying the database migration
Conclusion

Prerequisites

Since the database migration process will be migrating live data between the source and target databases, we’ll need to make sure that our Docker container will have network connectivity to both databases.
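
For example, once the container described later in this post is up, a quick check with netcat (which we install in the image) can confirm that both databases are reachable; the SRC_*/DST_* values below are the environment variables we’ll define in the Dockerfile:

# quick reachability check for the source and target databases (run inside the container)
nc -vz "$SRC_HOST" "$SRC_PORT"   # source: Microsoft SQL Server / Azure SQL (default port 1433)
nc -vz "$DST_HOST" "$DST_PORT"   # target: PostgreSQL (default port 5432)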

What is Microsoft SQL Server / Azure SQL?

Microsoft SQL Server is a relational database management system from Microsoft, and Azure SQL Database is a flavor of SQL Server optimized for the cloud. In this post, Microsoft SQL Server will serve as the source database we’ll be migrating data from.

Figure 1: Logo for Microsoft SQL Server

Since our client was using Azure SQL Database, we’ll also include some details specific to that in this post.

What is PostgreSQL?

PostgreSQL is an open source object-relational database system, and it’s available as a managed relational database offering on the top 3 cloud providers (AWS, Azure and Google Cloud). We’ll be using PostgreSQL as the target database we’ll be migrating the data to.

Figure 2: Logo for PostgreSQL

What is sqlserver2pgsql?

Sqlserver2pgsql is an open source (GPL v3) tool specifically designed to migrate data from Microsoft SQL Server to PostgreSQL. It’s written in Perl and it can convert a Microsoft SQL Server schema to a PostgreSQL schema and also migrate data between these two databases using Pentaho Data Integrator (Kettle).

According to the author of sqlserver2pgsql, the tool should support most versions of Microsoft SQL Server.

What is Kettle?

Pentaho Data Integrator (aka Kettle) is an open source (GPL v2) Extract-Transform-Load (ETL) tool made by Pentaho. It’s built on Java and uses JDBC to connect to the source and target databases.

Sqlserver2pgsql generates a Kettle job file, which we’ll pass to Kettle to migrate the data from Microsoft SQL Server to PostgreSQL.

What is Docker?

Docker is “a set of platform as a service products that use OS-level virtualization to deliver software in packages called containers” (Wikipedia).

Using Docker containers and Docker’s build-once, run-anywhere model, we can easily run applications like sqlserver2pgsql both locally and in the cloud. This makes it an ideal tool for creating a cloud-agnostic database migration process that can be used in most IT infrastructures.

Figure 3: Logo for Docker

Overview of the Migration Process

Figure 4: Flow chart of the Migration Process

The above flow chart highlights the main steps needed to migrate both schema and data from Microsoft SQL Server to PostgreSQL. Each step will be discussed in detail below when covering the steps to build and run our custom Docker container.

Building the Docker image

As discussed above, using a Docker container for our database migration process will make it portable and cloud-agnostic.

In the next steps we’ll build a Docker image with the necessary dependencies installed (sqlserver2pgsql, Kettle, the JDBC driver, etc.) and include a script that executes the different steps of the migration process.

Once the image is created, we’ll be able to deploy it as a Docker container that runs the database migration process.

Now, let’s see the steps needed to create our Docker image.

The Dockerfile

In order to build a Docker image, we’ll need to create our custom Dockerfile that contains the instructions for building the image.

Let’s base our image on an OpenJDK 8 image, since Java is needed to run Kettle to migrate the data. Note that Kettle currently doesn’t support JDK 9 or higher:

FROM adoptopenjdk/openjdk8

We’ll also need to define a few environment variables to configure the source and target databases:

ENV SRC_HOST=
ENV SRC_PORT=1433
ENV SRC_USER=
ENV SRC_PWD=
ENV SRC_DB=

ENV DST_HOST=
ENV DST_PORT=5432
ENV DST_USER=
ENV DST_PWD=
ENV DST_DB=

Now let’s configure the working directory we’ll be using:

ENV MIGRATIONDIR=/opt/data_migration

RUN mkdir -p $MIGRATIONDIR

Next, we’ll need to install a few CLI tools: perl, netcat, wget, unzip and the PostgreSQL client:

RUN apt-get update; apt-get install perl netcat -y; \
    apt-get install wget unzip postgresql-client -y

After that we’ll need to install Pentaho Data Integrator (Kettle) to run the data migration, as it’s required by sqlserver2pgsql:

RUN wget --progress=dot:giga https://sourceforge.net/projects/pentaho/files/latest/download?source=files -O /tmp/kettle.zip; \
    unzip /tmp/kettle.zip -d /tmp/kettle; \
    mv /tmp/kettle/data-integration $MIGRATIONDIR; \
    chmod -R +x $MIGRATIONDIR/data-integration/*.sh

Then we’ll also need to install jTDS, an open source JDBC driver used by sqlserver2pgsql to connect to SQL Server in the Kettle job:

RUN wget https://sourceforge.net/projects/jtds/files/latest/download?source=files -O /tmp/jtds.zip; \
    unzip /tmp/jtds.zip -d /tmp/jtds; \
    cp /tmp/jtds/jtds-*.jar $MIGRATIONDIR/data-integration/lib/; \
    rm -Rf /tmp/jtds;rm -f /tmp/jtds.zip

The last thing to install is sqlserver2pgsql. Please note that we may also need a few sed commands to customize the way we connect to the source and target databases:

RUN wget https://raw.githubusercontent.com/dalibo/sqlserver2pgsql/master/sqlserver2pgsql.pl -P $MIGRATIONDIR; \
    # need ssl=require attribute to connect to Azure SQL:
    sed -i 's#<attribute><code>EXTRA_OPTION_MSSQL.instance#<attribute><code>EXTRA_OPTION_MSSQL.ssl</code><attribute>require</attribute></attribute><attribute><code>EXTRA_OPTION_MSSQL.instance#g' $MIGRATIONDIR/sqlserver2pgsql.pl; \
    # using certs to connect to Cloud SQL for Postgres:
    sed -i "s#<attribute><code>EXTRA_OPTION_POSTGRESQL.reWriteBatchedInserts#<attribute><code>EXTRA_OPTION_POSTGRESQL.ssl</code><attribute>true</attribute></attribute>\n<attribute><code>EXTRA_OPTION_POSTGRESQL.sslmode</code><attribute>verify-ca</attribute></attribute>\n<attribute><code>EXTRA_OPTION_POSTGRESQL.sslcert</code><attribute>$MIGRATIONDIR/conf/client-cert.pem</attribute></attribute>\n<attribute><code>EXTRA_OPTION_POSTGRESQL.sslkey</code><attribute>$MIGRATIONDIR/conf/client-key.pk8</attribute></attribute>\n<attribute><code>EXTRA_OPTION_POSTGRESQL.sslrootcert</code><attribute>$MIGRATIONDIR/conf/server-ca.pem</attribute></attribute>\n<attribute><code>EXTRA_OPTION_POSTGRESQL.reWriteBatchedInserts#g" $MIGRATIONDIR/sqlserver2pgsql.pl; \
    chmod +x $MIGRATIONDIR/sqlserver2pgsql.pl

The first sed command patches sqlserver2pgsql.pl so that the generated Kettle job includes the ssl=require param in the SQL Server JDBC URL. Note that this is only needed when connecting to Azure SQL; otherwise it can be excluded.

The second sed command is only needed if connecting to PostgreSQL with SSL certs; otherwise it can be excluded, too.

When SSL certs are used, we’ll also need to copy them to a folder the Docker container can access during the migration process (see details later). To be able to use SSL certs with PostgreSQL, the second sed command patches sqlserver2pgsql.pl so that the generated Kettle job includes the following params in the PostgreSQL JDBC URL:

ssl=true
sslmode=verify-ca
sslcert=$MIGRATIONDIR/conf/client-cert.pem
sslkey=$MIGRATIONDIR/conf/client-key.pk8
sslrootcert=$MIGRATIONDIR/conf/server-ca.pem
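
For reference, with these options in place the JDBC URLs assembled by Kettle end up looking roughly like the following (illustrative only; the actual URLs are built by Kettle from the connection settings and the extra options above):

jdbc:jtds:sqlserver://<SQL Server host>:1433/<db>;ssl=require
jdbc:postgresql://<PostgreSQL host>:5432/<db>?ssl=true&sslmode=verify-ca&sslcert=/opt/data_migration/conf/client-cert.pem&sslkey=/opt/data_migration/conf/client-key.pk8&sslrootcert=/opt/data_migration/conf/server-ca.pem&reWriteBatchedInserts=true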

Finally, we need to copy in the migrate.sh script that we’ll use to run the migration, and set the working directory:

COPY ./scripts /scripts
RUN chmod +x /scripts/*.sh

WORKDIR $MIGRATIONDIR

In the above code snippet we’re copying the local ./scripts folder that contains our script file into the image.
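
For reference, the project layout assumed throughout this post looks like this (the conf folder is populated in later steps):

.
├── Dockerfile
├── conf/
│   ├── schema.sql        # exported from SQL Server Management Studio (see below)
│   ├── server-ca.pem     # PostgreSQL SSL certs, only if cert based authentication is used
│   ├── client-cert.pem
│   └── client-key.pem
└── scripts/
    └── migrate.sh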

Migration script

The migrate.sh script is responsible for creating the database schema in PostgreSQL and migrating the data from Microsoft SQL Server to PostgreSQL using the Kettle job generated by sqlserver2pgsql.

Now, let’s look into the content of the script.

It will need bash to run:

#!/bin/bash

Let’s also configure the script to fail if any of its commands fail:

set -e

Then we’ll need to run the sqlserver2pgsql.pl script to convert the SQL Server schema file to PostgreSQL scripts and to generate the Kettle job that will perform the data migration between the source and target databases:

echo !!! Creating Kettle job && \
./sqlserver2pgsql.pl -b before.sql -a after.sql -u unsure.sql -k kettlejobs -stringtype_unspecified -f conf/$SCHEMA_FILE \
  -sh $SRC_HOST -sp $SRC_PORT -su $SRC_USER -sw $SRC_PWD -sd $SRC_DB \
  -ph $DST_HOST -pp $DST_PORT -pu $DST_USER -pw $DST_PWD -pd $DST_DB

Once this command is executed, sqlserver2pgsql.pl will have generated the Kettle job at $MIGRATIONDIR/kettlejobs/migration.kjb, which we’ll come back to shortly.

The before.sql script will contain the Postgres SQL statements to create the table structure, and the after.sql script will contain the SQL statements to add primary keys, foreign keys, constraints, etc.
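
If you’d like to sanity check the generated DDL before applying it, a quick look at these files is enough (a minimal sketch):

# optional: inspect the generated DDL before applying it
head -n 20 before.sql                # CREATE TABLE statements
grep -c "ALTER TABLE" after.sql      # rough count of key/constraint statements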

Next we’ll run the before.sql script.

Use the command below if using SSL certs to authenticate with PostgreSQL (note the PGSSLMODE, PGSSLROOTCERT, PGSSLCERT and PGSSLKEY environment variables needed for that):

echo !!! Executing before.sql && \
# restricting access to the key file as per psql requirements:
chmod 0600 conf/client-key.pem && \
PGPASSWORD=$DST_PWD PGSSLMODE=verify-ca PGSSLROOTCERT=conf/server-ca.pem PGSSLCERT=conf/client-cert.pem PGSSLKEY=conf/client-key.pem psql -h $DST_HOST -p $DST_PORT -U $DST_USER -d $DST_DB -f before.sql

Use the command below if only using a username / password to authenticate with PostgreSQL:

echo !!! Executing before.sql && \
PGPASSWORD=$DST_PWD psql -h $DST_HOST -p $DST_PORT -U $DST_USER -d $DST_DB -f before.sql

The following step is optional and only needed when superuser permissions are not granted to our PostgreSQL user, as is the case when using Google Cloud SQL for PostgreSQL, for example:

# removing SQL code that was causing issue in GCP due to lack of superuser permissions (https://github.com/dalibo/sqlserver2pgsql/issues/124):
sed -i 's/DROP CAST IF EXISTS (varchar as date)//g' kettlejobs/migration.kjb
sed -i 's/CREATE CAST (varchar as date) with inout as implicit;//g' kettlejobs/migration.kjb
sed -i 's/DROP CAST (varchar as date)//g' kettlejobs/migration.kjb

Now it’s time to run the Kettle job:

echo !!! Running Kettle job && \
data-integration/kitchen.sh -file=kettlejobs/migration.kjb -level=rowlevel

Lastly, let’s run the after.sql script. As with before.sql, we need to use a different command depending on whether we’re using SSL cert based authentication with PostgreSQL or not.

Use the command below if using SSL certs to authenticate with PostgreSQL (note the PGSSLMODE, PGSSLROOTCERT, PGSSLCERT and PGSSLKEY environment variables needed for that):

echo !!! Executing after.sql && \
PGPASSWORD=$DST_PWD PGSSLMODE=verify-ca PGSSLROOTCERT=conf/server-ca.pem PGSSLCERT=conf/client-cert.pem PGSSLKEY=conf/client-key.pem psql -h $DST_HOST -p $DST_PORT -U $DST_USER -d $DST_DB -f after.sql

Use the command below if not using SSL certs to authenticate with PostgreSQL:

echo !!! Executing after.sql && \
PGPASSWORD=$DST_PWD psql -h $DST_HOST -p $DST_PORT -U $DST_USER -d $DST_DB -f after.sql

Create container image

To build our Docker image, run the following command:

docker build -t sqlserver2psql .

Export schema using MS SQL Server Management Studio

Before starting the data migration, we’ll need to export the schema file from SQL Server. See the steps below to do that using SQL Server Management Studio.

Note: these steps may slightly vary for different SQL Server Management Studio versions.

Step #1

In SQL Server Management Studio, right-click the database you want to export and select Tasks / Generate Scripts…

Figure 5: visual instructions for Step #1

Step #2

Click “Next” on the welcome screen (if it hasn’t already been disabled)

Figure 6: visual instructions for Step #2

Step #3

Select the tables you want to export (or select all), then click “Next”

Figure 7: visual instructions for Step #3

Step #4

Select ‘Save as a script file’, choose the ‘Single script file’ option and save the script as schema.sql under the <project root dir>/conf folder (create the folder if necessary).

Also, select Unicode encoding so that object names containing special characters are exported correctly and don’t cause any unnecessary issues. Click “Next”

Figure 8: visual instructions for Step #4

Step #5

Click “Next” on the ‘Review your selections’ screen

Figure 9: visual instructions for Step #5

Step #6

Click “Finish” once the process has completed

Figure 10: visual instructions for Step #6

Download certs for PostgreSQL

Note: This step is optional and only needed if SSL certs are used for PostgreSQL authentication.

For example, in the case of Google Cloud, follow the steps at https://cloud.google.com/sql/docs/postgres/configure-ssl-instance to download the certs. You may need to download the client key from GCP Secret Manager.

Once the certs are downloaded, copy them to folder <project root dir>/conf.

Make sure the cert files are saved with these names: server-ca.pem, client-cert.pem and client-key.pem.
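
Note that psql uses the PEM files directly, while the Kettle job configured in the Dockerfile expects the client key in PKCS#8 format (conf/client-key.pk8). Assuming the downloaded key is client-key.pem, one way to produce that file is with openssl:

# convert the PEM client key to the PKCS#8 DER format expected by the PostgreSQL JDBC driver
openssl pkcs8 -topk8 -inform PEM -outform DER \
  -in conf/client-key.pem -out conf/client-key.pk8 -nocrypt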

Running the database migration

Now that we have completed all the prerequisites, we’re ready to start the data migration process by running the /scripts/migrate.sh script in the Docker container that we created. As discussed earlier, the script will execute these steps:

  1. Run sqlserver2pgsql to generate the PostgreSQL DDL scripts and the Kettle job for the data migration

  2. Run the generated before.sql PostgreSQL DDL script to create the db tables

  3. Run the generated Kettle job to migrate the data between the databases

  4. Run the after.sql PostgreSQL DDL script to complete the db schema migration (e.g. adding the missing primary keys, foreign keys, constraints, etc.)

See the command below to run the Docker container and migrate the data. Note that here we’re mounting the conf dir, which includes the schema.sql file and, if using SSL cert based authentication with PostgreSQL, the certs needed by the migrate.sh script (see above). We also pass SCHEMA_FILE=schema.sql so that migrate.sh picks up the schema file we exported earlier:

docker run --name sqlserver2psql --rm -e SRC_HOST=<SQL Server host> \
  -e SRC_USER=<SQL Server username> -e SRC_PWD="<SQL Server password>" \
  -e SRC_DB=<SQL Server db name> -e DST_HOST=<PostgreSQL host> \
  -e DST_PORT=5432 -e DST_USER=<PostgreSQL username> \
  -e DST_PWD="<PostgreSQL password>" -e DST_DB=<PostgreSQL db name> \
  -e SCHEMA_FILE=schema.sql \
  --mount type=bind,source="$(pwd)"/conf,target=/opt/data_migration/conf \
  sqlserver2psql /scripts/migrate.sh

Notice the double quotes around the passwords, as they will be needed in case a password includes special characters.

Verifying the database migration

Once the data is migrated to the target database, it is critical to verify that the migration process completed successfully and that data integrity is maintained in the target database. Specifically, we need to make sure that no data was lost and that there are no data mismatches.

Also, we’ll need to verify that the application code can connect to the target database and access the migrated data.

There are different levels of verification to confirm the success of the data migration process, and ideally it’s best to automate as many of these steps as possible.

Below is a list of our recommended verification steps:

  1. Verify the target schema matches the source schema: ensure that the original primary keys, foreign keys, matching column types, non-null columns, default column values, indices, constraints, etc. are still in place

  2. Verify table count

  3. Verify table names

  4. Verify record counts in each table (see the sketch after this list)

  5. Verify data within each table

    1. Either by comparing all records or only a subset of all records within each table

    2. If this cannot be fully automated, a semi-automated approach could be to export data from both source and target databases into CSV files and do a diff merge on those CSV files for each table

  6. Verify application code doesn’t break with the target database: run integration, functional and regression tests while the application code is connected to the target database.

  7. Verify application code performance: running load tests can confirm no performance degradation was introduced when executing SQL statements in the target database
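
As an example of step 4, here’s a minimal sketch that lists each table in the target database and counts its rows. It assumes the same connection settings used in migrate.sh; a similar count taken on the source database can then be diffed against its output:

# count the rows in every table of the target PostgreSQL database
export PGPASSWORD=$DST_PWD
for t in $(psql -h $DST_HOST -p $DST_PORT -U $DST_USER -d $DST_DB -At \
      -c "SELECT tablename FROM pg_tables WHERE schemaname = 'public'"); do
  c=$(psql -h $DST_HOST -p $DST_PORT -U $DST_USER -d $DST_DB -At -c "SELECT count(*) FROM \"$t\"")
  echo "$t,$c"
done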

Both the data migration and the verification processes should be run in lower environments first to minimize risks of things going wrong in the production environment.

Testing these processes in staging / pre-prod environments may also help forecast the expected downtime in the production environment, which can then be communicated to project stakeholders and customers.
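
For example, simply timing an end-to-end run against a staging copy of the databases gives a rough baseline for the production migration window:

# rough downtime estimate: time a full migration run in staging
# (same docker run options as shown in the previous section, abbreviated here)
time docker run --name sqlserver2psql --rm \
  <same -e and --mount options as above> \
  sqlserver2psql /scripts/migrate.sh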

Conclusion

In this post we discussed the steps needed to migrate data from SQL Server to PostgreSQL using sqlserver2pgsql running in a Docker container.

Here’s the link to the git repository with the source code for building the Docker image: https://github.com/Nuvalence/sqlserver2pgsql-docker.