How to Scrub Characters from Text in SQL Server

Written by Ryan Geide 03/13/20


One particularly tedious task that folks developing ETL operations in T-SQL face at least once in their careers is the task of scrubbing textual data of unwanted – and sometimes non-displayable – characters. Somehow these special characters that cannot be displayed end up in the data; few know how or why and obviously they cannot be seen so we find out about them when some process chokes in the dead of night on a weekend because the antiquated “BEL” character somehow showed up in the data.

Perhaps your marketing team copy/pastes product descriptions from a manufacturer’s website (special characters and all) into a product’s description field.  The bullet character is preserved and other applications are unsure how to display this alien character. We don’t want to remove the bullet character, rather we’d like to replace it with an asterisk; maybe we are storing text data in a data-type that cannot handle characters with diacritical marks and we want to make ñ characters just be “n”.

The task of scrubbing the data has fallen to us and we are not quite sure how best to approach the problem. We are sort of interested in a performant solution, but we want accuracy and versatility. Below we’ll walk through a simple solution to remove or replace characters in text data.

Let’s start by outlining the major components:

1.     A scalar function that text data runs through and comes out clean

2.     A table that stores our character substitution mappings. Characters to be removed map to an empty value.

The below solution uses PATINDEX to identify characters and STUFF to replace them:

 

DECLARE @strInput                   NVARCHAR(4000);

DECLARE @intPosition  SMALLINT; –Position of bad character

DECLARE @strPattern               NVARCHAR(4000);  –Bad characters to look for

 

SET @strInput = ‘José Calderón’;

 

DECLARE @tblCharacterMappings TABLE

(

            tKey     NVARCHAR(2) PRIMARY KEY,

            tValue  NVARCHAR(2)

);

 

INSERT INTO @tblCharacterMappings (tKey, tValue) VALUES

(N’é’, N’e’),

(N’ó’, N’o’);

 

SELECT @strPattern = SUBSTRING

(

            (

                        SELECT 

                                    ” + tKey

                        FROM 

                                    @tblCharacterMappings

                        FOR XML PATH(”)

            ), 1, 10000

);

 

SET @strPattern = ‘%[‘+ @strPattern + ‘]%’;

SET @intPosition = PATINDEX(@strPattern, @strInput);

 

WHILE @intPosition > 0

BEGIN

            SELECT 

                        @strInput = STUFF(@strInput, @intPosition, 1, tValue),

                        @intPosition = PATINDEX(@strPattern, @strInput) 

            FROM

                        @tblCharacterMappings

            WHERE

                        tKey = SUBSTRING(@strInput, @intPosition, 1);

END

                                    

SELECT @strInput;

 

 

This simple solution could easily be made into a scalar function, if you want; however, be cautious of how you use scalar functions. Try to keep them out of the WHERE clause unless you know what you are doing and certainly keep them out of your JOIN logic!.

Leave a Reply

Your email address will not be published. Required fields are marked *