How to convert Unicode text to plain ASCII

Let’s say you have a string that was entered by somebody outside your control. Maybe it reached you via a call to your web service from a remote client, perhaps indirectly via somebody else’s software. You need to ensure that the string contains only ASCII characters, but you still want it to be human-readable. How? Most answers you’ll find online when you ask how to do this do not do what I want; they explain why you shouldn’t replace accented characters with unaccented ones.

And they’re right. Sure, accented characters and unaccented characters are not the same. But I know of one time when I don’t care:

  1. When sending mobile text (SMS) messages via a third party, and that third party fails to send the messages if Unicode characters outside the range 0 to 127 are included.

So while converting accented characters to unaccented characters might not be strictly correct, it is as correct as I need in my use case. The answer is the String.Normalize method, which I’d never heard of.

The sample code that follows produces this output, where the first line is the input string and the second is the normalized ASCII string:

    Räksmörgås
    Raksmorgas

And here’s the code. I’d credit whoever wrote the LatinToAscii method, which I found on StackOverflow, but I lost the link. (Although it works for me, it wasn’t even the accepted answer.)

using System;
using System.Linq;
using System.Text;

namespace AsciiTest
{
    class Program
    {
        static void Main(string[] args)
        {
            string inputString = "Räksmörgås";
            string asAscii = LatinToAscii(inputString);

            Console.WriteLine(inputString);
            Console.WriteLine(asAscii);

            Console.ReadKey();
        }

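        // Decomposes the string (compatibility decomposition, form KD) so that an
        // accented letter becomes a base letter followed by combining marks, then
        // keeps only the characters in the ASCII range (0 to 127).
        private static string LatinToAscii(string inString)
        {
            return new string(inString.Normalize(NormalizationForm.FormKD)
                                      .Where(x => x < 128)
                                      .ToArray());
        }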
    }
}

It should be obvious what the code does, so I’m not going to explain it. For a thorough understanding, see this explanation of the NormalizationForm value passed to the method.
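
For anyone who does want to see the two steps separately, here is a minimal sketch. The class name NormalizeDemo and the method name ToAscii are made up for this illustration; the method body is the same approach as LatinToAscii. It prints the code points that "ä" decomposes into, and then shows a case worth knowing about: a character with no decomposition, such as "ß", is simply dropped by the filter.

```csharp
using System;
using System.Linq;
using System.Text;

static class NormalizeDemo
{
    // Hypothetical helper, same approach as LatinToAscii:
    // decompose with form KD, then keep only chars below 128.
    public static string ToAscii(string input)
    {
        return new string(input.Normalize(NormalizationForm.FormKD)
                               .Where(c => c < 128)
                               .ToArray());
    }

    static void Main()
    {
        // Step 1: "ä" (U+00E4) decomposes into 'a' (U+0061)
        // followed by a combining diaeresis (U+0308).
        string decomposed = "ä".Normalize(NormalizationForm.FormKD);
        foreach (char c in decomposed)
        {
            Console.WriteLine("U+{0:X4}", (int)c);
        }

        // Step 2: the filter removes the combining mark, leaving the base letter.
        Console.WriteLine(ToAscii("Räksmörgås")); // Raksmorgas

        // "ß" has no decomposition, so nothing below 128 is left of it.
        Console.WriteLine(ToAscii("Straße"));     // Strae
    }
}
```

So if your input might contain letters like "ß" or "ø", you would probably need to replace those yourself before normalizing, since this approach removes them rather than transliterating them.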


About Jerome

I am a senior C# developer in Johannesburg, South Africa. I am also a recovering addict, who spent nearly eight years using methamphetamine. I write on my recovery blog about my lessons learned and sometimes give advice to others who have made similar mistakes, often from my viewpoint as an atheist, and I also write some C# programming articles on my programming blog.