Sunday, February 13, 2011

How To Create XML in C# with UTF-8 Encoding

I can’t believe that I spent over an hour on this problem.  It’s one of those things where you can’t understand why it doesn’t just work. But it doesn’t, so here’s the solution I found for anyone who might have the same issue.

The Problem: My XML refuses to use UTF-8 encoding

I have some code where I’m doing a query and creating an XML doc from the data that’s returned.  I used LINQ to XML to create my XML and package it up in an XDocument.  The XDocument allows me to set the XML version and the encoding. Here’s the code I used to create and return my XDocument.

            var declaration = new XDeclaration("1.0", "utf-8", "yes");
            return new XDocument(declaration, jobs);

So far so good.  Now the last (and what should be trivial) step is to write that XML to a string and return it.  This is where things go wrong.  To get an XDocument object to write out a full XML document you call its Save method.  No problem, the Save method has an overload that takes a TextWriter, and StringWriter is a TextWriter, so I’ll just create a StringWriter and a StringBuilder and I’ll be in business.  Here’s my initial code.

        // GetFeedXML
        public virtual string GetFeedXML()
        {
            List<JobFeedItem> list = GetFeedData();
            XDocument xdoc = BuildXmlFor(list);
            var sb = new StringBuilder();
            var sw = new StringWriter(sb);
            xdoc.Save(sw);
            return sb.ToString();
        }

That all looks good. So I run it and here’s the result.

    <?xml version="1.0" encoding="utf-16" standalone="yes" ?>
    <jobs>
      <job>
        <title>Audit Engagement Manager</title>

What the heck is that???  UTF-16??  It’s like the UTF-8 encoding that I set in my XDocument was completely ignored and .NET decided to use UTF-16 encoding.  How can this not work?  Where else do I even have an option to set the encoding? The StringWriter won’t let me set the encoding, and the StringBuilder certainly won’t let me set the encoding.
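
A quick way to see where the UTF-16 comes from is to ask a plain StringWriter what encoding it reports. The sketch below assumes that Save, when handed a TextWriter, takes the encoding name for the declaration from that writer rather than from the XDeclaration, which matches the output above. The EncodingProbe class and its method names are made up for illustration.

```csharp
using System.IO;
using System.Text;
using System.Xml.Linq;

// Hypothetical helper, for illustration only.
public static class EncodingProbe
{
    // Returns the encoding name that a plain StringWriter reports.
    public static string StringWriterEncodingName()
    {
        using (var sw = new StringWriter())
        {
            // .NET strings are UTF-16 in memory, and StringWriter says so.
            return sw.Encoding.WebName;
        }
    }

    // Saves a document that declares utf-8 through a plain StringWriter
    // and returns the resulting text so the declaration can be inspected.
    public static string SaveThroughStringWriter()
    {
        var doc = new XDocument(
            new XDeclaration("1.0", "utf-8", "yes"),
            new XElement("jobs"));

        using (var sw = new StringWriter())
        {
            doc.Save(sw);
            return sw.ToString();
        }
    }
}
```

Calling EncodingProbe.StringWriterEncodingName() gives "utf-16", and the text from SaveThroughStringWriter() carries that same name in its declaration even though the XDeclaration asked for utf-8.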

So I started Googling and found that other people were having the same problem and I found several solutions that sounded plausible.  Most of them centered around creating an XmlWriter with a settings object that would allow me to set the encoding.  All of these required a fair bit of extra code and I never got one of them to actually work, but they did get me looking at the StringWriter as the place where I needed to set my encoding.

The Solution: Subclass StringWriter

At this point I’m looking into progressively more and more complex solutions when I stumble across a post by Ian Dykes called “Writing XML with UTF-8 Encoding using XmlTextWriter and StringWriter”.  Ian basically says to create a new StringWriterWithEncoding class that inherits from StringWriter but allows you to set the encoding in the constructor.  I used the same idea and created the StringWriterUtf8 class below.  Instead of taking the encoding in a constructor, I opted to make the Encoding property always return UTF8.

    public class StringWriterUtf8 : StringWriter
    {
        public StringWriterUtf8(StringBuilder sb) : base(sb)
        {
        }

        public override Encoding Encoding
        {
            get { return Encoding.UTF8; }
        }
    }
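
For comparison, the constructor-based variant that Ian's post describes might look something like the sketch below. This is a guess at the shape of such a class, not Ian's actual code; the field name and the fallback to the base encoding are assumptions.

```csharp
using System.IO;
using System.Text;

// Sketch of a StringWriter that reports whatever encoding it is given.
// Not taken from Ian's post; details here are assumed for illustration.
public class StringWriterWithEncoding : StringWriter
{
    private readonly Encoding _encoding;

    public StringWriterWithEncoding(StringBuilder sb, Encoding encoding) : base(sb)
    {
        _encoding = encoding;
    }

    public override Encoding Encoding
    {
        // Fall back to StringWriter's own answer (UTF-16) if none was supplied.
        get { return _encoding ?? base.Encoding; }
    }
}
```

With this version the caller picks the encoding, for example new StringWriterWithEncoding(sb, Encoding.UTF8), at the cost of one more constructor argument.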

Now I just need to use StringWriterUtf8 in my code instead of StringWriter, and the writer will report UTF-8 as its encoding, which is what ends up in the XML declaration.  I did it, I tested it, and it worked.  My XML output now looks like this:

    <?xml version="1.0" encoding="utf-8" standalone="yes" ?>
    <jobs>
      <job>
        <title>Audit Engagement Manager</title>

Thank you, Ian, for understanding more about this than I do and taking the time to put the solution out on your blog.  By the way, I think it’s worth mentioning that several people posted more “settings type” solutions in the comments to Ian’s post.  I tried them.  They didn’t work for me.  The only thing that did work was Ian’s idea of subclassing the StringWriter.  So I now have a working method and here’s the final version of GetFeedXML that uses my StringWriterUtf8 class.

        // GetFeedXML
        public virtual string GetFeedXML()
        {
            List<JobFeedItem> list = GetFeedData();
            XDocument xdoc = BuildXmlFor(list);
            var sb = new StringBuilder();
            var sw = new StringWriterUtf8(sb);
            xdoc.Save(sw);
            return sb.ToString();
        }
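
One caveat about the method above: what it returns is still a .NET string, which is UTF-16 in memory no matter what the declaration says. The declaration only becomes true once that string is written out as UTF-8 bytes. If bytes are what you actually need (for a file or an HTTP response, say), a rough sketch using a MemoryStream and XmlWriterSettings could look like this. The Utf8XmlBytes class and ToUtf8Bytes method are made-up names, and skipping the byte order mark is a choice made here, not a requirement.

```csharp
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Linq;

// Hypothetical helper, for illustration only.
public static class Utf8XmlBytes
{
    // Serializes the document to real UTF-8 bytes.
    public static byte[] ToUtf8Bytes(XDocument doc)
    {
        var settings = new XmlWriterSettings
        {
            Encoding = new UTF8Encoding(false), // false = no byte order mark
            Indent = true
        };

        using (var ms = new MemoryStream())
        {
            using (XmlWriter writer = XmlWriter.Create(ms, settings))
            {
                doc.Save(writer);
            } // disposing the writer flushes its output into the stream

            return ms.ToArray();
        }
    }
}
```

Because the XmlWriter is writing to a stream with an explicit encoding, the declaration and the bytes agree with each other, so no StringWriter subclass is needed on this path.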

9 comments:

  1. Yeah, I remember this one :( It made me crazy too.
    Thanks for sharing. You have a GREAT blog, keep up the good work ;)

  2. Thanks, got it working !

  3. StringBuilders are based on strings, and strings in .NET are UTF-16, so you can declare a separate encoding, but the string is still UTF-16. So what's happening is the XML writer is just responding to what the actual encoding is.

    For real UTF-8 encoding you must use a memory stream, and then call ToArray() to get the result as a byte[]. If you just ToString() it, you end up getting .NET to re-convert the UTF-8 characters back to UTF-16.

  4. Thanks Anonymous, that makes sense. So it sounds like the approach that I used will work as long as the string eventually gets written out as UTF-8, but it's not really correct because technically the string in memory is still UTF-16.

  5. Thank You!!! It works!!!!!

  6. Thanks very much. I am a Java dev working on .NET for the first time, and your blog helped me resolve my issue. Cheers mate!

  7. Thanks for sharing your info. I really appreciate your efforts and I will be waiting for your further write-ups. Thanks once again.

  8. Thank you. You made my day!
