If you don't know anything about the byte order mark, Unicode, and character sets, I encourage you to read this first: joelonsoftware.com
An extract from the MSDN docs:
Byte Order Mark
Always prefix a Unicode plain text file with a byte order mark, which informs an application receiving the file that the file is byte-ordered. Because Unicode plain text is a sequence of 16-bit code values, it is sensitive to the byte ordering used when the text is written.
The following table lists the available byte order marks.
| Byte order mark | Description |
|---|---|
| EF BB BF | UTF-8 |
| FF FE | UTF-16, little endian |
| FE FF | UTF-16, big endian |
| FF FE 00 00 | UTF-32, little endian |
| 00 00 FE FF | UTF-32, big endian |
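To make the table concrete, here is a minimal sketch of how the leading bytes of a buffer could be matched against those byte order marks. The class and method names (`BomSniffer`, `DetectBom`) are made up for illustration and are not part of any framework. Note the assumption baked into the ordering: FF FE 00 00 is treated as UTF-32 little endian, although the same bytes could also be a UTF-16 little-endian BOM followed by a NUL character.

```csharp
using System;

public static class BomSniffer
{
    // Hypothetical helper: returns the encoding implied by a leading BOM,
    // or null when the buffer does not start with one of the marks in the table.
    public static string DetectBom(byte[] b)
    {
        if (b == null) return null;

        // The 4-byte marks are checked first because FF FE 00 00 (UTF-32 LE)
        // starts with the same two bytes as FF FE (UTF-16 LE).
        if (b.Length >= 4 && b[0] == 0xFF && b[1] == 0xFE && b[2] == 0x00 && b[3] == 0x00) return "UTF-32LE";
        if (b.Length >= 4 && b[0] == 0x00 && b[1] == 0x00 && b[2] == 0xFE && b[3] == 0xFF) return "UTF-32BE";
        if (b.Length >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF) return "UTF-8";
        if (b.Length >= 2 && b[0] == 0xFF && b[1] == 0xFE) return "UTF-16LE";
        if (b.Length >= 2 && b[0] == 0xFE && b[1] == 0xFF) return "UTF-16BE";
        return null;
    }

    public static void Main()
    {
        // EF BB BF followed by '<' (0x3C): a UTF-8 document with a BOM.
        byte[] sample = { 0xEF, 0xBB, 0xBF, 0x3C };
        Console.WriteLine(DetectBom(sample) ?? "no BOM");
    }
}
```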
You can read more about the above, and about what exactly the byte order mark is, here: msdn
Here I want to discuss how to produce an XML stream response in .NET that includes a byte order mark, which tells the XML processor what encoding the document is in so that it does not have to guess.
The byte order mark is an encoding signature, not part of either the markup or the character data of the XML document. XML processors must be able to use this character to differentiate between UTF-8 and UTF-16 encoded documents.
So what should happen in the absence of a byte order mark, according to the W3C?
In the absence of external character encoding information (such as MIME headers), parsed entities which are stored in an encoding other than UTF-8 or UTF-16 must begin with a text declaration. The text declaration must be provided literally, not by reference to a parsed entity. No text declaration may appear at any position other than the beginning of an external parsed entity.
For example:
<?xml version="1.0" encoding="UTF-16"?>
Now, what happens if the encoding is omitted, according to the W3C guidelines?
In the absence of information provided by an external transport
protocol (e.g. HTTP or MIME), it is an error for an entity including an
encoding declaration to be presented to the XML processor in an
encoding other than that named in the declaration, or for an entity
which begins with neither a Byte Order Mark nor an encoding declaration
to use an encoding other than UTF-8. Note that since ASCII is a subset
of UTF-8, ordinary ASCII entities do not strictly need an encoding
declaration.
Note the phrase "other than UTF-8": UTF-8 is the default encoding the processor falls back to, so in this case your document had better be encoded in UTF-8.
Now, what happens if the encoding is omitted when reading with the XmlTextReader class in .NET?
If no encoding attribute exists, and there is no byte-order mark, this defaults to UTF-8.
OK, and what if no byte order mark is present but an encoding is declared in the document, as in
<?xml version="1.0" encoding="UTF-16"?>
How does the processor know how to read the first bytes of the declaration in order to get to the encoding attribute?
Appendix F, Autodetection of Character Encodings (Non-Normative), extracted from http://www.w3.org/TR/2000/REC-xml-20001006.html#sec-guessing :
The XML encoding declaration functions as an internal label on each entity, indicating which character encoding is in use. Before an XML processor can read the internal label, however, it apparently has to know what character encoding is in use--which is what the internal label is trying to indicate. In the general case, this is a hopeless situation. It is not entirely hopeless in XML, however, because XML limits the general case in two ways: each implementation is assumed to support only a finite set of character encodings, and the XML encoding declaration is restricted in position and content in order to make it feasible to autodetect the character encoding in use in each entity in normal cases. Also, in many cases other sources of information are available in addition to the XML data stream itself. Two cases may be distinguished, depending on whether the XML entity is presented to the processor without, or with, any accompanying (external) information.
For more on the two guessing cases above, have a read here: http://www.w3.org/TR/2000/REC-xml-20001006.html#sec-guessing
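As a rough illustration of the "without a byte order mark" case, the sketch below matches the first four bytes of an entity against the byte patterns that the characters `<?xm` produce in a few encoding families, as tabulated in that appendix. The names `XmlDeclSniffer` and `GuessFamily` are hypothetical, and only a handful of the patterns from the appendix are covered. A real processor would go on to read the declaration in the guessed family and then switch to the exact encoding it names.

```csharp
using System;
using System.Text;

public static class XmlDeclSniffer
{
    // Hypothetical helper: guesses the encoding family from the first four
    // bytes of an entity that starts with "<?xml" and has no BOM.
    public static string GuessFamily(byte[] b)
    {
        if (b == null || b.Length < 4) return "unknown";

        if (b[0] == 0x00 && b[1] == 0x00 && b[2] == 0x00 && b[3] == 0x3C) return "UCS-4 big-endian";
        if (b[0] == 0x3C && b[1] == 0x00 && b[2] == 0x00 && b[3] == 0x00) return "UCS-4 little-endian";
        if (b[0] == 0x00 && b[1] == 0x3C && b[2] == 0x00 && b[3] == 0x3F) return "UTF-16 big-endian";
        if (b[0] == 0x3C && b[1] == 0x00 && b[2] == 0x3F && b[3] == 0x00) return "UTF-16 little-endian";
        // "<?xm" as single bytes: UTF-8, ASCII, ISO 8859-x and similar.
        if (b[0] == 0x3C && b[1] == 0x3F && b[2] == 0x78 && b[3] == 0x6D) return "ASCII-compatible";
        return "unknown";
    }

    public static void Main()
    {
        // Encoding.Unicode is UTF-16 little endian; GetBytes does not emit a BOM.
        byte[] utf16 = Encoding.Unicode.GetBytes("<?xml version=\"1.0\" encoding=\"UTF-16\"?>");
        Console.WriteLine(GuessFamily(utf16));
    }
}
```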
OK, so how do you set the byte order mark for your XML documents in .NET and avoid all the guessing?
In .NET, the XmlReader class scans the first bytes of the stream looking for a byte order mark or other sign of the encoding. Once the encoding is determined, it is used to read the rest of the stream, and parsing continues over the input as a stream of (Unicode) characters. That is as far as a conforming processor goes. When writing your XML documents, however, you can set the encoding through the XmlTextWriter constructor, which takes an encoding parameter. Pretty simple:
C#:
// UnicodeEncoding(bigEndian: true, byteOrderMark: true, throwOnInvalidBytes: true)
XmlTextWriter w = new XmlTextWriter(Response.OutputStream, new UnicodeEncoding(true, true, true));
As you may have guessed already, the UnicodeEncoding constructor takes parameters that specify whether to use big-endian byte order, whether to provide a Unicode byte order mark, and (in the three-argument overload used here) whether to throw an exception when invalid bytes are encountered.
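To see what those constructor arguments do to the preamble, a small console sketch like the following can help. `PreambleDemo` and `Hex` are illustrative names only; `GetPreamble` and `BitConverter.ToString` are standard framework calls.

```csharp
using System;
using System.Text;

public static class PreambleDemo
{
    // Formats an encoding's preamble (its BOM, if any) as hyphen-separated hex.
    // An encoding created without a BOM has an empty preamble.
    public static string Hex(Encoding encoding)
    {
        return BitConverter.ToString(encoding.GetPreamble());
    }

    public static void Main()
    {
        Console.WriteLine("UTF-16 BE with BOM: " + Hex(new UnicodeEncoding(true, true)));
        Console.WriteLine("UTF-16 LE with BOM: " + Hex(new UnicodeEncoding(false, true)));
        Console.WriteLine("UTF-16 BE, no BOM:  " + Hex(new UnicodeEncoding(true, false)));
        Console.WriteLine("UTF-8 with BOM:     " + Hex(new UTF8Encoding(true)));
    }
}
```

The first two calls yield FE-FF and FF-FE, matching the table of byte order marks earlier, and the no-BOM variant yields an empty string.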
Alternatively, you can use various other classes. For example, the System.IO.StreamWriter class takes an encoding. If it's an email, you can set the encoding through the System.Net.Mail.MailMessage.BodyEncoding and MailMessage.SubjectEncoding properties.
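For the StreamWriter route, here is a minimal sketch that writes text to an in-memory stream and returns the raw bytes, so the preamble can be inspected. The `StreamWriterBomDemo` class and its `Encode` method are invented for this example, and a MemoryStream stands in for whatever output stream you actually use.

```csharp
using System;
using System.IO;
using System.Text;

public static class StreamWriterBomDemo
{
    // Writes text through a StreamWriter using the given encoding and
    // returns every byte that reached the underlying stream.
    public static byte[] Encode(string text, Encoding encoding)
    {
        using (var ms = new MemoryStream())
        {
            using (var writer = new StreamWriter(ms, encoding))
            {
                writer.Write(text);
            } // disposing the writer flushes it and closes the stream

            // MemoryStream.ToArray still works after the stream is closed.
            return ms.ToArray();
        }
    }

    public static void Main()
    {
        byte[] withBom = Encode("hi", new UTF8Encoding(true));
        byte[] withoutBom = Encode("hi", new UTF8Encoding(false));
        Console.WriteLine(BitConverter.ToString(withBom));
        Console.WriteLine(BitConverter.ToString(withoutBom));
    }
}
```

The only difference between the two calls is the `encoderShouldEmitUTF8Identifier` argument of the UTF8Encoding constructor, which decides whether the writer starts the stream with EF BB BF.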
Another thing you will notice above is that I used XmlTextWriter directly. On MSDN, however, you will find the following note for this class:
In the Microsoft .NET Framework version 2.0 release, the recommended practice is to create XmlWriter instances using the System.Xml.XmlWriter.Create method and the XmlWriterSettings class. This allows you to take full advantage of all the new features introduced in this release.
So if you are using the XmlWriter.Create method, use one of the overloads that take an XmlWriterSettings. XmlWriterSettings exposes an Encoding property which you can set.
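A minimal sketch of that approach follows, assuming a MemoryStream as the target instead of Response.OutputStream so that it runs as a console program. The class name `XmlWriterBomDemo`, the `WriteSample` method, and the element names are placeholders.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class XmlWriterBomDemo
{
    // Writes a tiny document with XmlWriter.Create and the given encoding,
    // then returns the raw bytes so the BOM (if any) can be inspected.
    public static byte[] WriteSample(Encoding encoding)
    {
        var settings = new XmlWriterSettings();
        settings.Encoding = encoding;   // controls both the BOM and the declared encoding
        settings.Indent = true;

        using (var ms = new MemoryStream())
        {
            using (XmlWriter w = XmlWriter.Create(ms, settings))
            {
                w.WriteStartDocument();
                w.WriteStartElement("root");
                w.WriteElementString("item", "123");
                w.WriteEndElement();
                w.WriteEndDocument();
            } // disposing the writer flushes it to the stream

            return ms.ToArray();
        }
    }

    public static void Main()
    {
        byte[] bytes = WriteSample(new UTF8Encoding(true));
        // Print the first three bytes, which should be the UTF-8 BOM.
        Console.WriteLine(BitConverter.ToString(bytes, 0, 3));
    }
}
```

Passing `new UTF8Encoding(false)` instead should produce the same document without the leading EF BB BF.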
ASP.NET:
<%@ Page Language="C#" %>
<%@ Import Namespace="System.Xml" %>
<script runat="server">
    protected void Page_Load(object sender, EventArgs e)
    {
        // UTF-8 encoder that emits a BOM and throws on invalid bytes.
        UTF8Encoding ue = new UTF8Encoding(true, true);
        XmlTextWriter w = new XmlTextWriter(Response.OutputStream, ue);
        w.WriteStartDocument();
        w.Formatting = Formatting.Indented;
        w.WriteStartElement("x", "root", "urn:1");
        w.WriteStartElement("y", "item", "urn:1");
        w.WriteAttributeString("attr", "urn:1", "123");
        w.WriteEndElement();
        w.WriteEndElement();

        Encoding unicode = ue;
        // Get the preamble for the encoder.
        // In this case the preamble contains the byte order mark (BOM).
        byte[] preamble = unicode.GetPreamble();

        string s = " No preamble provided";
        // Make sure a preamble was returned
        // and is large enough to contain a BOM.
        if (preamble.Length >= 2)
        {
            s = " Encoding used is " + unicode.BodyName
              + " The Preamble, that is byte order mark is : ";
            s = ShowArray(preamble, s, unicode);
            s += " and in hex format is: ";
            foreach (byte b in preamble)
            {
                s += string.Format("{0:X2} ", b);
            }
            // These two checks only match the UTF-16 byte order marks.
            if (preamble[0] == 0xFE && preamble[1] == 0xFF)
            {
                s += " The Unicode encoder is encoding in" + " big-endian order.";
            }
            else if (preamble[0] == 0xFF && preamble[1] == 0xFE)
            {
                s += " The Unicode encoder is encoding in" + " little-endian order.";
            }
        }
        // The diagnostic text is written first; the buffered XML follows on Flush.
        Response.Write(s.Trim());
        w.Flush();
    }

    // Appends each element of the array to s in the form [value].
    private string ShowArray(Array theArray, string s, Encoding unicode)
    {
        foreach (Object o in theArray)
        {
            s += string.Format("[{0}]", o);
        }
        return s;
    }
</script>
Output:
Encoding used is utf-8 The Preamble, that is byte order mark is : [239][187][191] and in hex format is: EF BB BF
<?xml version="1.0" encoding="utf-8"?>
<x:root xmlns:x="urn:1">
  <y:item y:attr="123" xmlns:y="urn:1" />
</x:root>
Also note that HttpResponse.HeaderEncoding defaults to UTF8Encoding.