Sunday, January 29, 2012

7 Clues to Solve Character Encoding Issues


Every so often, as Tridion CMS content and design are woven together by Editors and Developers, I encounter an unexpected character encoding issue.  The published page has a quirky A or funny U, drawing attention to the rendering rather than to the content itself.
The simple explanation is that the systems the published content passes through on its journey to its final destination disagree on their character encoding settings.  Here are the checkpoints I follow when I play detective, and what I look for at each one to solve the mystery:
1. Publication Target (Tridion): What has been selected for the Publication Target's Default Code Page value?  I check this in the Publication Target properties in the Tridion CMS Admin Panel. By default, it is set to “System Default”, which picks up the code page dictated by the Windows operating system of the publisher machine.  I usually change this to Unicode (UTF-8)[1].
2. Browser: What is the browser using for its character set?  In Internet Explorer, I check View > Encoding and look to see that the Unicode (UTF-8) menu item is selected.  In Firefox, check Options > Content > Fonts & Colors > Advanced > Default Character Encoding.  In Google Chrome, check Options > Under the Hood > Web Content > Customize fonts > Encoding.
3. Java Virtual Machine: Which JVM does the Tridion Deployer run in (for instance, the one used by Tomcat), and what encodings are set there? As of JDK 1.4, you can find out which charsets a particular JVM supports via java.nio.charset.Charset.availableCharsets()[2].
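As a minimal sketch of that check, the class below prints the JVM's default charset and everything `availableCharsets()` reports. The class name `CharsetReport` and the `supports` helper are illustrative only, and `Charset.defaultCharset()` assumes Java 5 or later.

```java
import java.nio.charset.Charset;
import java.util.SortedMap;

class CharsetReport {
    // True if this JVM can encode and decode the named charset.
    static boolean supports(String name) {
        return Charset.isSupported(name);
    }

    public static void main(String[] args) {
        // The charset used by I/O APIs when no explicit encoding is given.
        System.out.println("Default charset: " + Charset.defaultCharset());

        // Every charset this JVM knows, keyed by canonical name.
        SortedMap<String, Charset> all = Charset.availableCharsets();
        System.out.println(all.size() + " charsets available:");
        for (String name : all.keySet()) {
            System.out.println("  " + name);
        }
    }
}
```

Running it with the same JVM and the same startup options as the Deployer is what makes the output meaningful; a different JVM on the same machine may report a different default.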
4. Application JVM: 
·         Is the IDE forcing a specific encoding, for instance Eclipse? 
·         Does every operation that would otherwise rely on the platform default for character I/O specify the correct encoding, for example when reading a file? Reader r = new InputStreamReader(new FileInputStream("myfile"), "UTF-8");
·         Tridion Deployer: if running on a file system, consider running the deployer with the -Dfile.encoding=UTF8 command-line option
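To illustrate the file I/O bullet, here is a small sketch that writes and reads a temporary file with the encoding named on both sides, so the platform default never comes into play. The class name `ExplicitEncodingIO` and the `roundTrip` method are hypothetical, and the code assumes Java 8 or later (try-with-resources, `StandardCharsets`, `UncheckedIOException`).

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

class ExplicitEncodingIO {
    // Writes text to a temp file as UTF-8 and reads it back as UTF-8.
    static String roundTrip(String text) {
        try {
            File f = File.createTempFile("encoding-demo", ".txt");
            f.deleteOnExit();

            // Name the encoding when writing...
            try (Writer w = new OutputStreamWriter(new FileOutputStream(f), StandardCharsets.UTF_8)) {
                w.write(text);
            }

            // ...and name the same encoding when reading.
            StringBuilder sb = new StringBuilder();
            try (Reader r = new InputStreamReader(new FileInputStream(f), StandardCharsets.UTF_8)) {
                char[] buf = new char[1024];
                int n;
                while ((n = r.read(buf)) != -1) {
                    sb.append(buf, 0, n);
                }
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Unicode escapes keep the source file itself independent of the IDE's encoding.
        String original = "\u00DCn\u00EFc\u00F6d\u00E9";
        System.out.println("Round trip intact: " + roundTrip(original).equals(original));
    }
}
```

The Unicode escapes in the string literal sidestep the IDE question from the first bullet: the source compiles to the same characters whatever encoding the editor saves it in.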
5. Web Servers: Further along the trail, and despite all of the above, most web servers are happily unaware of any encodings or treat the communication channel as ISO-8859-1, so the next checkpoint, really two in one, is the web server itself, such as IIS, Tomcat or Sun Java System Application Server.  Did you know that, depending on the web server, GET and POST requests can be treated differently by the same server?  Beware: these settings are server dependent. While Sun’s JSAS treats both GET and POST the same based on one configuration, Tomcat may not, and IIS expects the individual settings to be specified[3].
·         IIS/.NET web.config: <globalization fileEncoding="UTF-8" requestEncoding="UTF-8" responseEncoding="UTF-8"/>
·         Tomcat server.xml: set URIEncoding="UTF-8" on the <Connector> element
·         Sun Java System Application Server sun-web.xml: include <parameter-encoding default-charset="UTF-8"/>
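To see why a channel that assumes ISO-8859-1 produces the "quirky A", the sketch below encodes text as UTF-8 and then decodes the same bytes as ISO-8859-1, which is what a misconfigured server effectively does. The class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Simulates a channel that receives UTF-8 bytes but decodes them as ISO-8859-1.
    static String misread(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Reverses the damage: recover the original bytes, then decode them as UTF-8.
    static String repair(String garbled) {
        byte[] bytes = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "caf\u00E9";          // "cafe" with an acute accent on the e
        String garbled = misread(original);     // the accented e becomes U+00C3 U+00A9
        System.out.println(original.length() + " chars became " + garbled.length());
        System.out.println("Repaired correctly: " + repair(garbled).equals(original));
    }
}
```

The accented character occupies two bytes in UTF-8 (C3 A9), and ISO-8859-1 turns each byte into its own character, hence the stray capital A with a tilde in front of a second symbol.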
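6. Page level: the page itself can declare its encoding. In .NET and JSP, the directive overrides the server default used for the HTTP header; in HTML and XML, the in-document declaration is generally consulted only when the HTTP header names no charset. Check: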
·         HTML
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /> 
·         .NET
<%@ Page ResponseEncoding="utf-8" %>
·         Java/JSP
<%@page pageEncoding="UTF-8"%>
<%@page contentType="text/html;charset=UTF-8"%>
request.setCharacterEncoding("UTF-8");
·         XML
<?xml version="1.0" encoding="UTF-8"?>
7. Create your own abstraction layer to interact with the CM, which can also override server settings.  If step 5 has given you visions of long and dark nights bravely searching your server’s documentation for that minuscule setting, there is light at the end of the tunnel.  Put your magnifying glass away.  It is possible to establish a server-independent encoding layer.
Consider setting a context parameter in WEB-INF/web.xml and propagating it throughout your code: read it before any other parameters and pass it along in extensions of the request object for both GET and POST methods.
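The decoding half of such a layer can be sketched without the servlet API. The class `EncodingAwareParams` below is hypothetical: it takes the encoding name as a constructor argument, which in a real web application would presumably come from the context parameter via `ServletContext.getInitParameter`, and decodes a raw query string or form-encoded POST body with it. It keeps only the last value for a repeated parameter name, which a fuller version would need to handle.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.LinkedHashMap;
import java.util.Map;

class EncodingAwareParams {
    private final String encoding;

    // 'encoding' is the configured value, e.g. "UTF-8" (assumed to come from web.xml).
    EncodingAwareParams(String encoding) {
        this.encoding = encoding;
    }

    // Parses "a=1&b=2" style data, as found in a GET query string or a
    // form-encoded POST body, using the configured encoding for both.
    Map<String, String> parse(String raw) {
        Map<String, String> params = new LinkedHashMap<>();
        if (raw == null || raw.isEmpty()) {
            return params;
        }
        for (String pair : raw.split("&")) {
            if (pair.isEmpty()) {
                continue; // tolerate "a=1&&b=2"
            }
            int eq = pair.indexOf('=');
            String name = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            params.put(decode(name), decode(value));
        }
        return params;
    }

    private String decode(String s) {
        try {
            return URLDecoder.decode(s, encoding);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unsupported encoding: " + encoding, e);
        }
    }

    public static void main(String[] args) {
        EncodingAwareParams params = new EncodingAwareParams("UTF-8");
        Map<String, String> parsed = params.parse("name=caf%C3%A9&q=a+b");
        System.out.println("Parameters decoded: " + parsed.keySet());
    }
}
```

Because GET and POST data pass through the same `parse` call with the same configured encoding, the two can no longer drift apart the way they can when each is left to a separate server setting.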
Here’s hoping data fidelity serves you well, and happy encoding.


[1] http://sdllivecontent.sdl.com/LiveContent/content/en-US/SDL_Tridion_2011/concept_879633C70905448885956711778D2C0E