----------------------------------------
an extract from an article by Sarah Ellis
-----------------------------------------
In this era of globalization, the ability of systems to handle data from
around the world is becoming paramount. However, workstations and servers can use
different code pages, depending on the native language of the workstation user. In
effect, the workstations and servers are speaking “different languages”, and this makes
communication difficult.
For example, if a workstation inserts some data into a DB2 for z/OS system, the data
is converted from ASCII to EBCDIC using a conversion table, which maps the code
points from the source (ASCII) CCSID to the target (EBCDIC) CCSID.
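As a rough illustration of such a conversion table, the sketch below builds one in Python. It assumes Python's built-in `cp037` codec as a stand-in for the target EBCDIC CCSID; the names `ASCII_TO_EBCDIC` and `ascii_to_ebcdic` are hypothetical, and this is not how DB2 itself performs the conversion.

```python
# Assumption: Python's "cp037" codec stands in for the target EBCDIC CCSID.
EBCDIC_SUB = 0x3F  # EBCDIC substitute character, used here for non-ASCII input bytes

# One target byte per possible source byte (bytes.translate needs 256 entries).
ASCII_TO_EBCDIC = bytes(
    chr(code_point).encode("cp037")[0] if code_point < 128 else EBCDIC_SUB
    for code_point in range(256)
)

def ascii_to_ebcdic(data: bytes) -> bytes:
    """Convert ASCII bytes to EBCDIC by looking each byte up in the table."""
    return data.translate(ASCII_TO_EBCDIC)

print(ascii_to_ebcdic(b"aA1").hex())  # 81c1f1
```

Every byte costs one table lookup, which is the conversion cost that applies whenever the source and target CCSIDs differ.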
In addition to a conversion cost, a more serious issue is the potential loss of
characters. For example, if a Japanese workstation were inserting data into a
European DB2 system, many characters would not have a code point in the CCSID used
by DB2. Either the characters must be lost (enforced subset conversions) or DB2 must
map them to code points that are not already used (a round trip conversion). The
problem with the second option is that another system reading the data will not know
about this mapping and may not read the data correctly, perhaps mapping the
characters to some of its own characters.
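The loss of characters can be reproduced in a small Python sketch. It assumes the `cp037` codec as an example of a Western EBCDIC code page and uses Python's `errors="replace"` handler to approximate a conversion that drops unmappable characters; DB2's actual substitution behaviour may differ.

```python
text = "日本語"  # Japanese characters with no code point in a Western EBCDIC code page

# A strict conversion refuses to proceed rather than lose data.
try:
    text.encode("cp037")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# A lossy conversion substitutes "?" for every character it cannot map.
lossy = text.encode("cp037", errors="replace")
round_tripped = lossy.decode("cp037")

print(strict_ok)      # False
print(round_tripped)  # ???
```

Once the substitution has happened, the original characters cannot be recovered from the stored bytes.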
The design objective of Unicode is to avoid these issues by having a single code page
that has a code point mapping for every character in the world. The Unicode
Consortium maintains this character set, which includes unique code points for most
current and historical languages and for mathematical and scientific symbols, and
which can be extended as new characters emerge. It has also devised a number of
Unicode Transformation Formats (UTFs), such as UTF-8 and UTF-16, which define how
those code points are stored as bytes. Unicode and its UTFs have become widely
accepted, being used by technologies such as Java, XML and LDAP.
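To make the idea of a transformation format concrete, the Python sketch below serializes one Unicode code point in three UTFs. The choice of the euro sign is arbitrary, and the big-endian variants are used only so that no byte order mark is added.

```python
ch = "€"
code_point = ord(ch)  # the Unicode code point, U+20AC

utf8 = ch.encode("utf-8")       # variable length, 1 to 4 bytes per character
utf16 = ch.encode("utf-16-be")  # 2 or 4 bytes per character
utf32 = ch.encode("utf-32-be")  # always 4 bytes per character

print(f"U+{code_point:04X}")  # U+20AC
print(utf8.hex())             # e282ac
print(utf16.hex())            # 20ac
print(utf32.hex())            # 000020ac
```

The code point is the same in every case; only the byte representation changes with the UTF.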
Many consider Unicode the foundation for the globalization of data, and it is becoming a
strategic direction for many companies. For example, Microsoft has adopted Unicode
in products such as Word, storing data in Unicode and providing Unicode APIs
for ODBC.
Note:
Unicode only affects character data, or numeric data stored as characters,
i.e. CHAR, VARCHAR, GRAPHIC, VARGRAPHIC, CLOB, DBCLOB. Numeric data stored as
binary, packed or floating point is not affected.
Wednesday, January 24, 2007
What is an encoding scheme?
An encoding scheme is a collection of code pages (CCSIDs) for various languages used on a particular computing platform. For example, the EBCDIC encoding scheme is used on z/OS and iSeries systems. The ASCII encoding scheme is used on Intel-based (Windows) systems and Unix-based systems.
What is CCSID - Coded Character Set IDentifier?
A CCSID is a number that identifies a particular code page. For example, North Americans use the US-English code page denoted by CCSID 037. Germans use CCSID 273, which includes code points for characters specific to their language, such as letters with umlauts. Other examples are 1252, an ASCII CCSID used on the Windows platform, and 1208, which represents the Unicode transformation format UTF-8.
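The effect of choosing a CCSID can be sketched in Python by mapping the four CCSIDs mentioned here to codec names. The mapping is an assumption made for illustration: Python's `cp037`, `cp273`, `cp1252` and `utf-8` codecs are treated as close equivalents of the IBM CCSIDs, and `CCSID_TO_CODEC` and `encode_for_ccsid` are hypothetical names, not part of any DB2 API.

```python
# Assumed correspondence between IBM CCSIDs and Python codec names.
CCSID_TO_CODEC = {
    37: "cp037",     # US-English EBCDIC
    273: "cp273",    # German EBCDIC (available in Python 3.4+)
    1252: "cp1252",  # Windows Latin-1
    1208: "utf-8",   # Unicode UTF-8
}

def encode_for_ccsid(text: str, ccsid: int) -> bytes:
    """Encode text using the code page assumed to match the given CCSID."""
    return text.encode(CCSID_TO_CODEC[ccsid])

# The same character is stored as different bytes under each CCSID.
for ccsid in CCSID_TO_CODEC:
    print(ccsid, encode_for_ccsid("ü", ccsid).hex())
```

Data is only interpreted correctly if the reader uses the same CCSID as the writer, or converts between the two.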
What are Code points?
All data is stored as bytes. For example, in our DB2 for z/OS systems (which use EBCDIC), the character 'a' is stored as X'81', the character 'A' is stored as X'C1' and the character representation of the number '1' is X'F1'.
These byte representations for characters are called code points.
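These values can be checked with a short Python sketch. It assumes the `cp037` codec represents the EBCDIC code page in use, and the helper name `code_point_hex` is hypothetical.

```python
def code_point_hex(ch: str, codec: str) -> str:
    """Return the byte(s) stored for one character in DB2-style X'..' notation."""
    return "X'" + ch.encode(codec).hex().upper() + "'"

for ch in "aA1":
    print(ch, code_point_hex(ch, "cp037"), code_point_hex(ch, "ascii"))
# a X'81' X'61'
# A X'C1' X'41'
# 1 X'F1' X'31'
```

The second column shows the EBCDIC code point and the third the ASCII code point for the same character.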
Sunday, January 07, 2007