X-Git-Url: https://oss.titaniummirror.com/gitweb/?a=blobdiff_plain;f=libstdc%2B%2B-v3%2Fdoc%2Fxml%2Fmanual%2Fcodecvt.xml;fp=libstdc%2B%2B-v3%2Fdoc%2Fxml%2Fmanual%2Fcodecvt.xml;h=c836f9d0a53e711bd341925153b2a365c7176b4b;hb=6fed43773c9b0ce596dca5686f37ac3fc0fa11c0;hp=0000000000000000000000000000000000000000;hpb=27b11d56b743098deb193d510b337ba22dc52e5c;p=msp430-gcc.git diff --git a/libstdc++-v3/doc/xml/manual/codecvt.xml b/libstdc++-v3/doc/xml/manual/codecvt.xml new file mode 100644 index 00000000..c836f9d0 --- /dev/null +++ b/libstdc++-v3/doc/xml/manual/codecvt.xml @@ -0,0 +1,730 @@ + + + + + + + ISO C++ + + + codecvt + + + + +codecvt + + +The standard class codecvt attempts to address conversions between +different character encoding schemes. In particular, the standard +attempts to detail conversions between the implementation-defined wide +characters (hereafter referred to as wchar_t) and the standard type +char that is so beloved in classic C (which can now be +referred to as narrow characters.) This document attempts to describe +how the GNU libstdc++ implementation deals with the conversion between +wide and narrow characters, and also presents a framework for dealing +with the huge number of other encodings that iconv can convert, +including Unicode and UTF8. Design issues and requirements are +addressed, and examples of correct usage for both the required +specializations for wide and narrow characters and the +implementation-provided extended functionality are given. + + + +Requirements + + +Around page 425 of the C++ Standard, this charming heading comes into view: + + +
+ +22.2.1.5 - Template class codecvt + +
+ + +The text around the codecvt definition gives some clues: + + +
+ + +-1- The class codecvt<internT,externT,stateT> is for use when +converting from one codeset to another, such as from wide characters +to multibyte characters, between wide character encodings such as +Unicode and EUC. + + +
+ + +Hmm. So, in some unspecified way, Unicode encodings and +translations between other character sets should be handled by this +class. + + +
+ + +-2- The stateT argument selects the pair of codesets being mapped between. + + +
+ + +Ah ha! Another clue... + + +
+ + +-3- The instantiations required in the Table ?? +(lib.locale.category), namely codecvt<wchar_t,char,mbstate_t> and +codecvt<char,char,mbstate_t>, convert the implementation-defined +native character set. codecvt<char,char,mbstate_t> implements a +degenerate conversion; it does not convert at +all. codecvt<wchar_t,char,mbstate_t> converts between the native +character sets for tiny and wide characters. Instantiations on +mbstate_t perform conversion between encodings known to the library +implementor. Other encodings can be converted by specializing on a +user-defined stateT type. The stateT object can contain any state that +is useful to communicate to or from the specialized do_convert member. + + +
+ + +At this point, a couple of points become clear: + + + +One: The standard clearly implies that attempts to add non-required +(yet useful and widely used) conversions need to do so through the +third template parameter, stateT. + + +Two: The required conversions, by specifying mbstate_t as the third +template parameter, imply an implementation strategy that is mostly +(or wholly) based on the underlying C library, and the functions +mbsrtowcs and wcsrtombs in particular. +
+ + +Design + + + <type>wchar_t</type> Size + + + The simple implementation detail of wchar_t's size seems to + repeatedly confound people. Many systems use a two-byte, + unsigned integral type to represent wide characters, and use an + internal encoding of Unicode or UCS2. (See AIX, Microsoft NT, + Java, others.) Other systems use a four-byte, unsigned integral + type to represent wide characters, and use an internal encoding + of UCS4. (GNU/Linux systems using glibc, in particular.) The C + programming language (and thus C++) does not specify a + size for the type wchar_t. + + + + Thus, portable C++ code cannot assume a byte size (or endianness) either. + + + + + Support for Unicode + + Probably the most frequently asked question about code conversion + is: "So dudes, what's the deal with Unicode strings?" + The dude part is optional, but apparently the usefulness of + Unicode strings is pretty widely appreciated. Sadly, this specific + encoding (and other useful encodings like UTF8, UCS4, ISO 8859-10, + etc.) is not mentioned in the C++ standard. + + + + A couple of comments: + + + + The thought that all one needs to convert between two arbitrary + codesets is two types and some kind of state argument is + unfortunate. In particular, encodings may be stateless. The naming + of the third parameter as stateT is unfortunate, as what is really + needed is some kind of generalized type that accounts for the + issues that abstract encodings will need. The minimum information + that is required includes: + + + + + + Identifiers for each of the codesets involved in the + conversion. For example, using the iconv family of functions + from the Single Unix Specification (what used to be called + X/Open) hosted on the GNU/Linux operating system allows + bi-directional mapping between far more than the following + tantalizing possibilities: + + + + (An edited list taken from `iconv --list` on a + Red Hat 6.2/Intel system: + + +
+ +8859_1, 8859_9, 10646-1:1993, 10646-1:1993/UCS4, ARABIC, ARABIC7, +ASCII, EUC-CN, EUC-JP, EUC-KR, EUC-TW, GREEK-CCITT, GREEK, GREEK7-OLD, +GREEK7, GREEK8, HEBREW, ISO-8859-1, ISO-8859-2, ISO-8859-3, +ISO-8859-4, ISO-8859-5, ISO-8859-6, ISO-8859-7, ISO-8859-8, +ISO-8859-9, ISO-8859-10, ISO-8859-11, ISO-8859-13, ISO-8859-14, +ISO-8859-15, ISO-10646, ISO-10646/UCS2, ISO-10646/UCS4, +ISO-10646/UTF-8, ISO-10646/UTF8, SHIFT-JIS, SHIFT_JIS, UCS-2, UCS-4, +UCS2, UCS4, UNICODE, UNICODEBIG, UNICODELITTLE, US-ASCII, US, UTF-8, +UTF-16, UTF8, UTF16). + +
+ + +For iconv-based implementations, string literals for each of the +encodings (e.g. "UCS-2" and "UTF-8") are necessary, +although for other, +non-iconv implementations a table of enumerated values or some other +mechanism may be required. + +
+ + + Maximum length of the identifying string literal. + + + + Some encodings require explicit endianness. As such, some kind + of endian marker or other byte-order marker will be necessary. See + "Footnotes for C/C++ developers" in Haible for more information on + UCS-2/Unicode endian issues. (Summary: big endian seems most likely, + however implementations, most notably Microsoft, vary.) + + + + Types representing the conversion state, for conversions involving + the machinery in the "C" library, or the conversion descriptor, for + conversions using iconv (such as the type iconv_t.) Note that the + conversion descriptor encodes more information than a simple encoding + state type. + + + + Conversion descriptors for both directions of encoding. (i.e., both + UCS-2 to UTF-8 and UTF-8 to UCS-2.) + + + + Something to indicate whether the conversion requested is valid. + + + + Something to represent whether the conversion descriptors are valid. + + + + Some way to enforce strict type checking on the internal and + external types. As part of this, the size of the internal and + external types will need to be known. + +
+
+ + + Other Issues + +In addition, multi-threaded and multi-locale environments also impact +the design and requirements for code conversions. In particular, they +affect the required specialization codecvt<wchar_t, char, mbstate_t> +when implemented using standard "C" functions. + + + +Three problems arise, one big, one of medium importance, and one small. + + + +First, the small: mbsrtowcs and wcsrtombs may not be multithread-safe +on all systems required by the GNU tools. For GNU/Linux and glibc, +this is not an issue. + + + +Of medium concern, in the grand scope of things, is that the functions +used to implement this specialization work on null-terminated +strings. Buffers, especially file buffers, may not be null-terminated, +thus giving conversions that end prematurely or are otherwise +incorrect. Yikes! + + + +The last, and fundamental problem, is the assumption of a global +locale for all the "C" functions referenced above. For something like +C++ iostreams (where codecvt is explicitly used) the notion of +multiple locales is fundamental. In practice, most users may not run +into this limitation. However, as a quality of implementation issue, +the GNU C++ library would like to offer a solution that allows +multiple locales and/or simultaneous usage with computationally +correct results. In short, libstdc++ is trying to offer, as an +option, a high-quality implementation, damn the additional complexity! + + + +For the required specialization codecvt<wchar_t, char, mbstate_t> , +conversions are made between the internal character set (always UCS4 +on GNU/Linux) and whatever the currently selected locale for the +LC_CTYPE category implements. + + + + +
+ + +Implementation + + +The two required specializations are implemented as follows: + + + + +codecvt<char, char, mbstate_t> + + + +This is a degenerate (i.e., does nothing) specialization. Implementing +this was a piece of cake. + + + + +codecvt<wchar_t, char, mbstate_t> + + + + +This specialization, by specifying all the template parameters, pretty +much ties the hands of implementors. As such, the implementation is +straightforward, involving mbsrtowcs for the conversions from char +to wchar_t and wcsrtombs for conversions from wchar_t to char. + + + +Neither of these two required specializations deals with Unicode +characters. As such, libstdc++ implements a partial specialization +of the codecvt class with an iconv wrapper class, encoding_state, as the +third template parameter. + + + +This implementation should be standards conformant. First of all, the +standard explicitly points out that instantiations on the third +template parameter, stateT, are the proper way to implement +non-required conversions. Second of all, the standard says (in Chapter +17) that partial specializations of required classes are a-ok. Third +of all, the requirements for the stateT type elsewhere in the standard +(see 21.1.2 traits typedefs) only indicate that this type be copy +constructible. + + + +As such, the type encoding_state is defined as a non-templatized, POD +type to be used as the third type of a codecvt instantiation. This +type is just a wrapper class for iconv, and provides an easy interface +to iconv functionality. + + + +There are two constructors for encoding_state: + + + + +encoding_state() : __in_desc(0), __out_desc(0) + + + +This default constructor sets the internal encoding to some default +(currently UCS4) and the external encoding to whatever is returned by +nl_langinfo(CODESET). 
+ + + + +encoding_state(const char* __int, const char* __ext) + + + + +This constructor takes as parameters string literals that indicate the +desired internal and external encoding. There are no defaults for +either argument. + + + +One of the issues with iconv is that the string literals identifying +conversions are not standardized. Because of this, the thought of +mandating and/or enforcing some set of pre-determined valid +identifiers seems iffy: thus, a more practical (and non-migraine +inducing) strategy was implemented: end-users can specify any string +(subject to a pre-determined length qualifier, currently 32 bytes) for +encodings. It is up to the user to make sure that these strings are +valid on the target system. + + + + +void +_M_init() + + + +Strangely enough, this member function attempts to open conversion +descriptors for a given encoding_state object. If the encoding +identifiers are not valid, the conversion descriptors returned will +not be valid and the resulting calls to the codecvt conversion +functions will return an error. + + + + +bool +_M_good() + + + + +Provides a way to see if the given encoding_state object has been +properly initialized. If the string literals describing the desired +internal and external encoding are not valid, initialization will +fail, and this will return false. If the internal and external +encodings are valid, but iconv_open could not allocate conversion +descriptors, this will also return false. Otherwise, the object is +ready to convert and will return true. + + + + +encoding_state(const encoding_state&) + + + + +As iconv allocates memory and sets up conversion descriptors, the copy +constructor can only copy the member data pertaining to the internal +and external code conversions, and not the conversion descriptors +themselves. 
+ + + +Definitions for all the required codecvt member functions are provided +for this specialization, and usage of codecvt<internal character type, +external character type, encoding_state> is consistent with other +codecvt usage. + + + + + +Use +A conversion involving a string literal. + + + typedef codecvt_base::result result; + typedef unsigned short unicode_t; + typedef unicode_t int_type; + typedef char ext_type; + typedef encoding_state state_type; + typedef codecvt<int_type, ext_type, state_type> unicode_codecvt; + + const ext_type* e_lit = "black pearl jasmine tea"; + int size = strlen(e_lit); + int_type i_lit_base[24] = + { 25088, 27648, 24832, 25344, 27392, 8192, 28672, 25856, 24832, 29184, + 27648, 8192, 27136, 24832, 29440, 27904, 26880, 28160, 25856, 8192, 29696, + 25856, 24832, 2560 + }; + const int_type* i_lit = i_lit_base; + const ext_type* efrom_next; + const int_type* ifrom_next; + ext_type* e_arr = new ext_type[size + 1]; + ext_type* eto_next; + int_type* i_arr = new int_type[size + 1]; + int_type* ito_next; + + // construct a locale object with the specialized facet. + locale loc(locale::classic(), new unicode_codecvt); + // sanity check the constructed locale has the specialized facet. + VERIFY( has_facet<unicode_codecvt>(loc) ); + const unicode_codecvt& cvt = use_facet<unicode_codecvt>(loc); + // convert between const char* and unicode strings + unicode_codecvt::state_type state01("UNICODE", "ISO_8859-1"); + initialize_state(state01); + result r1 = cvt.in(state01, e_lit, e_lit + size, efrom_next, + i_arr, i_arr + size, ito_next); + VERIFY( r1 == codecvt_base::ok ); + VERIFY( !int_traits::compare(i_arr, i_lit, size) ); + VERIFY( efrom_next == e_lit + size ); + VERIFY( ito_next == i_arr + size ); + + + + + +Future + + + + a. things that are sketchy, or remain unimplemented: + do_encoding, max_length and length member functions + are only weakly implemented. I have no idea how to do + this correctly, and in a generic manner. Nathan? 
+ + + + + + b. conversions involving std::string + + + + how should operators != and == work for strings of + different/same encoding? + + + + what is equal? A byte by byte comparison or an + encoding then byte comparison? + + + + conversions between narrow, wide, and unicode strings + + + + + c. conversions involving std::filebuf and std::ostream + + + + how to initialize the state object in a + standards-conformant manner? + + + + how to synchronize the "C" and "C++" + conversion information? + + + + wchar_t/char internal buffers and conversions between + internal/external buffers? + + + + + + + + +Bibliography + + + + The GNU C Library + + + + McGrath + Roland + + + Drepper + Ulrich + + + + 2007 + FSF + + Chapters 6 Character Set Handling and 7 Locales and Internationalization + + + + + + Correspondence + + + + Drepper + Ulrich + + + + 2002 + + + + + + + ISO/IEC 14882:1998 Programming languages - C++ + + + + 1998 + ISO + + + + + + ISO/IEC 9899:1999 Programming languages - C + + + + 1999 + ISO + + + + + + System Interface Definitions, Issue 6 (IEEE Std. 1003.1-200x) + + + + 1999 + + The Open Group/The Institute of Electrical and Electronics Engineers, Inc. + + + + + + + + + + + + The C++ Programming Language, Special Edition + + + + Stroustrup + Bjarne + + + + 2000 + Addison Wesley, Inc. + + Appendix D + + + + Addison Wesley + + + + + + + + + Standard C++ IOStreams and Locales + + + Advanced Programmer's Guide and Reference + + + + Langer + Angelika + + + + Kreft + Klaus + + + + 2000 + Addison Wesley Longman, Inc. + + + + + Addison Wesley Longman + + + + + + + + A brief description of Normative Addendum 1 + + + + Feather + Clive + + + Extended Character Sets + + + + + + + + + + The Unicode HOWTO + + + + Haible + Bruno + + + + + + + + + + + + UTF-8 and Unicode FAQ for Unix/Linux + + + + Kuhn + Markus + + + + + + + + + + + + +