X-Git-Url: https://oss.titaniummirror.com/gitweb/?a=blobdiff_plain;f=libstdc%2B%2B-v3%2Fdocs%2Fhtml%2F22_locale%2Fcodecvt.html;fp=libstdc%2B%2B-v3%2Fdocs%2Fhtml%2F22_locale%2Fcodecvt.html;h=0000000000000000000000000000000000000000;hb=6fed43773c9b0ce596dca5686f37ac3fc0fa11c0;hp=6acd416fc076e146da59ac3ed20dd93b5b21a76e;hpb=27b11d56b743098deb193d510b337ba22dc52e5c;p=msp430-gcc.git diff --git a/libstdc++-v3/docs/html/22_locale/codecvt.html b/libstdc++-v3/docs/html/22_locale/codecvt.html deleted file mode 100644 index 6acd416f..00000000 --- a/libstdc++-v3/docs/html/22_locale/codecvt.html +++ /dev/null @@ -1,590 +0,0 @@ - - - - - - - - - - Notes on the codecvt implementation. - - - -

- Notes on the codecvt implementation. -

- -prepared by Benjamin Kosnik (bkoz@redhat.com) on August 28, 2000 - -

- -

-1. Abstract -

-The standard class codecvt attempts to address conversions between -different character encoding schemes. In particular, the standard -attempts to detail conversions between the implementation-defined wide -characters (hereafter referred to as wchar_t) and the standard type -char that is so beloved in classic "C" (which can now be referred to -as narrow characters.) This document attempts to describe how the GNU -libstdc++-v3 implementation deals with the conversion between wide and -narrow characters, and also presents a framework for dealing with the -huge number of other encodings that iconv can convert, including -Unicode and UTF8. Design issues and requirements are addressed, and -examples of correct usage for both the required specializations for -wide and narrow characters and the implementation-provided extended -functionality are given. -

- -

-2. What the standard says -

-Around page 425 of the C++ Standard, this charming heading comes into view: - -

-22.2.1.5 - Template class codecvt [lib.locale.codecvt] -

- -The text around the codecvt definition gives some clues: - -

- --1- The class codecvt<internT,externT,stateT> is for use when -converting from one codeset to another, such as from wide characters -to multibyte characters, between wide character encodings such as -Unicode and EUC. - -

- -

-Hmm. So, in some unspecified way, Unicode encodings and -translations between other character sets should be handled by this -class. -

- -

- --2- The stateT argument selects the pair of codesets being mapped between. - -

- -

-Ah ha! Another clue... -

- -

- --3- The instantiations required in the Table ?? -(lib.locale.category), namely codecvt<wchar_t,char,mbstate_t> and -codecvt<char,char,mbstate_t>, convert the implementation-defined -native character set. codecvt<char,char,mbstate_t> implements a -degenerate conversion; it does not convert at -all. codecvt<wchar_t,char,mbstate_t> converts between the native -character sets for tiny and wide characters. Instantiations on -mbstate_t perform conversion between encodings known to the library -implementor. Other encodings can be converted by specializing on a -user-defined stateT type. The stateT object can contain any state that -is useful to communicate to or from the specialized do_convert member. - -

- -

-At this point, a couple points become clear: -

- -

-One: The standard clearly implies that attempts to add non-required -(yet useful and widely used) conversions need to do so through the -third template parameter, stateT.

- -

-Two: The required conversions, by specifying mbstate_t as the third -template parameter, imply an implementation strategy that is mostly -(or wholly) based on the underlying C library, and the functions -mcsrtombs and wcsrtombs in particular.

- -

-3. Some thoughts on what would be useful -

-Probably the most frequently asked question about code conversion is: -"So dudes, what's the deal with Unicode strings?" The dude part is -optional, but apparently the usefulness of Unicode strings is pretty -widely appreciated. Sadly, this specific encoding (And other useful -encodings like UTF8, UCS4, ISO 8859-10, etc etc etc) are not mentioned -in the C++ standard. - -

-In particular, the simple implementation detail of wchar_t's size -seems to repeatedly confound people. Many systems use a two byte, -unsigned integral type to represent wide characters, and use an -internal encoding of Unicode or UCS2. (See AIX, Microsoft NT, Java, -others.) Other systems, use a four byte, unsigned integral type to -represent wide characters, and use an internal encoding of -UCS4. (GNU/Linux systems using glibc, in particular.) The C -programming language (and thus C++) does not specify a specific size -for the type wchar_t. -

- -

-Thus, portable C++ code cannot assume a byte size (or endianness) either. -

- -

-Getting back to the frequently asked question: What about Unicode strings? -

- -

-What magic spell will do this conversion? -

- -

-A couple of comments: -

- -

-The thought that all one needs to convert between two arbitrary -codesets is two types and some kind of state argument is -unfortunate. In particular, encodings may be stateless. The naming of -the third parameter as stateT is unfortunate, as what is really needed -is some kind of generalized type that accounts for the issues that -abstract encodings will need. The minimum information that is required -includes: -

- -

-
- Identifiers for each of the codesets involved in the conversion. For -example, using the iconv family of functions from the Single Unix -Specification (what used to be called X/Open) hosted on the GNU/Linux -operating system allows bi-directional mapping between far more than -the following tantalizing possibilities: -
- -
-(An edited list taken from `iconv --list` on a Red Hat 6.2/Intel system: -
- -
-
```
-8859_1, 8859_9, 10646-1:1993, 10646-1:1993/UCS4, ARABIC, ARABIC7,
-ASCII, EUC-CN, EUC-JP, EUC-KR, EUC-TW, GREEK-CCIcode, GREEK, GREEK7-OLD,
-GREEK7, GREEK8, HEBREW, ISO-8859-1, ISO-8859-2, ISO-8859-3,
-ISO-8859-4, ISO-8859-5, ISO-8859-6, ISO-8859-7, ISO-8859-8,
-ISO-8859-9, ISO-8859-10, ISO-8859-11, ISO-8859-13, ISO-8859-14,
-ISO-8859-15, ISO-10646, ISO-10646/UCS2, ISO-10646/UCS4,
-ISO-10646/UTF-8, ISO-10646/UTF8, SHIFT-JIS, SHIFT_JIS, UCS-2, UCS-4,
-UCS2, UCS4, UNICODE, UNICODEBIG, UNICODELIcodeLE, US-ASCII, US, UTF-8,
-UTF-16, UTF8, UTF16).
-
```
-
- -
-For iconv-based implementations, string literals for each of the -encodings (ie. "UCS-2" and "UTF-8") are necessary, -although for other, -non-iconv implementations a table of enumerated values or some other -mechanism may be required. -
-
- Maximum length of the identifying string literal. -
- Some encodings are require explicit endian-ness. As such, some kind - of endian marker or other byte-order marker will be necessary. See - "Footnotes for C/C++ developers" in Haible for more information on - UCS-2/Unicode endian issues. (Summary: big endian seems most likely, - however implementations, most notably Microsoft, vary.) -
- Types representing the conversion state, for conversions involving - the machinery in the "C" library, or the conversion descriptor, for - conversions using iconv (such as the type iconv_t.) Note that the - conversion descriptor encodes more information than a simple encoding - state type. -
- Conversion descriptors for both directions of encoding. (ie, both - UCS-2 to UTF-8 and UTF-8 to UCS-2.) -
- Something to indicate if the conversion requested if valid. -
- Something to represent if the conversion descriptors are valid. -
- Some way to enforce strict type checking on the internal and - external types. As part of this, the size of the internal and - external types will need to be known. -

- -

-4. Problems with "C" code conversions : thread safety, global -locales, termination. -

- -In addition, multi-threaded and multi-locale environments also impact -the design and requirements for code conversions. In particular, they -affect the required specialization codecvt<wchar_t, char, mbstate_t> -when implemented using standard "C" functions. - -

-Three problems arise, one big, one of medium importance, and one small. -

- -

-First, the small: mcsrtombs and wcsrtombs may not be multithread-safe -on all systems required by the GNU tools. For GNU/Linux and glibc, -this is not an issue. -

- -

-Of medium concern, in the grand scope of things, is that the functions -used to implement this specialization work on null-terminated -strings. Buffers, especially file buffers, may not be null-terminated, -thus giving conversions that end prematurely or are otherwise -incorrect. Yikes! -

- -

-The last, and fundamental problem, is the assumption of a global -locale for all the "C" functions referenced above. For something like -C++ iostreams (where codecvt is explicitly used) the notion of -multiple locales is fundamental. In practice, most users may not run -into this limitation. However, as a quality of implementation issue, -the GNU C++ library would like to offer a solution that allows -multiple locales and or simultaneous usage with computationally -correct results. In short, libstdc++-v3 is trying to offer, as an -option, a high-quality implementation, damn the additional complexity! -

- -

-For the required specialization codecvt<wchar_t, char, mbstate_t> , -conversions are made between the internal character set (always UCS4 -on GNU/Linux) and whatever the currently selected locale for the -LC_CTYPE category implements. -

- -

-5. Design -

-The two required specializations are implemented as follows: - -

--codecvt<char, char, mbstate_t> - -

-This is a degenerate (ie, does nothing) specialization. Implementing -this was a piece of cake. -

- -

--codecvt<char, wchar_t, mbstate_t> - -

-This specialization, by specifying all the template parameters, pretty -much ties the hands of implementors. As such, the implementation is -straightforward, involving mcsrtombs for the conversions between char -to wchar_t and wcsrtombs for conversions between wchar_t and char. -

- -

-Neither of these two required specializations deals with Unicode -characters. As such, libstdc++-v3 implements a partial specialization -of the codecvt class with and iconv wrapper class, __enc_traits as the -third template parameter. -

- -

-This implementation should be standards conformant. First of all, the -standard explicitly points out that instantiations on the third -template parameter, stateT, are the proper way to implement -non-required conversions. Second of all, the standard says (in Chapter -17) that partial specializations of required classes are a-ok. Third -of all, the requirements for the stateT type elsewhere in the standard -(see 21.1.2 traits typedefs) only indicate that this type be copy -constructible. -

- -

-As such, the type __enc_traits is defined as a non-templatized, POD -type to be used as the third type of a codecvt instantiation. This -type is just a wrapper class for iconv, and provides an easy interface -to iconv functionality. -

- -

-There are two constructors for __enc_traits: -

- -

--__enc_traits() : __in_desc(0), __out_desc(0) - -

-This default constructor sets the internal encoding to some default -(currently UCS4) and the external encoding to whatever is returned by -nl_langinfo(CODESET). -

- -

--__enc_traits(const char* __int, const char* __ext) - -

-This constructor takes as parameters string literals that indicate the -desired internal and external encoding. There are no defaults for -either argument. -

- -

-One of the issues with iconv is that the string literals identifying -conversions are not standardized. Because of this, the thought of -mandating and or enforcing some set of pre-determined valid -identifiers seems iffy: thus, a more practical (and non-migraine -inducing) strategy was implemented: end-users can specify any string -(subject to a pre-determined length qualifier, currently 32 bytes) for -encodings. It is up to the user to make sure that these strings are -valid on the target system. -

- -

--void -_M_init() - -

-Strangely enough, this member function attempts to open conversion -descriptors for a given __enc_traits object. If the conversion -descriptors are not valid, the conversion descriptors returned will -not be valid and the resulting calls to the codecvt conversion -functions will return error. -

- -

--bool -_M_good() - -

-Provides a way to see if the given __enc_traits object has been -properly initialized. If the string literals describing the desired -internal and external encoding are not valid, initialization will -fail, and this will return false. If the internal and external -encodings are valid, but iconv_open could not allocate conversion -descriptors, this will also return false. Otherwise, the object is -ready to convert and will return true. -

- -

--__enc_traits(const __enc_traits&) - -

-As iconv allocates memory and sets up conversion descriptors, the copy -constructor can only copy the member data pertaining to the internal -and external code conversions, and not the conversion descriptors -themselves. -

- -

-Definitions for all the required codecvt member functions are provided -for this specialization, and usage of codecvt<internal character type, -external character type, __enc_traits> is consistent with other -codecvt usage. -

- -

-6. Examples -

- -

- a. conversions involving string literals - -

-  typedef codecvt_base::result                  result;
-  typedef unsigned short                        unicode_t;
-  typedef unicode_t                             int_type;
-  typedef char                                  ext_type;
-  typedef __enc_traits                          enc_type;
-  typedef codecvt<int_type, ext_type, enc_type> unicode_codecvt;
-
-  const ext_type*       e_lit = "black pearl jasmine tea";
-  int                   size = strlen(e_lit);
-  int_type              i_lit_base[24] = 
-  { 25088, 27648, 24832, 25344, 27392, 8192, 28672, 25856, 24832, 29184, 
-    27648, 8192, 27136, 24832, 29440, 27904, 26880, 28160, 25856, 8192, 29696,
-    25856, 24832, 2560
-  };
-  const int_type*       i_lit = i_lit_base;
-  const ext_type*       efrom_next;
-  const int_type*       ifrom_next;
-  ext_type*             e_arr = new ext_type[size + 1];
-  ext_type*             eto_next;
-  int_type*             i_arr = new int_type[size + 1];
-  int_type*             ito_next;
-
-  // construct a locale object with the specialized facet.
-  locale                loc(locale::classic(), new unicode_codecvt);
-  // sanity check the constructed locale has the specialized facet.
-  VERIFY( has_facet<unicode_codecvt>(loc) );
-  const unicode_codecvt& cvt = use_facet<unicode_codecvt>(loc); 
-  // convert between const char* and unicode strings
-  unicode_codecvt::state_type state01("UNICODE", "ISO_8859-1");
-  initialize_state(state01);
-  result r1 = cvt.in(state01, e_lit, e_lit + size, efrom_next, 
-                     i_arr, i_arr + size, ito_next);
-  VERIFY( r1 == codecvt_base::ok );
-  VERIFY( !int_traits::compare(i_arr, i_lit, size) ); 
-  VERIFY( efrom_next == e_lit + size );
-  VERIFY( ito_next == i_arr + size );
-

- b. conversions involving std::string -
- c. conversions involving std::filebuf and std::ostream -

- -More information can be found in the following testcases: -

testsuite/22_locale/codecvt_char_char.cc
testsuite/22_locale/codecvt_unicode_wchar_t.cc
testsuite/22_locale/codecvt_unicode_char.cc
testsuite/22_locale/codecvt_wchar_t_char.cc

- -

-7. Unresolved Issues -

- a. things that are sketchy, or remain unimplemented: - do_encoding, max_length and length member functions - are only weakly implemented. I have no idea how to do - this correctly, and in a generic manner. Nathan? -
- b. conversions involving std::string - -
- - how should operators != and == work for string of - different/same encoding? -
- - what is equal? A byte by byte comparison or an - encoding then byte comparison? -
- - conversions between narrow, wide, and unicode strings -
-
- c. conversions involving std::filebuf and std::ostream -
- - how to initialize the state object in a - standards-conformant manner? -
- - how to synchronize the "C" and "C++" - conversion information? -
- - wchar_t/char internal buffers and conversions between - internal/external buffers? -
-

- -

-8. Acknowledgments -

-Ulrich Drepper for the iconv suggestions and patient answering of -late-night questions, Jason Merrill for the template partial -specialization hints, language clarification, and wchar_t fixes. - -

-9. Bibliography / Referenced Documents -

- -Drepper, Ulrich, GNU libc (glibc) 2.2 manual. In particular, Chapters "6. Character Set Handling" and "7 Locales and Internationalization" - -

-Drepper, Ulrich, Numerous, late-night email correspondence -

- -

-Feather, Clive, "A brief description of Normative Addendum 1," in particular the parts on Extended Character Sets -http://www.lysator.liu.se/c/na1.html -

- -

-Haible, Bruno, "The Unicode HOWTO" v0.18, 4 August 2000 -ftp://ftp.ilog.fr/pub/Users/haible/utf8/Unicode-HOWTO.html -

- -

-ISO/IEC 14882:1998 Programming languages - C++ -

- -

-ISO/IEC 9899:1999 Programming languages - C -

- -

-Khun, Markus, "UTF-8 and Unicode FAQ for Unix/Linux" -http://www.cl.cam.ac.uk/~mgk25/unicode.html -

- -

-Langer, Angelika and Klaus Kreft, Standard C++ IOStreams and Locales, Advanced Programmer's Guide and Reference, Addison Wesley Longman, Inc. 2000 -

- -

-Stroustrup, Bjarne, Appendix D, The C++ Programming Language, Special Edition, Addison Wesley, Inc. 2000 -

- -

-System Interface Definitions, Issue 6 (IEEE Std. 1003.1-200x) -The Open Group/The Institute of Electrical and Electronics Engineers, Inc. -http://www.opennc.org/austin/docreg.html -

- - -