-This is doc/cppinternals.info, produced by makeinfo version 4.5 from
-doc/cppinternals.texi.
+This is doc/cppinternals.info, produced by makeinfo version 4.13 from
+/d/gcc-4.4.3/gcc-4.4.3/gcc/doc/cppinternals.texi.
-INFO-DIR-SECTION Programming
+INFO-DIR-SECTION Software development
START-INFO-DIR-ENTRY
* Cpplib: (cppinternals). Cpplib internals.
END-INFO-DIR-ENTRY
This file documents the internals of the GNU C Preprocessor.
- Copyright 2000, 2001, 2002 Free Software Foundation, Inc.
+ Copyright 2000, 2001, 2002, 2004, 2005, 2006, 2007 Free Software
+Foundation, Inc.
Permission is granted to make and distribute verbatim copies of this
manual provided the copyright notice and this permission notice are
\1f
File: cppinternals.info, Node: Top, Next: Conventions, Up: (dir)
+The GNU C Preprocessor Internals
+********************************
+1 Cpplib--the GNU C Preprocessor
+********************************
-Cpplib--the GNU C Preprocessor
-******************************
-
- The GNU C preprocessor in GCC 3.x has been completely rewritten. It
-is now implemented as a library, "cpplib", so it can be easily shared
-between a stand-alone preprocessor, and a preprocessor integrated with
-the C, C++ and Objective-C front ends. It is also available for use by
-other programs, though this is not recommended as its exposed interface
-has not yet reached a point of reasonable stability.
+The GNU C preprocessor is implemented as a library, "cpplib", so it can
+be easily shared between a stand-alone preprocessor, and a preprocessor
+integrated with the C, C++ and Objective-C front ends. It is also
+available for use by other programs, though this is not recommended as
+its exposed interface has not yet reached a point of reasonable
+stability.
The library has been written to be re-entrant, so that it can be used
to preprocess many files simultaneously if necessary. It has also been
* Line Numbering:: Tracking location within files.
* Guard Macros:: Optimizing header files with guard macros.
* Files:: File handling.
-* Index:: Index.
+* Concept Index:: Index.
\1f
File: cppinternals.info, Node: Conventions, Next: Lexer, Prev: Top, Up: Top
Conventions
***********
- cpplib has two interfaces--one is exposed internally only, and the
+cpplib has two interfaces--one is exposed internally only, and the
other is for both internal and external use.
The convention is that functions and types that are exposed to
multiple files internally are prefixed with `_cpp_', and are to be
-found in the file `cpphash.h'. Functions and types exposed to external
+found in the file `internal.h'. Functions and types exposed to external
clients are in `cpplib.h', and prefixed with `cpp_'. For historical
reasons this is no longer quite true, but we should strive to stick to
it.
Overview
========
- The lexer is contained in the file `cpplex.c'. It is a hand-coded
-lexer, and not implemented as a state machine. It can understand C, C++
-and Objective-C source code, and has been extended to allow reasonably
+The lexer is contained in the file `lex.c'. It is a hand-coded lexer,
+and not implemented as a state machine. It can understand C, C++ and
+Objective-C source code, and has been extended to allow reasonably
successful preprocessing of assembly language. The lexer does not make
an initial pass to strip out trigraphs and escaped newlines, but handles
them as they are encountered in a single pass of the input file. It
Lexing a token
==============
- Lexing of an individual token is handled by `_cpp_lex_direct' and
-its subroutines. In its current form the code is quite complicated,
-with read ahead characters and such-like, since it strives to not step
-back in the character stream in preparation for handling non-ASCII file
+Lexing of an individual token is handled by `_cpp_lex_direct' and its
+subroutines. In its current form the code is quite complicated, with
+read ahead characters and such-like, since it strives to not step back
+in the character stream in preparation for handling non-ASCII file
encodings. The current plan is to convert any such files to UTF-8
before processing them. This complexity is therefore unnecessary and
will be removed, so I'll not discuss it further here.
baz
This is a good example of the subtlety of getting token spacing
-correct in the preprocessor; there are plenty of tests in the test
-suite for corner cases like this.
+correct in the preprocessor; there are plenty of tests in the testsuite
+for corner cases like this.
The lexer is written to treat each of `\r', `\n', `\r\n' and `\n\r'
as a single new line indicator. This allows it to transparently
Lexing a line
=============
- When the preprocessor was changed to return pointers to tokens, one
+When the preprocessor was changed to return pointers to tokens, one
feature I wanted was some sort of guarantee regarding how long a
returned pointer remains valid. This is important to the stand-alone
preprocessor, the future direction of the C family front ends, and even
The tokens forming a macro's replacement list are collected by the
`#define' handler, and placed in storage that is only freed by
-`cpp_destroy'. So if a macro is expanded in our line of tokens, the
-pointers to the tokens of its expansion that we return will always
+`cpp_destroy'. So if a macro is expanded in the line of tokens, the
+pointers to the tokens of its expansion that are returned will always
remain valid. However, macros are a little trickier than that, since
they give rise to three sources of fresh tokens. They are the built-in
macros like `__LINE__', and the `#' and `##' operators for
Hash Nodes
**********
- When cpplib encounters an "identifier", it generates a hash code for
-it and stores it in the hash table. By "identifier" we mean tokens
-with type `CPP_NAME'; this includes identifiers in the usual C sense,
-as well as keywords, directive names, macro names and so on. For
-example, all of `pragma', `int', `foo' and `__GNUC__' are identifiers
-and hashed when lexed.
+When cpplib encounters an "identifier", it generates a hash code for it
+and stores it in the hash table. By "identifier" we mean tokens with
+type `CPP_NAME'; this includes identifiers in the usual C sense, as
+well as keywords, directive names, macro names and so on. For example,
+all of `pragma', `int', `foo' and `__GNUC__' are identifiers and hashed
+when lexed.
Each node in the hash table contains various information about the
identifier it represents. For example, its length and type. At any one
Macro Expansion Algorithm
*************************
- Macro expansion is a tricky operation, fraught with nasty corner
-cases and situations that render what you thought was a nifty way to
-optimize the preprocessor's expansion algorithm wrong in quite subtle
-ways.
+Macro expansion is a tricky operation, fraught with nasty corner cases
+and situations that render what you thought was a nifty way to optimize
+the preprocessor's expansion algorithm wrong in quite subtle ways.
I strongly recommend you have a good grasp of how the C and C++
standards require macros to be expanded before diving into this
Internal representation of macros
=================================
- The preprocessor stores macro expansions in tokenized form. This
-saves repeated lexing passes during expansion, at the cost of a small
+The preprocessor stores macro expansions in tokenized form. This saves
+repeated lexing passes during expansion, at the cost of a small
increase in memory consumption on average. The tokens are stored
contiguously in memory, so a pointer to the first one and a token count
is all you need to get the replacement list of a macro.
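The "pointer plus count" idea can be illustrated with a minimal sketch.
The names `sketch_token', `sketch_macro' and
`sketch_replacement_length' are hypothetical and are not cpplib's
actual types or functions; the real `cpp_token' carries considerably
more information than a spelling.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified token: only a spelling is kept here.  */
struct sketch_token {
    const char *spelling;
};

/* Hypothetical macro record: the replacement list is described by a
   pointer to the first token and the number of tokens, which is enough
   because the tokens are assumed to sit contiguously in memory.  */
struct sketch_macro {
    const struct sketch_token *first;
    size_t count;
};

/* Walk the replacement list by plain array indexing and return the
   summed length of all token spellings.  No re-lexing is involved.  */
size_t sketch_replacement_length(const struct sketch_macro *m)
{
    size_t total = 0;
    for (size_t i = 0; i < m->count; i++)
        total += strlen(m->first[i].spelling);
    return total;
}
```

Because the list is contiguous, iterating over it is ordinary array
indexing from `first', and no per-token link pointers are needed.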
Macro expansion overview
========================
- The preprocessor maintains a "context stack", implemented as a
-linked list of `cpp_context' structures, which together represent the
-macro expansion state at any one time. The `struct cpp_reader' member
+The preprocessor maintains a "context stack", implemented as a linked
+list of `cpp_context' structures, which together represent the macro
+expansion state at any one time. The `struct cpp_reader' member
variable `context' points to the current top of this stack. The top
normally holds the unexpanded replacement list of the innermost macro
under expansion, except when cpplib is about to pre-expand an argument,
Scanning the replacement list for macros to expand
==================================================
- The C standard states that, after any parameters have been replaced
+The C standard states that, after any parameters have been replaced
with their possibly-expanded arguments, the replacement list is scanned
for nested macros. Further, any identifiers in the replacement list
that are not expanded during this scan are never again eligible for
Looking for a function-like macro's opening parenthesis
=======================================================
- Function-like macros only expand when immediately followed by a
+Function-like macros only expand when immediately followed by a
parenthesis. To do this cpplib needs to temporarily disable macros and
read the next token. Unfortunately, because of spacing issues (*note
Token Spacing::), there can be fake padding tokens in-between, and if
Marking tokens ineligible for future expansion
==============================================
- As discussed above, cpplib needs a way of marking tokens as
+As discussed above, cpplib needs a way of marking tokens as
unexpandable. Since the tokens cpplib handles are read-only once they
have been lexed, it instead makes a copy of the token and adds the flag
`NO_EXPAND' to the copy.
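The copy-then-flag step described above might look roughly like the
following sketch.  The type `ne_token', the function
`sketch_mark_no_expand' and the value of `SKETCH_NO_EXPAND' are
assumptions made for illustration; cpplib's own token layout, flag
value and allocation strategy are not shown here.

```c
#include <stdlib.h>

/* Stand-in for cpplib's NO_EXPAND flag; the bit value is assumed.  */
#define SKETCH_NO_EXPAND 0x01u

/* Hypothetical, simplified token with a flags byte.  */
struct ne_token {
    unsigned char flags;
    const char *spelling;
};

/* Return a heap-allocated copy of TOK with the no-expand flag set.
   The original token is treated as read-only and is left untouched.
   Returns NULL if allocation fails; the caller frees the copy.  */
struct ne_token *sketch_mark_no_expand(const struct ne_token *tok)
{
    struct ne_token *copy = malloc(sizeof *copy);
    if (copy == NULL)
        return NULL;
    *copy = *tok;                    /* duplicate every field */
    copy->flags |= SKETCH_NO_EXPAND; /* flag only the copy */
    return copy;
}
```

The sketch uses `malloc' purely for simplicity; the point is only that
the flag lands on the copy, so other users of the original token never
see it change.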
Token Spacing
*************
- First, let's look at an issue that only concerns the stand-alone
-preprocessor: we want to guarantee that re-reading its preprocessed
-output results in an identical token stream. Without taking special
-measures, this might not be the case because of macro substitution.
-For example:
+First, consider an issue that only concerns the stand-alone
+preprocessor: there needs to be a guarantee that re-reading its
+preprocessed output results in an identical token stream. Without
+taking special measures, this might not be the case because of macro
+substitution. For example:
#define PLUS +
#define EMPTY
and after each macro replacement, each argument replacement, and
additionally each token created by the `#' and `##' operators.
- Let's look at how the preprocessor gets whitespace output correct
+ Look at how the preprocessor gets whitespace output correct
normally. The `cpp_token' structure contains a flags byte, and one of
those flags is `PREV_WHITE'. This is flagged by the lexer, and
indicates that the token was preceded by whitespace of some form other
Here, two padding tokens are generated with sources the `foo' token
between the brackets, and the `bar' token from foo's replacement list,
-respectively. Clearly the first padding token is the one we should
-use, so our output code should contain a rule that the first padding
-token in a sequence is the one that matters.
+respectively. Clearly the first padding token is the one to use, so
+the output code should contain a rule that the first padding token in a
+sequence is the one that matters.
- But what if we happen to leave a macro expansion? Adjusting the
-above example slightly:
+ But what if a macro expansion is left? Adjusting the above example
+slightly:
#define foo bar
#define bar EMPTY baz
Just which line number anyway?
==============================
- There are three reasonable requirements a cpplib client might have
-for the line number of a token passed to it:
+There are three reasonable requirements a cpplib client might have for
+the line number of a token passed to it:
* The source line it was lexed on.
Representation of line numbers
==============================
- As mentioned above, cpplib stores with each token the line number
-that it was lexed on. In fact, this number is not the number of the
-line in the source file, but instead bears more resemblance to the
-number of the line in the translation unit.
+As mentioned above, cpplib stores with each token the line number that
+it was lexed on. In fact, this number is not the number of the line in
+the source file, but instead bears more resemblance to the number of the
+line in the translation unit.
The preprocessor maintains a monotonic increasing line count, which
is incremented at every new line character (and also at the end of any
The Multiple-Include Optimization
*********************************
- Header files are often of the form
+Header files are often of the form
#ifndef FOO
#define FOO
conditional block for the optimization to be on.
Note that whilst we are inside the conditional block, `mi_valid' is
-likely to be reset to `false', but this does not matter since the the
+likely to be reset to `false', but this does not matter since the
closing `#endif' restores it to `true' if appropriate.
Finally, since `_cpp_lex_direct' pops the file off the buffer stack
turned off.
\1f
-File: cppinternals.info, Node: Files, Next: Index, Prev: Guard Macros, Up: Top
+File: cppinternals.info, Node: Files, Next: Concept Index, Prev: Guard Macros, Up: Top
File Handling
*************
- Fairly obviously, the file handling code of cpplib resides in the
-file `cppfiles.c'. It takes care of the details of file searching,
-opening, reading and caching, for both the main source file and all the
-headers it recursively includes.
+Fairly obviously, the file handling code of cpplib resides in the file
+`files.c'. It takes care of the details of file searching, opening,
+reading and caching, for both the main source file and all the headers
+it recursively includes.
The basic strategy is to minimize the number of system calls. On
many systems, the basic `open ()' and `fstat ()' system calls can be
the file minus the base name.
\1f
-File: cppinternals.info, Node: Index, Prev: Files, Up: Top
+File: cppinternals.info, Node: Concept Index, Prev: Files, Up: Top
-Index
-*****
+Concept Index
+*************
+\0\b[index\0\b]
* Menu:
-* assertions: Hash Nodes.
-* controlling macros: Guard Macros.
-* escaped newlines: Lexer.
-* files: Files.
-* guard macros: Guard Macros.
-* hash table: Hash Nodes.
-* header files: Conventions.
-* identifiers: Hash Nodes.
-* interface: Conventions.
-* lexer: Lexer.
-* line numbers: Line Numbering.
-* macro expansion: Macro Expansion.
-* macro representation (internal): Macro Expansion.
-* macros: Hash Nodes.
-* multiple-include optimization: Guard Macros.
-* named operators: Hash Nodes.
-* newlines: Lexer.
-* paste avoidance: Token Spacing.
-* spacing: Token Spacing.
-* token run: Lexer.
-* token spacing: Token Spacing.
+* assertions: Hash Nodes. (line 6)
+* controlling macros: Guard Macros. (line 6)
+* escaped newlines: Lexer. (line 6)
+* files: Files. (line 6)
+* guard macros: Guard Macros. (line 6)
+* hash table: Hash Nodes. (line 6)
+* header files: Conventions. (line 6)
+* identifiers: Hash Nodes. (line 6)
+* interface: Conventions. (line 6)
+* lexer: Lexer. (line 6)
+* line numbers: Line Numbering. (line 6)
+* macro expansion: Macro Expansion. (line 6)
+* macro representation (internal): Macro Expansion. (line 19)
+* macros: Hash Nodes. (line 6)
+* multiple-include optimization: Guard Macros. (line 6)
+* named operators: Hash Nodes. (line 6)
+* newlines: Lexer. (line 6)
+* paste avoidance: Token Spacing. (line 6)
+* spacing: Token Spacing. (line 6)
+* token run: Lexer. (line 192)
+* token spacing: Token Spacing. (line 6)
\1f
Tag Table:
-Node: Top\7f910
-Node: Conventions\7f2579
-Node: Lexer\7f3523
-Ref: Invalid identifiers\7f11446
-Ref: Lexing a line\7f13395
-Node: Hash Nodes\7f18168
-Node: Macro Expansion\7f21050
-Node: Token Spacing\7f30015
-Node: Line Numbering\7f35897
-Node: Guard Macros\7f39988
-Node: Files\7f44786
-Node: Index\7f48250
+Node: Top\7f971
+Node: Conventions\7f2656
+Node: Lexer\7f3598
+Ref: Invalid identifiers\7f11511
+Ref: Lexing a line\7f13460
+Node: Hash Nodes\7f18233
+Node: Macro Expansion\7f21112
+Node: Token Spacing\7f30059
+Node: Line Numbering\7f35919
+Node: Guard Macros\7f40004
+Node: Files\7f44795
+Node: Concept Index\7f48261
\1f
End Tag Table