Main Page | Namespace List | Class Hierarchy | Class List | File List | Namespace Members | Class Members | File Members
C# Parser Library
1. Introduction The Metaspec C# parser library is designed to
provide the kernel for applications which require processing C# source code:
refactoring, formatting, and verification applications, and other similar
tools.
Our C# parser is available in two versions: a
compilation parser and a transformation parser.
2. Input data Input data for the parser are source files in
the C# language and assemblies. The parser does not work directly with the file
system. All files must be loaded into memory and passed to the parser with the
help of pointers. The following abstract data types are used for
storing input data: files, assemblies and projects. A project can include any
number of files and assemblies. It is responsible for deleting files and
assemblies. When a project is deleted, all files and assemblies in it are also
automatically deleted. For each project you can specify configuration macros
that can include or exclude parts of the program. The parser can store and
process any number of independent projects. 3. Parsing stages Pass 1 – Creating AST trees. A separate tree is
created for each file. Metadata Pass – Generating and adding entities
defined in assemblies to the entity model. Pass 2 – Generating and adding entities defined
in C# files to the entity model. Find predefined entities – Searching for system
data types (System.Object, System.ValueType, etc.). System data types can be
defined both in C# files and in assemblies. Pass 3 – Building the inheritance hierarchy and
type lookup. Pass 4 – Searching for members of types. Method
overload resolution. Searching for explicit and implicit user-defined
conversions. Checking if the program is semantically correct. Knowledge of the order of passes can be used to
optimize reparsing after the program is modified. For example, there is no need
to go through all parsing stages after refactoring. It is enough to build AST
trees for all modified files and go through stages 2-3-4 for all C# files. 4. Basic types Two basic data types are used for internal
representation of the program: AST nodes and entities. AST nodes are grouped
into AST trees. Entities are gathered into the entity model. The rest of the
data structures are auxiliary. Note that the transformation parser uses one
more data type - tokens. 5. Abstract syntax trees A separate abstract syntax tree consisting of
AST nodes is built for each C# file. Only trees that are correct or partially
correct from the syntax (but not semantic!) point of view are built in the
process of parsing the program. Syntactically incorrect parts of the program
are ignored. For example, a namespace cannot be defined inside a function and
is ignored. A partially correct syntax subtree or node is built when the
general syntax rule can be defined unambiguously, even if some part of the data
for generating the complete rule is missing. For example, for the expression: A f ( B b ) { return
b as; // missing as-expression cast-type } the CsAsIsExpression node that contains no cast
type is generated. An abstract syntax tree can be easily transformed
or modified. Such actions are performed by replacing one node with another,
deleting existing nodes and creating new ones. 5.1 AST nodes Each AST node corresponds to a syntax rule of
the C# language grammar. Nodes are related to each other via bidirectional
parent-child links. Each AST node, except the compilation unit, refers to its
parent. An AST node can contain any number of references to children depending
on its type. Additionally, some nodes (for example, goto statements) refer to
nodes that are not their parents or children. Each node is responsible for
deleting its children. When a parent node is deleted, all its children are also
deleted. Any subtree or node can be detached from the
tree if the reference to it is deleted from its parent node. To keep the model
correct in this case, the reference to the parent node should be deleted as
well. A node that has no reference to its parent is called detached; a node
referring to its parent is called attached. Nodes can contain additional information
necessary for the representation of the program: names of identifiers, string
literals, modifiers, etc. Nodes contain references to entities. AST trees
are related to the entity model with the help of these references. References
to entities are divided into two types: entity definitions - references to entities
generated by the node itself, and entity references - references to entities used
by the node. For example, the CsProperty node contains one
definition referring to the property entity and two references referring to the
getter and setter methods. 5.2 AST nodes and tokens For each C# file, the C# transformation parser
generates a list of tokens stored separately from the AST tree. Each token is
assigned an index. It is unique for tokens belonging to one file. The nodes of
the transformation tree contain token indexes corresponding to the syntax rule
of the node. Program example: void f() { g(); } The relation between the character
representation of the program and tokens (C# transformation parser): The relation between tokens and AST nodes (C#
transformation parser) 6. Entity Model The entity model is global for the entire
project. It is used to link together all files and assemblies. The entity model
is a tree whose root is a global namespace. In the process of parsing a program that has
errors in it, a partially correct entity model can be generated. For example,
two types with the same name can be declared in one namespace. In this case
both types will be defined in one namespace. But the "Cannot resolve
entity" error will be displayed for each attempt to find one of the two
types by its name. Such an error is processed in a different way for those
entities whose definition semantics depends on the order they are declared in
the program (enumerations and local entities). The second entity with the same
name is neither created nor added to the definition scope. Unlike AST trees, the entity model is not
subject to transformations. It is possible only to add entities to the entity
model or remove them from it. 6.1 Entities An entity is a named object defined within some
scope. Entities are related to each other via bidirectional parent-child links.
All entities, except the global namespace, have one and only one parent. The
global namespace has no parent. An entity can have any number of children
depending on its type. It is responsible for deleting its children. When an
entity is deleted, all its children are also deleted. Children are stored
either in the hash table of their scope or in special fields. An entity
imported from an assembly contains a reference to the assembly from which it
was imported. 6.1.1 Global and local entities Global
entities correspond to the types of metadata defined in the ECMA-335 standard.
Global entities are visible from outside and any part of the program. Other
assemblies can refer to them. Local and auxiliary entities are introduced to
simplify the internal logic of the parser and to generalize the algorithms of
searching for entities. They are neither visible nor available outside the
method in which they are defined. The
correspondence between global entities and metadata types:
Local and auxiliary entities:
6.1.2 Entity name The name of an entity can be specified directly
in the program, imported from an assembly or generated according to the rules
of the C# language. Some local and auxiliary entities have no names. 6.1.2.1 Special method names The names of special methods (constructors,
destructors, implicit and explicit conversions, operators) are specified
according to the ECMA-335 standard. 6.1.2.2 Property method names The names of methods for a property “P” are
specified by adding prefixes to the name of the property. The getter method is
formed by adding the get_ prefix to the name of the property (get_P). The
setter method is formed by adding the set_ prefix to the name of the property
(set_P). 6.1.2.3 Event method names The names of methods for an event “E” are
specified by adding prefixes to the name of the event. The add method is formed
by adding the add_ prefix to the name of the event (add_E). The remove method
is formed by adding the remove_ prefix to the name of the event (remove_E). 6.1.2.4 Indexer names An indexer in C# corresponds to a property with
parameters, but it has a number of peculiarities. The name of an indexer in C#
is not taken into account in the overload resolution algorithms. One can even
say that indexers in C# have no name. But according to the CLR specification,
each method must have a name. That is why the C# compiler names a property
“Item” by default. Correspondingly, the get and set functions of such a
property are named get_Item and set_Item. It is possible to specify another
name for the indexer property with the help of the System.Runtime.CompilerServices.IndexerName
attribute. 6.1.2.5 Explicit implementation names Compound names are generated for the explicit
implementations of a method, a property or an event. The generated name is the
full name of a method, a property or an event implemented by the explicit
implementation. Example: namespace N { class I { public void f(); } class C : I { void I.f() {} // N.I.f } } If the implemented method, property or event
cannot be found due to an error, the name is not generated and the entity
remains nameless. Explicit implementations are not taken into
account while searching for members of a class. In other words, they are
invisible. 6.2 The relation between AST nodes and entities Program example: namespace A.B { class C { public int P { get() { return 0; } } } } Entities generated out of C# files refer to AST
nodes that generated them. In turn, AST nodes contain references to the
entities generated by them. 7. Supported non-standard
extensions of the C# language introduced by Microsoft 7.1. __makeref Used for converting a variable into System.TypedReference.
The parameter used for constructing a reference must be a left-hand expression.
Example: int x; TypedReference tref; tref = __makeref(x); 7.2. __reftype Used for getting the type (System.Type) by the
reference (System.TypedReference). Example: TypedReference typedref; Type type; type = __reftype(typedref); 7.3. __refvalue Used for getting the value stored by the
reference (System.TypedReference). Example: TypedReference typedref; int x; x = __refvalue(typedref,
int); 7.4. __arglist Used for working with functions taking a
variable number of parameters. The __arglist keyword is used in three cases: 1. As the last formal parameter of a function
taking a variable number of parameters. Example: [CLSCompliant(false)] public static void
WriteLine(String format, Object arg0, Object arg1, Object arg2,Object arg3,
__arglist); 2. As a parameter of the constructor of the
System.ArgIterator object. Example: ArgIterator args = new
ArgIterator(__arglist); 3. As a parameter while calling a function
taking a variable number of parameters. Example: int AddInts(__arglist) {} void Call() { int result = AddInts(__arglist(1, 2, 3, 4,
5)); } |
||||||||||||||||||||||||||||||||||||||
© 2005 metaspec |