Syntax highlighter

2015-01-23

Implementing non portable syntax-case

There are 2 portable syntax-case implementations, psyntax and SRFI-72. Both can be run on R5RS implementations and provide not only syntax-case but also library mechanism defined in R6RS. It is very reasonable to have such a huge mechanism instead of just providing syntax-case only because bunch of things pretty much related with syntax-case (or more precisely procedural macro and library system).

You can use it as your expander however if an implementation already has own module system then it might not a good idea to use it as it is. As my understanding, at least SRFI-72, expands all bindings to only one namespace which makes the namespace or symbol table really huge and may impacts memory usage. And again, this is reasonable decision because R5RS doesn't provide module system and if you need to write portable library then everything needs to be defined in one single namespace.

Since Sagittarius is using none of them but implementing own syntax-case expander and library system, it wouldn't hurt to write an article for this. Plus, apart from my lack of searching skill, I couldn't find any article mentioning how to implement it, other then couple of papers (could be that's enough for everybody...), I thought I can contaminate the search engine result a bit more with mine. This is basically my memo before I forget everything.

WARNING: The following sections may or may not contain mistakes. DO NOT TRUST :P

How to implement syntax-case on top of own library system


Implementing syntax-case is not an easy task to do especially in a portable way. The 2 portable implementation provides syntax-case expander and library system. This is because R5RS doesn't provide module system and R6RS procedual macros require phasing to resolve from where the bindings are imported during macro evaluation (not expansion) period. However if an implementations already has own library system, it is better to implement syntax-case on top of it so that the implementations can reuse the existing libraries without modifying anything. The following sections describe how Sagittarius' macro expander is implemented on top of own library system.

Pre-requirements


On Sagittarius, 2 Scheme objects were introduced for syntax-case, identifiers and macros. Identifier has the following slots:
  • name: raw name of the identifier, symbol
  • envs: compiler environment frame, alist
  • library: a library where this identifier belongs, library object
  • identify: a unique symbol generated per macro expansion, uninterned symbol
  • pending: a flag that indicates if the identifier is a template variable or not, boolean
The last slot, pending, is used to make an identifier unique so that evaluating template variable makes unique identifier each time. This may not needed if macro expander itself generates unique bindings like SRFI-72. However doing so may cause symbol explosion so I decided to introduce this. Internally, macro expansion is done by renaming symbols or identifiers. Each expansion remembers which symbol or identifier renamed to which so that if a symbol is renamed to an identifier twice then it will be the same object in sense of eq?.

Macro has the followings slots:
  • transformer: a procedure which expands given expassion
  • envs: a compile time environment when this macro is bound, vector
Whenever a macro is invoked, then VM swaps current macro environment with the one the macro has. And new identity is generated to rename symbols or identifier properly. Swapped environment and identity are restored when the transformer returns.

Renaming


Renaming, on this context, means converting symbol or identifier to new identifier. There are 2 rules when renaming is invoked:
  • Symbols are renamed to be global identifier (passing null environment frame)
  • Not all identifiers are renamed (pattern variables, non local variables, etc.)
The idea of the first rules is inspired by SRFI-72 implementation which very first step is wrap all symbols to identifiers with empty environment. To avoid unnecessary memory allocation, I've decided to delay this step as late as possible. Thus raw symbols and identifiers created with empty environment are treated the same.

The second one is preservation. For example, local variables need to be preserved so that it can be referred. The same goes pattern variable and pending identifier which the pending slot contains #t.

Renaming is done 3 times in total. The first one is during syntax-case compilation. In this step, both pattern and template are renamed. Then the pattern variables are stored in compile time environment. The second one is during syntax compilation. In this step, only raw symbols are renamed to identifiers. The last one is during expansion. The target template needs to be renamed so that it can generate fresh identifier each time. After the expansion, raw symbols are also renamed if there is. The raw symbol conversion is required because of R6RS procedures. (actually not needed if I don't care about compliance but unfortunately I do.)

Identifier equivalence


Comparing identifiers are crucial. R6RS defines 2 types of comparison procedures, bound-identifier=? and free-identifier=?. On Sagittarius, returning #t from bound-identifier=? means given 2 identifiers have the same identities and libraries. Thus they are generated on the same macro expansion phase. Free identifiers are pretty much simple, it just checks if the given 2 identifiers are bound to the same value. This also means different named identifier can be #t in sense of free-identifier=?. (R6RS allows implementations to behave like this.)

Local variables are also simply looked up. Compiler just needs to compare variables in sense of bound-identifier=?. Because of the trick described above section, the same source identifiers are renamed to the same object. This makes look up process simply and easy. Even though the process is simple, however, comparing identifier and symbol needs special treatment. In basic concept, raw symbols and global identifiers are the same. The special treatment is when an identifier contains the same environment frame as current compile time environment frame. Variable look up is done per frame and whenever an identifier is generated by macro expander, then it contains environment frame of that moment. Following is the example case:
(import (rnrs))

(let ((a 1))
  (let-syntax ((foo (syntax-rules () ((_) a))))
    (let ((a 2))
      (+ (foo) a))))
The macro foo contains the first a which will be an identifier when it's expanded. However the compiler environment frame contains raw symbol a. So to look up it properly when the identifier contains the same frame, then it needs to compare the target as either raw symbol or global identifier.

Pit falls


Fresh template variables: R6RS requires the following template variable to be unique:
(import (rnrs))

(define-syntax define-dummy
  (syntax-rules ()
    ((_)
     (define dummy))))

dummy
;; -> unbound variable error
As I described above section, SRFI-72 resolves this by renaming all bindings to unique symbols. However this might be overkill if implementations already have its module system or namespace. A symbol can have different bindings if module system allows. On Sagittarius, a binding uses the raw symbol name of an identifier. Thus without any treatment, above piece of code can be run without problem. To resolve this, I introduced pending slot. The syntax which creates bindings such as define, then it always swap raw symbol name of pending identifier to uninterned symbol so that the identifier can not be seen out side of the template where it's defined. This is rather ugly solution, so I want something better.

Pattern variables: pattern variables are also identifiers except its environment frame is not a valid compiler environment frame. This is needed because pattern variables need to be preserved. However preserving pattern variable made incompatibility of portable implementations. The particular issue is following:
(import (rnrs))

(define-syntax foo
  (syntax-rules ()
    ((_ x)
     (let-syntax ((bar (syntax-rules (x)
                         ((_ a) #t)
                         ((_ b) #f))))
       (bar b)))))

(foo a)
;; -> #f
The pattern variable a is not renamed so the first pattern of bar won't match (because the pattern variable a and input variable a are not the same in sense of free-identifier=?). I believe this behaviour is still R6RS compliant since it doesn't specify this type of pattern variable renaming (if I read it correctly). To avoid this, the following is one of the solution:
(import (rnrs))

(define-syntax foo
  (syntax-rules ()
    ((_ x)
     (let-syntax ((a (syntax-rules ())))
       (let-syntax ((bar (syntax-rules (x)
                           ((_ a) #t)
                           ((_ b) #f))))
         (bar b))))))

(foo a)
;; -> #t
This behaviour is, again if I read R6RS correctly, specified in R6RS. In this case, pattern variables need to be renamed.

Conclusion


Using portable implementations is much easier then implementing own syntax-case. As long as implementations don't have own module system, it is better to use it. I just wanted to show at least there is a way not to use portable implementations. If you can integrate syntax-case on your own implementation, then core part can be really small (less than 1000 LoC in my case).

No comments:

Post a Comment