Over the past few months, I’ve started the shaka-scheme project for my senior project at the University of Hawaii at Manoa, and I wanted to highlight two pieces of code that recently changed our entire course of development:

(define proc1 (lambda ()
  (define define 1)
  (display "define is actually ")
  ; This doesn't display a procedure literal...
  (display define)
  (display " in this scope\n")))
(proc1)

(define proc2 (lambda ()
  (define-syntax define
    (syntax-rules ()
      [(define a b) (display "fooled you\n")]))
  ; This doesn't do the normal define...
  ; And it's not an error.
  (define 1 2)))
(proc2)

In Scheme, you can redefine what are called primitive expression symbols inside of non-global scopes, such as within a lambda expression. If you run this using GNU Guile or Racket, you’ll find that this code runs flawlessly — it prints out:

guile-redefine-define

So what?

On the Shaka Scheme project, we decided to go with a tree-based approach — essentially, we would build an AST in-memory during parsing, and then evaluate the AST through tree traversal.

Untitled Diagram.png
A conceptual diagram of our tree-based AST.

This seemed like a relatively simple way of doing things. We would adopt a supposedly more efficient primitive (the IDataNode tree with children stored in a std::vector), and we would also get a nice AST if we ever wanted to debug in the future.

Unfortunately, the design called for strict primitive expression forms that assumed that (define a b) was the only valid form — because why wouldn’t it be? Of course, that all changed when you realize that (define a b c) could be valid in a certain context.

In that instant, the “fixed” structure of the various AST specifications we were using to represent the parsed forms of the various Scheme expressions basically broke. In addition, the fixed “parsing” grammar defined for the top-level <expression> rule can no longer look ahead for (define and correctly assume that the rest of the <define> rule will look like: ( define <identifier> <expression> ) as define itself could possibly be a completely different procedure… or even just 1.

The solution

The solution is to simply redo the entire evaluation and parsing specification to make NO ASSUMPTIONS.

Scheme, like most other Lisps, are notorious for being runtime-heavy in the sense that figuring out the types of things and the bindings of identifiers is harder to do statically without going down the rabbit hole of more rigorous analysis.This rings especially true here, as almost any primitive expression keyword can be a complete different procedure given the right context and redefinitions. Therefore, all parsing must be agnostic to the idea that define means only the usual define procedure, and every single expression is simply a list waiting to be evaluated (unless, of course, you can prove that it does represent the usual define expression in that context).

Second of all, Scheme lists may not be more elegant when stored as in-memory trees where the list is represented as a root with its children as the elements. The reason being is that car  and cdr may require taking slides of an array or vector of children, and keeping track of the size of slices of a vector is somewhat more cumbersome than simply using pointer access to check of the existence of the next element. Why keep track of slices or trees if one can simply do the usual, canonical representation of Scheme lists as linked lists?

Third of all, Scheme R7RS macros become significantly more complicated without the uniform structure of “everything is a list,” as traversal over non-uniform trees makes parsing on macro-derived expression types becomes harder to reason about.

For example, parsing (let ((x 1) (y 2)) (+ x y)) (which is a classic macro-implemented or derived expression type, according to R7RS) requires knowledge that

  • let is a macro
  • ((x 1) (y 2)) is not a procedure call, but a list of syntax primitives that will be manipulated by the macro

How is one to decide these things before actually evaluating the value of the let identifier in the current environment? There is no answer without more rigorous static analysis of the scope in question.

Remarks

I personally chose Scheme as “the easy” language to implement as a semester-long “programming language-focused” project in C++. Unfortunately, the extremely dynamic nature of Scheme (let’s not get started on continuations…) means that implementing its finer parts is definitely not a trivial task.

Lisp is an especially interesting case for evaluation, because so little things are known at compile-time. The extremely flexible semantics of Scheme gives its implementation a sort of hardness that surpasses that of something like a MIPS processor simulator.

If you’re still not convinced of the non-toyness of Scheme, why not read some of the latest Scheme standard, R7RS?Scheme is a relatively full-featured language compared to the Lisps of old, and sports the following features:

  • Continuations with call/cc
  • Expression-based macros with the define-syntax/syntax-rules system (standing apart from the R6RS syntax-case macro system and the classic defmacro system)
    • Note that C has token-based macros with its preprocessor. Expression-based macros are more powerful.
  • A full-featured numeric system, with support for true fractional/rational types, inexact/exact number differentiations, and complex numbers.
  • Optional lazy evaluation
  • Records (also known as product types and comparable to C’s struct)
  • Strings, with some optional Unicode support
  • Bytevectors and vectors, which represent linear data structures with fixed-size
  • Exceptions
  • Library/Module system

 

Advertisements