RDF graph syntax.

Pat Hayes, Florida IHMC.

DRAFT, liable to be altered without notice, cite at your own risk.
Inspired by discussions about blank node scoping in the W3C DAWG WG. All of the views expressed are my own, however.

0. Introduction.

The syntax of RDF, the base layer of the W3C semantic web standards, is defined in a slightly unusual way, using a mathematical abstraction called 'RDF graphs' rather than by using a familiar syntactic meta-language like EBNF. This is an attempt to explain the basic ideas of this 'graph abstract syntax'.

1. What is an 'abstract' syntax?

The idea of an 'abstract syntax' is originally due to John McCarthy, and it was intended as a way to describe the mathematical structure of a parsed expression, rather than the particular details of a surface notation.

There are all kinds of possible minor variations in how a formal notation might be encoded as strings of characters; for example, the application of a function to two arguments might be shown by writing the function name first followed by the arguments separated by commas and enclosed in parentheses

add(3, 17)

or in LISP style by moving the parentheses and omitting the commas

(plus 3 17)

or in a postfix style with a distinctive terminator

3 17 plus ;

or in any number of alternatives. If we say that a language is defined by an EBNF grammar, then these all have to be defined as different languages: but in another sense of "language", they are all just minor variations on a common theme, a single language. For example, when specifying a semantics these differences are irrelevant; what matters is simply that there is an application of a function to two arguments. An abstract syntax tries to describe this common theme in a way which ignores the possible surface variations, by describing these linguistic structures - the outputs of a parser rather than the character strings which are input to it - directly, as mathematical objects. It is important to emphasize that an abstract syntax is not a linguistic surface syntax in the conventional sense, i.e. a grammar: it is a mathematical account of a class of structures.

McCarthy's original scheme used an algebraic style of description in which the abstract syntactic objects were formed by applying constructor operations. The RDF abstract syntax uses a much simpler mathematical model, but the idea and motivation are similar: to provide a purely mathematical description of what might be called the essential structure of RDF, independently of the particular surface syntactic variations which might be used to convey this structure in a machine-transmittable notation.

2. Why use an abstract syntax for RDF?

A central motivation for adopting this strategy was to support a variety of such surface notations for RDF, and still have them all count as "real RDF", rather than commit to a particular surface notation as the only possible compliant notation for RDF. The defining characteristic of an RDF notation is then that it can be parsed into data whose structure properly mirrors the purely mathematical abstraction which "is" RDF (and whose processing conforms to the semantics, which is itself defined relative to the abstract syntax). This allows for a wide variety of "correct" RDF notations, including RDF/XML, Turtle, N3, node-arc diagrams of the kind shown in various RDF documents, concept maps generated by COE, and even database tables.

But there are other motivations. The RDF graph syntax is in several important respects simpler than any surface syntax, and makes possible a very simple and straightforward - almost elementary - approach to some surface-syntactic issues which are notoriously troublesome to get exactly right, especially the issue of bound name scopes.

Many formal languages use bound names (or bound variables: for our purposes, a 'variable' is just a special kind of name) to convey syntactic associations of various kinds. There is an entire mathematical theory of 'binders' (logical quantifiers, Church's lambda operator, Landin's J-operator, the mu-calculus, etc.) whose exotic notations, like the warning colorations on a poisonous reptile, are testament to the harm they can cause if handled carelessly. The complications that they necessitate in what would otherwise be simple ideas (such as substituting an expression for a name in another expression) are notorious (such as, in this case, requiring free, but not bound, occurrences of names in the substituted expression to be bound by enclosing, but not non-enclosing, binders of the name in the target expression.) Carefully crafted prose must be written defining such distinctions as free versus bound names, binder scopes, scope gaps and exceptions, etc... To establish meta-theorems about these notations, if done strictly, often requires proofs of such mind-numbing complexity that they are almost always abbreviated or summarized in a phrase like "...using a familiar structural induction." And yet, remarkably, neither binders nor bound names need even occur in a well-written abstract syntax; they, and all their attendant complexity, are an artifact of the process of encoding a graphical structure in a linear, sequential notation. Bound names are a kind of character-coding syntactic scaffolding, necessary only to link one 'place' to another. Notations which permit such linking to be indicated directly, using graphical conventions - which includes both of the acknowledged historical roots of all modern formal notations, Peirce's existential graphs and Frege's Begriffsschrift - simply do not need them. The RDF abstract graph syntax has the same simplifying advantage: it provides for purely existential assertions without using the existential quantifier (which would have to be assigned a syntactic scope) or needing a special syntactic category of existentially bound variable.

3. Blank nodes and identifier scopes

That last claim may seem to be contradicted by the common observation that RDF amounts to data tables (containing triples) enriched with the existential quantifier. While not strictly correct, this description has the merit of focusing on the part that often causes a great deal of trouble: the things 'bound' by that quantifier, known in RDF as blank nodes. This is precisely where the abstract/concrete syntax distinction comes the fore and illustrates why this common way of describing RDF is incorrect. In the RDF abstract syntax, blank nodes really are blank. They have syntactic identity, but no lexical form. They are not bound names: they are simply there, like empty blobs in a node-arc diagram. If a surface syntax for RDF uses some kind of character string to identify each blank node, then such identifiers can be naturally thought of as being bound names, with a 'implicit' existential quantifier binding them and enclosing each RDF graph document (or documents). This is often a helpful (and accurate) way of thinking about the meaning of an RDF notation, since it represents the most direct way to transcribe RDF into a conventional first-order logic notation. But it is a very misleading way to think about RDF syntax.

If blank node identifiers are used in a surface syntax for RDF, then they must be considered to be part of the syntax of that surface notation, not of the RDF graph syntax itself. In particular, their scoping rules, which determine when two such blank node identifiers are supposed to indicate the same blank node, and when they are not, should be provided by the specification of that surface notation. Many possibilities arise. The notation might allow a collection of documents to be treated as having a common scope for blank node identifiers, allowing a single RDF graph to be described by such a collection. Or, it might specify that each document defined a local blank node identifier scope, so that if the same identifier were re-used in another document it would indicate a distinct blank node. Or, it might provide for local scoping of blank node identifiers inside a document, so that a single document could describe several distinct RDF graphs, re-using a blank node identifier to indicate different blank nodes in distinct graphs. Or maybe something altogether more complicated. But, to repeat, these are all issues that are determined by the specification of that notation itself, and are not part of RDF. Issues of name scope simply do not arise when talking about the RDF graph syntax itself.

4. How can a syntax be constructed from entities that do not have any lexical form?

This question often seems to boggle people who are used to dealing with EBNF. The straight answer is: if we are describing the syntax mathematically, then a syntax can be built up from anything we damn well like. We could say that the syntax is built up from numbers, for example, which is what Goedel did to prove his famous theorem. In fact, the "real nature" of things isn't amenable to mathematical description. All that we can do is to characterize the essential syntactic structures mathematically. In the RDF abstract syntax, there are three basic categories: literals, URI references and blank nodes. Literals and URI references are familiar 'syntactic' kinds of entity: they are constructed from lexical character strings, with well-defined lexical forms and with independent lexical definitions. But all the RDF abstract syntax says about blank nodes is that they are members of some set of things which is disjoint from the other two categories. So, one might ask, what kinds of thing are these blank nodes? What do they look like? How will I know one if I see one? But these are not the right questions to ask about an abstract syntax. The abstract syntax doesn't say anything what things are, or look like, and it doesn't require even that its components have any lexical form at all. The whole point of defining blank nodes in this negative way, amounting almost to a non-definition, is to emphasize that as far as the RDF syntax is concerned, they have no structure. They are simply ciphers, playing a purely 'positional' role in the syntax. The only answerable questions about an RDF blank node are, is this a blank node? and, is this blank node the same as that blank node?. These are the only questions concerning RDF blank nodes that a RDF surface notation should undertake to provide answers for, or that are necessary in order to process RDF graphs or to define their meanings and logical relationships.

5. Blank node identity and scope.

It might be objected that the claim made above, that issues of scoping do not arise in RDF, is disingenuous and doesn't really face up to the facts. After all, we seem to often make the implicit assumption that the 'scope' of a blank node is the RDF graph containing it; and what is this mysterious "RDF graph", if not simply a back-handed way of referring to the scope of an existential quantifier? While this objection has some force, to which I will return below, it is important to emphasize that it is technically, mathematically, mistaken. RDF blank nodes do not have a scope local any RDF graph. The way that the RDF graph syntax is described, all three categories of the basic components of every RDF graph - URI references, literals and blank nodes - are treated identically. They all have, if anything, a "global" scope, although it would be more correct to say that they have no scope at all. RDF graph syntax simply does not mention scopes of RDF terms. It is important to keep this point quite clear: as far as the RDF abstract syntax is concerned, two different RDF graphs may share a blank node, just as they may share a URIreference or a literal. Such sharing can, in principle, transcend any barriers of time or space between the representations of the RDF graphs.

I should emphasize immediately that most of the extant surface notations for RDF do not provide for the construction of such an example (of two different RDF graphs which share a blank node) since they all have 'document-local' scoping rules for blank node identifiers. Without some way to cross-refer between such documents, there is no way to describe what one might call 'siamese graphs' of this kind. Nevertheless, there is nothing in the RDF specs which prohibits such a case; and indeed, we have contemplated such a situation in the SPARQL WG by talking about told bnodes, which would create exactly such a shared-blank-node situation between the target graph and a different graph described by the answer document.

It is this very uniformity and simplicity - all components having a global identity - that makes the RDF abstract graph syntax so direct and robust. An RDF graph is simply a set of triples; there are no bound names and no name binders. Substitution of one item for another is simply defined as applying a substitution mapping. Questions of syntactic indistinguishability become simply questions of identity. When stating basic definitions, one can forget about issues of name scoping, accidental name clashes, accidental name capture by enclosing binders, and so on. What are often thought of (and talked of, I will confess) as different name scopes become, in the abstract syntax account, simply disjoint sets of blank nodes: and here, 'disjoint' really does mean the elementary set-theoretic notion of not having any elements in common. The only care that is necessary is to sometimes talk of merging rather than forming a union; but even here, it is important to bear in mind that this is not really a name-scoping distinction, but instead it is just as described, two different mathematical operations defined on mathematical objects.

6. What counts as an RDF graph?

I said above that one possible objection had some force, and it is this: where are the boundaries of an RDF graph? The RDF spec defines an RDF graph to be a set of triples. What it does not say, however, is whether this is sufficient as well as necessary: whether any set of RDF triples counts as an RDF graph. The abstract syntax is simply silent on this matter, so different stances could be adopted. Plato might argue that every set is an RDF graph, so that every RDF graph with N triples has 2|N-1 other graphs associated with it, and there is an infinity of RDF graphs all existing timelessly in a kind of RDF heaven. In reply, Occam might argue that sets are one thing, but the only actual RDF graphs that really exist are the ones that we are obliged to admit because documents describe them, and all the rest are mere ghostly abstractions that have not yet achieved RDF-graph-hood. (The W3 TAG might say that Plato is right but that Occam has a point, that RDF graphs with no describing document aren't Web Resources.) But, the main point I want to make here is that as far as the entire development in the entire RDF spec is concerned, this is simply not an issue. Since both the syntax and the semantics of RDF is defined using the graph abstract syntax, which is all expressed in normal vanilla basic mathematical language, it applies to the sets of triples that are counted as RDF graphs, no matter what criteria one adopts to decide whether or not a given set of triples should be counted as being an RDF graph. If you think it should, then use the specs. If you think not, don't. Either way, the specs are clear on what a subgraph is, what union and merge are, what an interpretation is, what entailment means, and so on. Again, IMO, this illustrates the simplicity and robustness of the basic RDF model.