simple RDF datatype

This is a brief write-up of a very very very simplified version of the datatyping proposal, following Patrick S's 'occams razor slash fest' message. The only special vocabulary that this uses is rdfs:dlex; it abandons rdfs:dtype and rdfs:drange (uses rdfs:range instead).

It may well be that after all this noodling I have simply re-described Sergey's proposal in different language. Well, hey, consensus is what we are after, right??

Pat Hayes 2/21

1. Literals

In RDF, urirefs and blank nodes are both considered to be referring expressions; they are used to denote resources. Literals, however, are best thought of simply as syntactic 'labels' which indicate a lexical form. These lexical forms can be used to restrict the references of other nodes by using datatype schemes, but this use is optional.

A literal without any datatyping information associated with it has no fixed meaning in RDF. It can be thought of as a blank node with a literal 'label' attached to it. This label can be used in association with a datatype to fix the meaning of the literal, but the RDF semantics associates no particular fixed interpretation to a 'bare' literal in the absence of datatyping information. Thus, a triple such as

Jenny ex:age "35" .

in effect means that the value of the property is something that can be indicated by the literal label. RDFS provides a way to say this explicitly:

Jenny ex:age _:x .
_:x rdfs:dlex "35" .

where the second triple asserts simply that _:x is a value which can be represented by the character string. This does not in itself 'fix' the value, of course, but it can be used as a way of making the association between the value and a lexical form explicit, for later use or amplification. We will call this a lexical form triple. A useful way to think of the meaning of rdfs:dlex is: "..can be described by the character string.." or "..can be a value of the literal..."

These two forms - the single triple with a literal as object, and the similar triple with a bnode as object, together with a lexical form triple linking the bnode to the literal - are identical in meaning and can be substituted freely for one another. The first is obviously more compact and often easier to 'read', but the second form provides distinct nodes for the literal itself and for its value, which is sometimes useful.
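The substitution between the two forms can be sketched in Python. This is only an illustration: the representation of triples as tuples of strings, the convention that literals are written with surrounding double quotes, and the names expand_inline_literals and is_literal are all assumptions of the sketch, not part of the proposal.

```python
from itertools import count

DLEX = "rdfs:dlex"

def is_literal(term):
    # Assumption of this sketch: a literal is a string wrapped in double quotes.
    return len(term) >= 2 and term.startswith('"') and term.endswith('"')

def expand_inline_literals(triples):
    """Rewrite each in-line literal triple as a bnode triple plus a lexical form triple."""
    fresh = count(1)
    out = []
    for s, p, o in triples:
        if is_literal(o) and p != DLEX:
            # Hypothetical bnode naming; a real tool would also have to avoid
            # colliding with blank nodes already present in the graph.
            bnode = f"_:b{next(fresh)}"
            out.append((s, p, bnode))
            out.append((bnode, DLEX, o))
        else:
            out.append((s, p, o))
    return out

graph = [("Jenny", "ex:age", '"35"')]
expanded = expand_inline_literals(graph)
for triple in expanded:
    print(" ".join(triple), ".")
```

Going in the other direction (collapsing a bnode and its lexical form triple back into an in-line literal) would be the inverse rewrite, and is safe only when the bnode is not used elsewhere.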

Neither of these forms, by itself, fixes the value of the literal. However, applications are of course free to use 'bare' literals, and to rely on string-matching to resolve questions of identity. Such use amounts to a decision to understand a bare literal as denoting its own label (and to understand rdfs:dlex as identity). It would be risky to rely on such a convention to perform extensive RDFS inferences, however, because this assumption can in general be overridden by other datatyping information. Any inferences based on it would need to be re-checked, and perhaps revised, if datatype information were added to the RDFS graph. Applications that do not make extensive inferences about identity should function in this way without meeting serious problems.

An example of such 'in-line' use of a bare literal to indicate a string is provided by dc:title in the Dublin Core.

2. Datatypes

If the intended meaning of literals is understood by a set of users or applications, then the simple use case illustrated by the above example could be sufficient. This 'untyped' kind of usage is always available in RDF. However, RDF also provides ways to use datatypes to assert that a literal should be interpreted in a particular way.

A datatype is defined abstractly by two domains, one of lexical forms and one of values, and a mapping from lexical forms to values. We assume that a datatype is indicated by a URI, and that some external mechanism is able to recognise a datatype URI and to access and make use of appropriate representations of the domains and map when supplied with the URI. The model theory is stated in terms of a global function L2V from datatypes to the lexical-to-value mapping of that datatype. In the examples below, urirefs which are being interpreted as datatype names will be indicated by the use of the color green.
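One way to picture the 'external mechanism' is as a registry keyed by datatype URI. The following Python sketch is illustrative only: the Datatype class, the REGISTRY dictionary, and the datatype ex:decimalInteger are hypothetical names invented here, and the value space is left implicit as the image of the mapping.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass(frozen=True)
class Datatype:
    uri: str
    in_lexical_space: Callable[[str], bool]  # membership test for the lexical domain
    to_value: Callable[[str], Any]           # the lexical-to-value mapping

REGISTRY = {}

def register(datatype):
    REGISTRY[datatype.uri] = datatype

def L2V(uri):
    """Global function from a datatype URI to that datatype's lexical-to-value mapping."""
    return REGISTRY[uri].to_value

# Hypothetical datatype: non-empty strings of ASCII digits, read in base ten.
register(Datatype(
    uri="ex:decimalInteger",
    in_lexical_space=lambda s: s != "" and all(c in "0123456789" for c in s),
    to_value=int,
))

print(L2V("ex:decimalInteger")("35"))  # prints 35
```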

3. Datatype triples

The simplest way to talk about the value of a literal under a datatype mapping is to use the datatype name in place of rdfs:dlex. The result is called a datatype triple. For example

Jenny ex:age _:x .
_:x xsd:number "35" .

says that Jenny's age is the value of the literal under the datatype mapping xsd:number, i.e. that Jenny is 35.

A datatype triple says that the literal is a legal way to express the value according to the particular rules associated with that datatype's lexical-to-value mapping. The intuitive reading might be "..can be described, according to this datatype mapping, by the character string..".

(Notice that this is 'backwards' from the usual way of understanding a datatype mapping as applying to the lexical form and resulting in the value; the reason for this is simply the RDF syntactic convention that prohibits literals in subject position.)

The datatype triple is the most 'local' style of literal datatyping in RDF. The interpretation imposed on the literal by the datatype is local to this triple. This means for example that the same literal can be used simultaneously in two different such triples, imposing different interpretations on two different nodes. For example, if ex:octalnumber were a datatype property, then we could also assert

ex:octalnumber rdf:type rdfs:Datatype .
Judy ex:age _:y .
_:y ex:octalnumber "35" .

to assert that Judy's age was 29, and both uses of the literal could be used in the same RDF graph. Similarly, two different literal representations of the same value could be specified using two different datatype triples which include the same subject:

_:y ex:USdecimal "12.25" .
_:y ex:germandecimal "12,25" .

Obviously, this only works when the literals do in fact map to the same value under the respective mappings.
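The arithmetic behind these examples can be checked with a few lines of Python. The mappings below are guesses at what datatypes named ex:octalnumber, ex:USdecimal and ex:germandecimal would plausibly do; none of them is defined by this write-up, and the function names are invented for the sketch.

```python
from decimal import Decimal

def decimal_number(lexical):
    # Assumed reading of the numeric datatype used for Jenny: base-ten integer.
    return int(lexical, 10)

def octal_number(lexical):
    # Assumed reading of ex:octalnumber: base-eight integer.
    return int(lexical, 8)

def us_decimal(lexical):
    # Assumed reading of ex:USdecimal: '.' is the decimal separator.
    return Decimal(lexical)

def german_decimal(lexical):
    # Assumed reading of ex:germandecimal: ',' is the decimal separator.
    return Decimal(lexical.replace(",", "."))

# The same literal label yields different values under different datatypes.
print(decimal_number("35"), octal_number("35"))  # prints 35 29

# Two different literal labels can yield the same value.
print(us_decimal("12.25") == german_decimal("12,25"))  # prints True
```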

It is possible to 'lock down' all literals in an RDF graph, forcing them to be interpreted by a single datatype, by requiring rdfs:dlex to be a subproperty of that datatype property. For example, the literals-are-strings assumption mentioned in section 1 can be made explicit and enforced by referring to the XML string datatype xsd:string as follows:

rdfs:dlex rdfs:subPropertyOf xsd:string .

Such usage should be treated with caution, however, as it applies to all assertions in the graph, including any that are imported from other sources.

3.1 Domains and ranges of datatype maps.

We make one additional assumption concerning the use of datatype properties: they have exact domains and ranges. Normally in RDFS, an assertion about a range:

ppp rdfs:range ccc .

is understood to say that the precise range of ppp is a subset of the class ccc. This allows RDFS to combine multiple range assertions coherently and reflects the fact that the language has no way to express a 'lower bound' on the membership in a class. However, we will assume that for datatype properties, such an assertion is true only when ccc is the exact range of the property, no more and no less. This exact range is the lexical space of the datatype, so:

ppp rdfs:range ccc .

asserts that the class ccc is precisely the set of lexical forms that are acceptable to the datatype ppp. For example, one could write the following as a way to ensure that a property is applied only to literals of an appropriate form:

ex:UScalendarDate rdfs:range _:x .
ex:birthdate rdfs:range _:x .
ex:Jenny ex:birthdate "05-08-02" .

This is worth a little careful analysis. Notice that this does not assert that Jenny's birthdate is a value of the datatype in the first triple (we show how to do that in the next section). What this says is that Jenny's birthdate is the literal itself, as outlined earlier; the literal is used in-line, and not interpreted via a datatype. But it also asserts that the literal is in the lexical space of that datatype. (To see why, notice that it is obviously required to be in the range of ex:birthdate; and that in turn is required to be a subclass (not necessarily a proper subclass) of the range of the datatype mapping, which in turn is precisely the lexical space of the datatype.)
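A lexical-space check of this kind might look as follows in Python. The write-up does not define ex:UScalendarDate, so the MM-DD-YY shape assumed here is a guess, and the function name is hypothetical.

```python
import re
from datetime import datetime

# Assumed shape of the lexical space: two digits each for month, day and year.
_US_DATE_SHAPE = re.compile(r"\d{2}-\d{2}-\d{2}")

def in_us_calendar_date_lexical_space(lexical):
    """True if the string is an acceptable lexical form under the assumed MM-DD-YY reading."""
    if not _US_DATE_SHAPE.fullmatch(lexical):
        return False
    try:
        datetime.strptime(lexical, "%m-%d-%y")
    except ValueError:
        return False
    return True

graph = [("ex:Jenny", "ex:birthdate", '"05-08-02"')]

# Collect literal objects of ex:birthdate that fall outside the lexical space.
violations = [
    o for s, p, o in graph
    if p == "ex:birthdate" and not in_us_calendar_date_lexical_space(o[1:-1])
]
print(violations)  # prints []
```

Note that the check only tests membership in the lexical space; it does not turn the literal into a date value, which matches the in-line reading described above.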

4. Attaching a datatype to a literal through a property.

As pointed out above, the two forms

Jenny ex:age "35" .

and

Jenny ex:age _:x .
_:x rdfs:dlex "35" .

are synonymous. The second of these can be modified (or extended) to incorporate datatyping information, but the first has no obvious 'place' to put a datatype property. However, RDFS allows one to attach a datatype to a property such as ex:age in such a way that all uses of that property impose the datatype on the object of the triple, by asserting the datatype to be the range of the property:

ex:age rdfs:range xsd:number .

Jenny ex:age "35" .

means that Jenny's age is 35. The rule here is that the object of the rdfs:range triple (the datatype) must be used to interpret any literal used as an object of the property, anywhere in the same graph.

Since the in-line and lexical-triple forms are identical in meaning, rdfs:range produces a similar effect when used with rdfs:dlex, for example:

ex:age rdfs:range xsd:number .

Jenny ex:age _:x .
_:x rdfs:dlex "35" .

also asserts that Jenny's age is 35, in exactly the same way. The rule here is that the object of the rdfs:range triple (the datatype) must be used to interpret any lexical form triples whose subject is used as an object of the property.

This can be formally captured by a special closure rule which inserts the appropriate datatyping triple, as follows:

If the graph contains:

ppp rdfs:range ddd .
xxx ppp uuu .
uuu rdfs:dlex LLL .

then add the triple:

uuu ddd LLL .
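The closure rule can be sketched as a small Python function. Triples are represented as tuples of strings, which is a choice made for this sketch only, and the function name apply_range_closure is hypothetical. The rule is applied only when ddd is a known datatype name; the set of such names is passed in explicitly, since recognising datatype urirefs is left to external machinery.

```python
def apply_range_closure(triples, datatype_names):
    """Add 'uuu ddd LLL' wherever the graph matches the three-triple pattern."""
    graph = set(triples)
    # (ppp, ddd) pairs from 'ppp rdfs:range ddd' where ddd names a datatype.
    ranges = {(s, o) for s, p, o in graph
              if p == "rdfs:range" and o in datatype_names}
    # Lexical form triples indexed by subject: uuu -> {LLL, ...}
    lexical_forms = {}
    for s, p, o in graph:
        if p == "rdfs:dlex":
            lexical_forms.setdefault(s, set()).add(o)
    added = set()
    for ppp, ddd in ranges:
        for xxx, p, uuu in graph:
            if p == ppp:
                for lll in lexical_forms.get(uuu, ()):
                    added.add((uuu, ddd, lll))
    return graph | added

graph = [
    ("ex:age", "rdfs:range", "xsd:number"),
    ("Jenny", "ex:age", "_:x"),
    ("_:x", "rdfs:dlex", '"35"'),
]
closed = apply_range_closure(graph, {"xsd:number"})
print(("_:x", "xsd:number", '"35"') in closed)  # prints True
```

A single pass suffices here because the added triples are datatype triples, which never match the left-hand side of the rule.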

These rules apply to any such usage of the property anywhere in the RDF graph, so this way of using a datatype has a much wider 'scope' than a datatyping triple, and therefore needs to be used with care. For example, if several different literals are linked to a single node, then long-range datatyping can produce a conflict:

ex:age rdfs:range xsd:number .

Jenny ex:age _:x .
_:x rdfs:dlex "37" .
_:x rdfs:dlex "29" .

The blank node here is required by the datatype triple to have two distinct values at the same time. This situation is called a datatype clash, and is best avoided.

A datatype clash is not strictly speaking an inconsistency in RDFS, but it does mean that the datatyping information in the graph is internally incompatible. Exactly what happens in such a case is not determined, and depends on the external datatyping machinery. Possible responses include posting an error condition, or choosing one of the possible datatype mappings at random, or according to some external priority schedule. As none of this would be recorded in the RDF graph itself, however, such behavior should not be relied upon.

One sure way to avoid datatype clashes is to never use rdfs:dlex with two different literals on the same node, or use any datatyped property (such as ex:age in our example) with more than one literal with the same subject.
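An application that wants to report clashes rather than leave them to chance could scan the datatype triples for nodes that end up with more than one value. The Python sketch below assumes triples are tuples of strings, that literals are written with surrounding double quotes, and that the caller supplies the lexical-to-value mappings; the function name find_datatype_clashes is hypothetical.

```python
from collections import defaultdict

def find_datatype_clashes(triples, l2v):
    """Return {node: set of values} for nodes forced to have several distinct values.

    l2v maps a datatype uriref to its lexical-to-value function.
    """
    values = defaultdict(set)
    for s, p, o in triples:
        if p in l2v:
            # Strip the surrounding quotes to get the literal label.
            values[s].add(l2v[p](o[1:-1]))
    return {node: vals for node, vals in values.items() if len(vals) > 1}

# Datatype triples as they would stand once the range-based rule has been applied.
datatype_triples = [
    ("_:x", "xsd:number", '"37"'),
    ("_:x", "xsd:number", '"29"'),
]
clashes = find_datatype_clashes(datatype_triples, {"xsd:number": int})
print(clashes)
```

Two datatype triples on the same node are not reported when they agree on the value, which is the situation of the two decimal notations in section 3.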

We note that we are here using datatype urirefs both as properties and as class names. This is quite legal in RDF, and the usage we have adopted in fact requires that the two are related: the datatype class is the value space of the datatype. Formally, the following is true for any datatype ddd:

ddd rdfs:domain ddd .

If you find this odd, feel free to ignore it, and think of rdfs:range when applied to a datatype name simply as connecting the datatype to the property as a kind of 'macro' which forces literals to be properly interpreted and 'translates' occurrences of rdfs:dlex into the appropriate datatype property.

What if you want to say that the range of a property is a datatype class without automatically triggering long-range datatyping? Easy: use a different name for the class. Or use a bnode, as follows:

_:x rdfs:subClassOf ddd .
ddd rdfs:subClassOf _:x .
ex:myproperty rdfs:range _:x .

The range of ex:myproperty is now a class which is both a superclass and a subclass of the datatype class ddd. This is exactly the 'same' class (it has the same things in it) but called by a different name or, as in the example, no name at all. The rdfs:range assertion therefore doesn't mention any datatypes, and so does not make any connection to literals. The datatyping only attaches to the actual name used in the rdfs:range triple, and if that is not a datatype name, then nothing special happens.

5. Model theory

Basic assumptions about datatypes are as follows. Suppose I is an rdfs-interpretation of a graph E. Then I is a datatyped interpretation just when:

(1) IEXT(I(rdfs:dlex)) = {<x, x>}

For any datatype uriref ddd:

(2) IEXT(I(ddd)) = {<x,y>: x=L2V(I(ddd))(y)} i.e. the inverse relation of the datatype mapping
(3) ICEXT(I(ddd)) = {x : <x, y> in IEXT(I(ddd)) } i.e. the value space of the datatype
(4) If LLL is a literal node with the literal label "lll" and E contains any one of the following:

aaa rdfs:range ddd .
bbb aaa LLL .

or

aaa rdfs:range ddd .
bbb aaa ccc .
ccc rdfs:dlex LLL .

or

ccc ddd LLL .

then LV(LLL) = L2V(I(ddd))(lll), i.e. the literal value is the value of the literal under the datatype.
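Conditions (2) and (3) can be made concrete on a finite sample of lexical forms. The Python sketch below is only an illustration: a real lexical space is typically infinite, the octal mapping stands in for some datatype ddd, and the function names iext and icext are chosen here to echo the notation.

```python
def iext(l2v, lexical_forms):
    """Condition (2): pairs <x, y> with x = L2V(y), i.e. the inverse of the mapping."""
    return {(l2v(y), y) for y in lexical_forms}

def icext(extension):
    """Condition (3): the subjects of the pairs, i.e. the value space."""
    return {x for x, _ in extension}

def octal(lexical):
    # Stand-in lexical-to-value mapping for the sketch: base-eight integers.
    return int(lexical, 8)

sample = ["35", "043", "7"]
extension = iext(octal, sample)
print(sorted(extension))          # prints [(7, '7'), (29, '35'), (35, '043')]
print(sorted(icext(extension)))   # prints [7, 29, 35]
```

The pairs have the value first and the lexical form second, which is the 'backwards' orientation forced by keeping literals out of subject position.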

------------------------------------

Then the usual definitions of datatype-satisfiable, datatype-entails, and so on follow. However, there won't be such a neat 'entailment lemma' relative to a nice neat notion of closure. Sorry about that :-)

-------------------------------------

BTW, this assumes untidy literal nodes. With a few deft tweaks to the MT we could manage with tidy literals, in fact: but if they were ever allowed to be subjects of triples, that would completely kill the tweaks and we would have to allow untidy literals again, so I wonder if it is worth it.