XML Query data model, algebra, and stuff

From: Dan Connolly (connolly@w3.org)
Date: 10/11/01

I said in this week's telcon that I'd send something
about XML Query and XML data models and such...
[at least I think so; I don't see any minutes yet...]

Hmm... where to begin...

I'm sure folks are familiar with this notion...

  [[[ ... XML is isomorphic to the subset of Lisp data
     where the first item in a list is required to be atomic. ]]]
  -- 1998: Advice for XML, W3 and ICE
  Fri, 07 May 1999 20:56:22 GMT
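
To make the analogy concrete, here is a minimal Python sketch (not from
the quoted article) that maps an XML element to a nested list whose first
item is the atomic tag name. The helper name to_sexpr is made up for
illustration, and attributes, comments, and namespaces are ignored.

```python
import xml.etree.ElementTree as ET

def to_sexpr(elem):
    """Return a nested list; the first item is always the (atomic) tag name."""
    items = [elem.tag]
    # Text directly inside the element, before any child element.
    if elem.text and elem.text.strip():
        items.append(elem.text.strip())
    for child in elem:
        items.append(to_sexpr(child))
        # Text that follows the child but still belongs to this element.
        if child.tail and child.tail.strip():
            items.append(child.tail.strip())
    return items

doc = ET.fromstring("<p>Hello <em>Lisp</em> world</p>")
sexpr = to_sexpr(doc)
print(sexpr)  # ['p', 'Hello', ['em', 'Lisp'], 'world']
```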

So... if XML is like Lisp data, what's the equivalent
of car/cdr/cons?

The XML 1.0 spec doesn't say. It goes to great lengths
to say which character strings are good XML and which
are not, and it strongly hints at an element/attribute/content
structure, but it never quite nails down the details, formally.

Since then, W3C specs have come up with a variety
of answers to the question "what's the data
model for XML?" A working group was chartered
to specify the "XML Information Set", which was
intended to give the missing details from the XML 1.0
spec; this was supposed to be aligned with the
W3C Document Object Model, which is a sort of oddly
named API for XML. The XML infoset is just about done,
finally. Meanwhile, the XSLT folks came
up with another data model for XML: the XPath data model.
Then the XML Query folks wrote a data model from
their perspective; they were mostly coming from
the type theory and database theory angle.

Why all these different specs and models? Can't we all
just get along? What are the differences?

Without going into "why?" too much...

The XML infoset has observers (car/cdr) but no
constructors (cons). So it skates by without answering
lots of questions about identity and such.
It's also bogged down with historical stuff like
entities, XML 1.0 attribute types (CDATA vs. NMTOKEN), etc.

The XPath 1.0 data model ignores the historical stuff (yeah!)
and it has a clear notion of node identity (XSLT even has
a generate-id() function that gives a name for each
node that's required to be 1-1) but it doesn't have
constructors per se, and it's not strongly typed.
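
As a rough illustration of identity as opposed to structural equality,
the sketch below gives each element a distinct name in document order,
loosely in the spirit of generate-id(). The helper generate_ids is
hypothetical and covers element nodes only; it is not how any XSLT
processor is specified to do it.

```python
import xml.etree.ElementTree as ET

def generate_ids(root):
    """Map each element node to a distinct name, numbered in document order."""
    # Elements hash by object identity, so look-alike nodes get separate keys.
    return {node: f"n{i}" for i, node in enumerate(root.iter())}

root = ET.fromstring("<a><b/><b/></a>")
ids = generate_ids(root)
first_b, second_b = root[0], root[1]

# Same tag, same (empty) content, but two different nodes with two names.
print(first_b.tag == second_b.tag)
print(ids[first_b] != ids[second_b])
```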

The XML Query folks put a lot of stock in strong typing.
They want to know the type of the result of a query
statically. A sticking point is operators like parent();
it's next to impossible to know the type of a parent()
statically. The XML Query data model incorporates
the whole type system from XML Schema -- not just the
primitive int/date/string types, but also the complex
types that range over element "shapes".
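
One way to see the asymmetry, sketched with a made-up toy schema (the
element names and the dict encoding are assumptions, far simpler than
XML Schema): the static type of a child step is a lookup in the parent's
content model, while the type of a parent step needs every declaration
in the schema and comes back as a union.

```python
# Hypothetical toy schema: element type -> element types allowed as children.
SCHEMA = {
    "book": {"title", "chapter"},
    "chapter": {"title", "para"},
    "title": set(),
    "para": set(),
}

def child_types(elem_type):
    """Going down: read the content model of one declaration."""
    return SCHEMA[elem_type]

def parent_types(elem_type):
    """Going up: scan the whole schema for every type that may contain it."""
    return {parent for parent, kids in SCHEMA.items() if elem_type in kids}

print(sorted(child_types("chapter")))
print(sorted(parent_types("title")))
```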

When I first took a close look at the XML Query data
model and algebra, I really liked a lot of it: it
was quite precise, formally, using stuff like inference
rules for the semantics. And it had constructors
(a la cons) as well as observers (a la car/cdr).

But... to my disappointment, while the constructors
looked like mathematical functions, they weren't.
It wasn't the case that
	elemNode("anElt", []) = elemNode("anElt", [])
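
A sketch of the distinction in Python rather than the algebra's
notation. Both constructors below are hypothetical; the first mints a
fresh node identity on every call, which is one plausible reading of why
the equation fails, and the second behaves like a mathematical function.

```python
from dataclasses import dataclass
from itertools import count

_fresh_ids = count()  # hypothetical source of node identities

class IdentityNode:
    """A node whose equality is its identity, not its content."""
    def __init__(self, name, children):
        self.name = name
        self.children = list(children)
        self.node_id = next(_fresh_ids)  # a new identity on every call

    def __eq__(self, other):
        return isinstance(other, IdentityNode) and self.node_id == other.node_id

    def __hash__(self):
        return hash(self.node_id)

@dataclass(frozen=True)
class ValueNode:
    """A node that is fully determined by its name and children."""
    name: str
    children: tuple = ()

def elem_node_with_identity(name, children):
    return IdentityNode(name, children)

def elem_node_as_value(name, children):
    return ValueNode(name, tuple(children))

# Same arguments, different results: not a function in the mathematical sense.
print(elem_node_with_identity("anElt", []) == elem_node_with_identity("anElt", []))
# Same arguments, equal results.
print(elem_node_as_value("anElt", []) == elem_node_as_value("anElt", []))
```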

I raised this as an issue:

  XML query constructors: not well-defined
  Dan Connolly (Thu, Apr 12 2001) 

The editor got back to me 2 May
and again 14 Jun
but I'm still not sure if I'm satisfied.

Meanwhile, a lot has changed, and I'm not sure if the issue
I raised is still relevant.

The XPath folks decided they were interested in the XML Schema
type system too, and by and by, the XML Query and XPath folks
merged their stuff:

  XQuery 1.0 and XPath 2.0 Data Model
  (replaces the former XML Query Data Model),
  last release 7 June 2001 

  -- http://www.w3.org/XML/Query

I haven't read this more recent stuff. Has anybody else? You
might also be interested in

  XQuery 1.0 Formal Semantics
  (replaces the former The XML Query Algebra),
  last release 7 June 2001 

not to mention the XQuery spec itself:

  XQuery 1.0: An XML Query Language, last release 7 June 2001 

More on XQuery and/vs. query-with-inference separately...

Dan Connolly, W3C http://www.w3.org/People/Connolly/
