Re: Query Language Issues

From: Ian Horrocks (horrocks@cs.man.ac.uk)
Date: 11/15/01


On November 13, Pat Hayes writes:
> >This message is to reply to the points in your earlier messages that I
> >have not replied to --
> >
> >>  > I think all that matters for our formalization and core
> >>  > design work is that a query answer may make use of the
> >>  > bindings to only a subset of the variables in the query
> >>  > pattern.  So, my recommendation is that for now we consider
> >>  > a query answer to consist of a set of bindings for a subset
> >>  > of the variables, call them “query variables”, and that the
> >>  > query specifies which variables are in that subset.
> >
> >[the above excerpt has been edited to match the message I sent out last
> >evening.]
> >
> >>  That would be my natural inclination also, but I think it needs to be
> >>  extended some. For example, we might want to distinguish between the
> >>  case where the query is said to be OK but no bindings are provided,
> >>  from the case where the query is simply answered with 'no' or 'fail'
> >>  or whatever. Ian wants to distinguish the 'don't know' (ie unprovable
> >>  from KB) answer case from the 'no' (ie, contradictory with the KB)
> >>  cases. So in general there might be other information in an answer
> >>  than just the bindings alone.
> >
> >Yes, there will be more information in a query result than just bindings
> 
> OK, but I feel we are going in circles here. Doesn't the first-quoted 
> paragraph above say the exact opposite of this? (Or are you making a 
> distinction between a query ANSWER and a query RESULT ?)
> 
> >(and more information in a query than just a query pattern).  That is a
> >core motivation for the various components of the query language design
> >we have been putting together and I have been proposing to the
> >committee.
> >
> >If a query has no query variables, then a "yes" answer to the query
> >would be represented by a query answer containing an empty set of
> >bindings.  When no answers are found for a query, the query result would
> >contain no query answers. 
> >
> >Note that the proposal is for a query result to contain a specification
> >of how many query answers are entailed ("# answers entailed") in the
> >form of an ordered paired denoting a closed interval.
> 
> As you know from the telecon, I do not like this proposal. Well, it 
> depends on what it means. If this is just a matter of keeping count 
> of the number of bindings produced, then of course it is harmless, 
> though also kind of pointless (and there is no need for the closed 
> interval.)  But I gather you mean this to be a much more powerful 
> idea, to be used to convey intensional information about cardinality 
> of the answer set even in cases where this cannot be inferred from 
> the KB itself. I do not think that we should undertake to propose 
> such a construct in a querying language unless we have a VERY good 
> idea about how these answers are going to be determined, in general.

Just to elaborate on my explanation of the problem. Let's say that
republican-sympathiser is defined as "republican or hasClass
married-to republican", where married to is a UniqueProperty. If Joe
is married to Mary, and Joe is a republican-sympathiser, then we know
that one of the two must be a republican, but asking if each of the
individualy is a republican will give the answer NO (or at least don't
know). 

If Joe and Mary are the parents of John, and we ask how many
republican-sympathiser parents John has, then independantly checking
the political allegiance of each parent would give the wrong
answer. We could answer the question correctly (min 1, max 2), but
this requires a much more complex query answering procedure, i.e., to
find the min value we could ask if John is an instance of the class
(minCardinality n parent republican), incrementing n until we get a
negative answer.

Things are even worse if we ask a "counting" query where the count
isn't relative to a single individual as, in this case, we can't use a
simple cardiality test as above. Imagine, for example, asking how many
republican ancestors John has (where ancestor is transitive), or how
many republicans there are overall. In principal, we may be able to
answer such questions using a similar technique to the one above with
a "spy point" individual (one that is connected to every other object
via some property), but this gets us into an area where we don't know
how to do "practical" reasoning. Moreover, the procedure is not, in
general, terminating - as we know, we could have a KB for which, in all
models, John has infinitely many republican ancestors.

To sum up - we don't know if such queries are theoretically decidable,
and even if they are we do know that no existing reasoner will be able
to (correctly) answer them.

> >  So, the value of
> >#-Answers-Entailed distinguishes the 'don't know" case from the 'no'
> >case in that if the server produces no answers and doesn't know how many
> >answers there are, it would produce a query result with
> >#-Answers-Entailed set to [0,0-0], but if the server knows there are no
> >answers, then it would produce a query result with #-Answers-Entailed
> >set to [0,0].
> >
> >>  >, and that the query specifies which query
> >>  >variables are in that subset.  I will make that assumption in the
> >>  >remainder of this document.
> >>
> >>  Wait. The *query* specifies which query variables are in the set? So
> >>  there could be query variables which are not being, as it were,
> >>  queried? (Then why did you call them query variables?)
> >
> >In response to your messages, I have changed my terminology about
> >variables in the query pattern in the proposal I sent out last evening
> >(as per the excerpt repeated above).  I.e., I am now calling "query
> >variables" the subset of the variables in the query pattern that are
> >specified by the query to be in the query answer.  I think that
> >corresponds to the terminology you have been using in your messages.
> 
> I gather than Ian now wants us to use a different terminology.

I'm simply suggesting that we stick with the established terminology
from DB theory. In the DB case, a query is an expression containing
variables. Variables whose bindings are to be returned are called
distinguished variables, and the others as non-distinguished.  If
there are no distinguished variables, then it is called a boolean
query - the answer is true if every model contains some binding
matching the query, false otherwise.

> 
> >  > >Uniqueness of Answers
> >  > >
> >>  >In general, a query result may contain multiple query answers.
> >>  >
> >>  >ISSUE: What guarantees do we want to make about the distinctiveness of
> >>  >multiple query answers in a query result?  We may want to guarantee that
> >>  >no two answers consist of identical sets of bindings.
> >>
> >>  Why would we want to do that? This might be useful information, eg if
> >>  one queries
> >>  [?x ?y](exists (?z)(and (P ?x ?z)(P ?z ?y)))
> >>
> >>  (Sorry about the KIF; query variables in brackets)
> >>
> >>  then getting back two copies of the binding might allow you to infer
> >>  (if the KB is closed) that there are precisely two ?z's involved. (If
> >>  P is parent-of, this detects  a certain family arrangement which is
> >>  universally frowned upon, for example.)
> >
> >I thought the committee agreed that each query answer would have a
> >distinct set of bindings of the query variables.  (The minutes of the
> >9/25 telecon say: RESOLVED: DQL should provide some mechanism for
> >returning subsets of the variables mentioned in the query. The user can
> >get
> >multiple occurrences of the same bindings for some variables by adding
> >the variables occurring along multiple distinct paths.)
> 
> The issue in which this arose was whether we should allow a query to 
> specify that duplicate bindings should be suppressed., as in SQL's 
> 'select distinct'. Providing a mechanism to allow suppression of 
> duplicates is not a decision to make duplications illegal. In fact, 
> in the context indicated, I think that the last sentence of the 
> resolution can be read as stating that duplications *are* permitted, 
> by exactly the kind of mechanism suggested.

Supression of duplicates may suffer from similar problems to
counting. E.g., if we know that John, Paul, George and Ringo are
friends of Peter, and we also know that Peter has at most 2 friends,
then what answer do we return when asked for the friends of Peter, in
either "distinct" or "normal" mode? (We already asked how many friends
Peter has and were told exactly 2!)

> >  Note that in
> >your example, ?z is not a query variable and so wasn't ask about by the
> >query.  The rationale for requiring binding sets to be distinct is that
> >if one wants to distinguish cases involving multiple bindings of
> >variables in the query pattern (e.g., ?z in your example), then those
> >variables should be included as query variables.  So, in your example,
> >if you want to find out how many ?z's are involved, then include ?z as a
> >query variable.
> >
> >>  > My recommendation is that we guarantee there are no identical
> >>  >sets of bindings
> >>
> >>  I disagree. This is both needless, and onerous for the implementer.
> >
> >Are you disagreeing that an answer result should not have query answers
> >having the same set of bindings?
> 
> Yes, I am disagreeing with that. I think that our decision was to 
> allow the querier to specify whether or not duplications are 
> permitted, and I think the default case should be that they are, as 
> it puts less of a burden on the query-answering engine.

I agree, and as you see in my above point, it isn't even clear that
feedom from duplication is computable.

> I would like us to invent a very simple, basic, vanilla 
> query-answering engine which rapidly produces accurate, if dumb, 
> answers, and whose answer stream, if I may express it in that way, 
> can be 'filtered' by smarter query-answering interfaces to give more 
> sensitive kinds of answers. Removing duplicates would be a very 
> natural kind of filter. Other kinds of smarter query-answering 
> interface might do things like check for identity, try drawing 
> conclusions about cardinalities, etc. .

I agree. In fact this is what I have been saying all along. As we are
discovering, there are all sorts of subtle difficulties waiting to
trap the unwary. Pushing some of the tricky stuff into post processing
means that a more opperational semantics can be applied to these
tasks, e.g., "how many republicans are there?" could just return the
size of the set of answers that the query engine computed. It also
gives more freedom to applications to do things in a way that makes
sense to the particular application (actually, I'm not quite sure if
this is a good thing or not :-)). In such a scheme, an application
that REALLY needed to answer counting questions in some situation
could (try to) do so by using multiple cardinality queries as outlined
above.

> BTW, given that you want to allow cardinalities on numbers of 
> answers, aren't you obliged to at least *count* duplications?
> 
> >If so, that is a substantive
> >disagreement that needs further airing.  Requiring the binding sets to
> >be distinctive seems to me to be a rather basic requirement.
> 
> Why?
> 
> >  What am I
> >missing?
> >
> >>  >and that we enable a query to include an indicator as
> >>  >to whether equal sets of bindings are acceptable.
> >>
> >>  Again, I disagree. First, who is responsible for knowing that two
> >>  things are equal? The KB may not have the resources to be able to
> >>  prove this, but the querying engine might. But in any case, what is
> >>  the utility? We can allow people to include statements of equality in
> >>  the query if this issue bothers them.
> >
> >So, this is more on the issue of what is acceptable as multiple answers
> >to a query.  I don't see how one could include in a query pattern that
> >multiple answers must denote different objects.
> 
> No, but one could in a subsequent query ask whether two of the 
> answers to a previous query were identical or not. (This is a good 
> example of how allowing multiple queries allows each query to be 
> simpler, by the way.)

Yes, but equally, this is a good example of the kind of thing that one
could just leave to a post processor - i.e., answer with multiple
simple queries, but with no "linkage" between consecutive queries. Of
course, this wouldn't allow you to check for duplicates between
un-named answers, but it isn't clear if that would be feasible in any
case.

> >The query pattern only
> >specifies relationships that must hold among a set of objects that are
> >*one* answer to the query.  The issue here is what is allowable as
> >multiple sets of objects, each set of which constitutes a query answer.
> >
> >By the way, it is the server that would be determining that equivalentTo
> >holds for two constants.
> 
> Yes, of course, but my point above is that the querier may have 
> access to information that isn't available to the KB. (BTW, I really 
> dislike that term 'server'; it suggests an asymmetry between the 
> querier and queried that isn't likely to survive in the broader SW 
> world, I think.)
> 
> >
> >We could, indeed, not provide for a query to include the requirement
> >that multiple answers denote distinctive sets of objects.  The client
> >could then test the distinctiveness of multiple answers with follow-up
> >queries (using differentIndividualFrom).  Note that such testing would
> >require many additional queries when there are multiple query variables
> >and/or more than two sets of bindings.
> 
> Right. If you want to know all this, you have to do more work. But if 
> you weren't interested in knowing it, you shouldn't be obliged to do 
> all that work. If you just want a quick dirty answer, you should be 
> able to get one at minimal cost.
> 
> Notice also that these subsequent queries might be entirely generated 
> inside a more sophisticated query-answering engine which would 
> 'compile' a sophisticated query into a series of smaller, dumber, 
> queries and then ask them all in turn (keeping track of the common 
> ground being created, ie the state of the 'conversation'), and 
> finally deliver its more sophisticated answer. That is, we can 
> imagine building more sophisticated query engines on top of less 
> sophisticated ones, so we do not have to get it all done at once, 
> maybe :-).

Maybe. In any case, I again STRONGLY support the idea of specifying a
simple languge - into which more complex queries could be "compiled"
by a query processor. Surely this is what DB systems do (SQL is really
just the "user interface").

> >So, I think enabling a query to include a request for answers to denote
> >distinct sets of objects is an optional feature that we can decide do
> >include or not in our query language.
> >
> >>  >I assume our query language needs to enable a query to specify what is
> >>  >being asked for and our query result language needs to enable a query
> >>  >result to contain the information requested in a query.
> >>
> >>  You havn't said what this means, though. What IS the information
> >>  requested in a query? (Not just a debating point, this seems like a
> >>  central issue that we need to get clear, as many other things would
> >>  then be also clarified, I think.)
> >
> >My message that I resent last evening laying out issues and proposals
> >describes what I mean by the information requested in a query.  That
> >information includes bindings for the query pattern, the number of
> >answers entailed by the KB, and justifications for answers.
> 
> OK, but you need to say exactly what counts as a 'justification'. 
> That term has no self-evident meaning in logic or DL. (See later.)

As I have said before, at this stage, considering "fancy" add-ons like
justification seems to be getting ahead of ourselves to a ludicrous
extent - we don't even know if we can answer any queries yet, much
less justify them!

> >>  This all seems rather like overkill. Do we really want to get into
> >>  infinite ordinals?
> >>
> >>  Why not have the following distinctions:
> >>  query forms:
> >>  1.  one answer (ie first one found)/
> >>  2. (up to) N answers/
> >>  3. all answers.
> >>  replies:
> >>  1. an answer or 'not provable'/
> >>  2. list of M answers, for some M<=N; if M<N then this means these are
> >>  all the answers/
> >>  3. some list of answers.
> >>  Issues: (1) what if there are infinitely many answers? Should we
> >>  provide some way to indicate this even though not all the bindings
> >>  can be provided?
> >>  (2) what about time? For example, it might be useful to get the first
> >>  answer when it is available, even if more have been requested. Which
> >>  suggests another dimension, viz. do you want the answers all at once
> >>  when the entire list is done, or each one when it is fresh?
> >
> >I think I have covered all of those cases in my proposals, and rather
> >succinctly I would claim.  I don't want to go back through the proposals
> >here again comparing them to your list above.  I will do so if you
> >insist.
> 
> But I am suggesting that this is *simpler* than your proposal. It is 
> less expressive, but I suggest that it would be adequate, and easier 
> to implement. Obviously you can express all these distinctions using 
> closed intervals; the questions are, do you need that more expressive 
> notation (why?), and do you have any idea how those more general 
> queries could be answered?
> 
> And BTW, what about the second issue raised above, about timely 
> delivery of answers when others are still being discovered (? If you 
> have covered this somewhere, forgive me for missing it.)
> 
> >
> >>  >Justification For Answers
> >>  >
> >>  >Under the assumption that our query language needs to enable a query to
> >>  >specify what is being asked for and our query result language needs to
> >>  >enable a query result to contain the information requested in a query,
> >>  >the query language needs to enable inclusion in a query a request for a
> >>  >justification for each query answer and the query result language needs
> >>  >to enable inclusion of such justifications in a query result.
> >>
> >>  I don't see how this follows.  Again, I think we need to get clearer
> >>  on what exactly counts as the information requested. If I hold up a
> >>  query then I am asking the KB to prove it for me. What do I want
> >>  back: the fact that it has been proved, some of the bindings in the
> >>  proof, the expressions used as assumptions in the proof, the entire
> >>  proof itself, all the proofs, or what? The last is probably the most
> >>  general form, but it would be overkill for many querying
> >>  applications. Still, it might be worth thinking of 'information
> >>  requested' as being ultimately the entire set of proofs, and then
> >>  defining various kinds of 'pruning' or simplification of this. That
> >>  would certainly cover the variable-bindings, for example, and it
> >>  might also cover justifications. (BTW, what is a justification,
> >>  exactly??)
> >
> >As I said in my message, we have the following issue:
> >
> >>  ISSUE: Whether and how to include provisions for justifications
> >>  to be requested in a query and be included in a query result.
> >>  What kind of justification language(s) do we include in our query
> >>  results language?
> >
> >I think we need to enable queries to ask for justifications and for
> >query answers to include justifications, but that is only my opinion.
> 
> Since you yourself say that you don't know what 'justification' 
> means, it hardly amounts to an opinion at present.
> 
> >  I
> >thought the need for justifications to accompany results in the Semantic
> >Web was one of the compelling points that Jim Hendler made repeatedly.
> >If we believe that point, then it seems to me we need to include a
> >provision for justifications in our query language.
> >
> >Now, what is a justification?  I don't have an answer to that.
> 
> Well then let us work on answering it. (Passing resolutions using 
> words when we have no idea what they mean is just pissing into the 
> wind.)
> 
> As  a start, let me suggest that a 'maximal' justification - the most 
> information that one could possibly hope to get from any kind of 
> deductive query-answering mechanism - is a fully annotated proof (in 
> some suitable format) of the query from the KB, including all the 
> antecedents used in the proof, presented in such a way that it could 
> be checked mechanically with no more computational effort than was 
> used to generate it. Any kind of query that expects more than this 
> would be asking too much.
> 
> Pat
> 
> -- 
> ---------------------------------------------------------------------
> IHMC					(850)434 8903   home
> 40 South Alcaniz St.			(850)202 4416   office
> Pensacola,  FL 32501			(850)202 4440   fax
> phayes@ai.uwf.edu 
> http://www.coginst.uwf.edu/~phayes


This archive was generated by hypermail 2.1.4 : 04/02/02 EST