Our work on URDF is motivated by the rapid growth of Web-extracted, semantic knowledge bases in recent years. We believe that query answering over Web-extracted RDF facts provides one of the best showcases to date for the management of uncertain data, and in particular for probabilistic databases. A guiding theme of the project is to resolve data uncertainty directly at query time: inference in URDF is always triggered by a user query, so every interaction with URDF involves a query, formulated either in SPARQL syntax or in a first-order logical representation.
The screenshot on the right shows UViz, the visualization frontend of URDF.
URDF employs rules formulated in a subset of first-order logic (FOL) for reasoning, as known from logic programming and deductive databases. Specifically, we focus on combining soft deduction rules (represented as Datalog rules with weights) with additional hard consistency constraints; the latter restrict which possible instances of the RDF knowledge base are considered valid. Deduction rules are grounded against the extensional data (i.e., the "facts") captured in the underlying RDF database in order to derive new intensional facts from existing ones. Consistency constraints, on the other hand, help to clean both the extensional and the intensional facts by resolving inconsistencies that arise from violations of these constraints. All reasoning steps are fully integrated with a possible-worlds-based probabilistic database backend and are performed directly at query time. Probabilistic inference supports the full expressiveness of Datalog, including recursion and stratified negation, which also forms the core of relational algebra, SQL, and many Semantic-Web-related representations such as RDF/S, OWL-DL, and OWL-Lite.
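To make the interplay of soft rules and hard constraints concrete, here is a minimal, self-contained sketch (not URDF's actual implementation; all facts, predicate names, and the rule weight are invented for illustration). It grounds one weighted Datalog rule against a handful of extensional triples and checks a functional-property consistency constraint:

```python
# Illustrative sketch: grounding one weighted Datalog rule against a
# small set of extensional RDF facts, then checking a hard consistency
# constraint. All facts and the weight 0.8 are invented examples.

# Extensional facts as (subject, predicate, object) triples.
facts = {
    ("Alice", "livesIn", "Paris"),
    ("Paris", "locatedIn", "France"),
    ("Alice", "bornIn", "Lyon"),
    ("Alice", "bornIn", "Paris"),   # conflicting extraction
}

# Soft deduction rule (weight 0.8):
#   livesIn(X, Z) <- livesIn(X, Y), locatedIn(Y, Z)
def ground_lives_in(facts, weight=0.8):
    derived = set()
    for (x, p1, y) in facts:
        if p1 != "livesIn":
            continue
        for (y2, p2, z) in facts:
            if p2 == "locatedIn" and y2 == y:
                derived.add(((x, "livesIn", z), weight))
    return derived

# Hard constraint: bornIn is functional -- one birthplace per person.
def born_in_violations(facts):
    places = {}
    for (s, p, o) in facts:
        if p == "bornIn":
            places.setdefault(s, set()).add(o)
    return {s: ps for s, ps in places.items() if len(ps) > 1}
```

Here grounding derives the intensional fact livesIn(Alice, France) with weight 0.8, while the constraint check flags Alice's two conflicting birthplaces as an inconsistency to be resolved.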
Reasoning in URDF generally consists of two phases: (1) grounding the query pattern against the rules, and (2) the actual probabilistic inference step. We are currently investigating efficient algorithms both for MAP (maximum-a-posteriori) inference and for computing the marginal probabilities of query answers. The paragraphs below describe our most recent extensions.
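The difference between MAP and marginal inference can be illustrated by a tiny possible-worlds sketch (invented facts and probabilities, and brute-force enumeration rather than any of URDF's actual algorithms): the MAP answer is the most probable valid world, while a marginal sums the mass of all valid worlds in which a fact holds.

```python
# Illustrative sketch: exact MAP and marginal inference by enumerating
# the possible worlds of a tiny probabilistic database under one hard
# constraint. Facts and probabilities are invented.
from itertools import product

facts = {"bornIn(A,Lyon)": 0.6, "bornIn(A,Paris)": 0.5, "livesIn(A,Paris)": 0.9}

def valid(world):
    # Hard constraint: at most one birthplace per person.
    return not (world["bornIn(A,Lyon)"] and world["bornIn(A,Paris)"])

names = list(facts)
worlds = []
for bits in product([False, True], repeat=len(names)):
    w = dict(zip(names, bits))
    if valid(w):
        p = 1.0
        for f in names:
            p *= facts[f] if w[f] else 1 - facts[f]
        worlds.append((w, p))

Z = sum(p for _, p in worlds)                      # mass of all valid worlds
map_world = max(worlds, key=lambda wp: wp[1])[0]   # MAP: most probable world
marg = sum(p for w, p in worlds if w["livesIn(A,Paris)"]) / Z
```

Enumeration is exponential in the number of facts, which is exactly why efficient MAP and marginal-inference algorithms are needed at query time.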
Recent advances in the field of information extraction allow us not only to extract large semantic knowledge bases from structured or loosely structured Web sources, but also to extract additional annotations along with the facts these RDF knowledge bases contain. Among the most important such annotations are spatial and temporal ones. Temporal annotations, in particular, help us reflect that a majority of facts are not static but highly ephemeral in the real world: facts are valid for only a limited amount of time, and two or more facts may stand in a temporal dependency with each other. Time-annotated RDF knowledge bases pose additional challenges for both grounding and probabilistic inference. In addition to atemporal constraints, which express, for example, that "a person can have only one place of birth", temporal constraints can express, for example, that "a person's date of birth should precede the beginning of their marriage". The consideration of temporal annotations led us to extend our data model with temporal deduction rules and temporal consistency constraints, which seamlessly integrates aspects of both probabilistic and temporal databases.
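A minimal sketch of such a temporal consistency check (the fact layout, predicate names, and dates are invented; URDF's actual temporal model is richer):

```python
# Illustrative sketch: checking the temporal constraint "date of birth
# precedes the beginning of a marriage" over time-annotated facts.
# Predicate names and dates are invented examples.
from datetime import date

# Time-annotated facts: (subject, predicate, object, valid_from, valid_to)
facts = [
    ("Bob", "bornOn", None, date(1980, 3, 1), date(1980, 3, 1)),
    ("Bob", "marriedTo", "Carol", date(1975, 6, 1), date(2000, 1, 1)),
]

def temporal_violations(facts):
    """Return marriage facts that begin before the person's birth."""
    births = {s: start for (s, p, _, start, _) in facts if p == "bornOn"}
    bad = []
    for (s, p, o, start, end) in facts:
        if p == "marriedTo" and s in births and start < births[s]:
            bad.append((s, p, o, start, end))
    return bad
```

On the sample facts, the marriage starting in 1975 is flagged because it precedes Bob's 1980 birth date; resolving such violations is part of the cleaning step.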
Knowledge bases extracted from Web sources are inherently incomplete. Deriving new facts via deduction rules is one way to address this issue. In order to automatically learn such soft deduction rules (in Datalog) from the extracted RDF data, we investigate how to adapt Inductive Logic Programming (ILP) techniques to large and incomplete RDF knowledge bases. ILP is a well-known technique at the intersection of logic programming and machine learning. Although powerful, ILP algorithms are inherently expensive: the search space grows exponentially with the number of predicates in a candidate rule, and the evaluation of each candidate rule becomes more costly as the number of training examples obtained from a Web-extracted knowledge base (the "background knowledge") grows. Moreover, for noisy and incomplete knowledge bases, the automatic (i.e., unsupervised) selection of the positive and negative training examples needed for learning new rules poses a major challenge to the adoption of these techniques. Andre's Master thesis investigates this setting under the additional aspect of learning rules with numerical and categorical constants. Such rules allow us, e.g., to detect correlations between the income and the education level of people living in particular geographic regions (when applied to a census database).
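The core cost of candidate-rule evaluation can be seen in a small sketch (invented facts and a deliberately naive rule; support/confidence scoring is a standard ILP-style measure, not the specific measure used in the thesis):

```python
# Illustrative sketch: scoring one candidate rule
#   marriedTo(X, Y) <- livesIn(X, Z), livesIn(Y, Z)
# by support and confidence over a small background knowledge base.
# All facts are invented examples.
facts = {
    ("Ann", "livesIn", "Rome"), ("Ben", "livesIn", "Rome"),
    ("Cy", "livesIn", "Oslo"), ("Di", "livesIn", "Oslo"),
    ("Ann", "marriedTo", "Ben"),
}

def score_rule(facts):
    lives = [(s, o) for (s, p, o) in facts if p == "livesIn"]
    married = {(s, o) for (s, p, o) in facts if p == "marriedTo"}
    # All groundings of the rule body: pairs living in the same place.
    body = {(x, y) for (x, cx) in lives for (y, cy) in lives
            if cx == cy and x != y}
    support = len(body & married)       # body groundings that are confirmed
    confidence = support / len(body)    # fraction of body groundings confirmed
    return support, confidence
```

Even this toy rule has four body groundings from five facts; on a Web-scale knowledge base, enumerating body groundings for every candidate rule dominates the learning cost, and under incompleteness the unmatched groundings cannot simply be treated as negative examples.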
Correlations (or, more precisely, dependencies) among facts in an RDF database may arise from two sources: (1) static input dependencies derived from real-world observations at extraction time; and (2) dynamic dependencies induced at query time by the lineage structure we obtain by tracing the grounding structure of the rules. Correlations can again take the form of both soft and hard dependencies: they may capture strict logical dependencies such as implication, mutual exclusion, or transitivity, or they may come as arbitrary soft (i.e., weighted) correlations, both of which are captured via factors that connect RDF facts. Javeria's Master thesis describes an initial approach to incorporating both soft and hard rules into a unified framework for consistency reasoning over Web-extracted RDF data.
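A sketch of how lineage induces such dependencies at query time (invented lineage formula and probabilities; exact computation by brute-force enumeration, not URDF's inference algorithm): an answer derived along two paths that share a base fact is not independent of itself, so its probability must be computed over the lineage formula rather than by naively combining the two derivations.

```python
# Illustrative sketch: marginal probability of a query answer from its
# Boolean lineage over independent base facts, by enumerating all
# assignments (exact but exponential). Probabilities are invented.
from itertools import product

probs = {"f1": 0.9, "f2": 0.8, "f3": 0.7}

def lineage(v):
    # Two derivations of the same answer, sharing base fact f1.
    return (v["f1"] and v["f2"]) or (v["f1"] and v["f3"])

names = list(probs)
answer_prob = 0.0
for bits in product([False, True], repeat=len(names)):
    v = dict(zip(names, bits))
    if lineage(v):
        p = 1.0
        for f in names:
            p *= probs[f] if v[f] else 1 - probs[f]
        answer_prob += p
```

The exact result is 0.9 * (1 - 0.2 * 0.3) = 0.846, whereas treating the two derivations as independent events would overestimate it; the shared fact f1 is precisely the kind of query-time dependency the lineage structure makes explicit.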
Mohamed's Master thesis presents an initial implementation of a rule-based query processor over RDF facts, which are assumed to be stored in a disk-resident database backend. It integrates the rule-based URDF reasoner with the RDF-3X SPARQL engine and presents optimizations along several dimensions: (i) top-down vs. bottom-up query evaluation techniques; (ii) extensive caching to reduce both disk accesses and rule evaluations; (iii) regrouping extensional and intensional predicates into blocks, which better exploits the underlying storage engine's ability to optimize traditional relational queries; and (iv) a probabilistic approach to join ordering and cost estimation in this context.
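The flavor of optimization (ii) can be sketched in a few lines (invented predicates; a toy in-memory stand-in for the disk-resident backend): top-down evaluation of a recursive rule with a cache, so each sub-query is evaluated only once.

```python
# Illustrative sketch: cached top-down evaluation of a recursive rule
#   ancestor(X, Y) <- parent(X, Y)
#   ancestor(X, Y) <- parent(X, Z), ancestor(Z, Y)
# The parent relation is an invented in-memory example.
from functools import lru_cache

parent = {("a", "b"), ("b", "c"), ("c", "d")}

@lru_cache(maxsize=None)
def ancestors(x):
    """All Y with ancestor(x, Y); the cache ensures each sub-query
    ancestors(z) is evaluated at most once across the whole query."""
    result = set()
    for (p, c) in parent:
        if p == x:
            result.add(c)
            result |= ancestors(c)
    return frozenset(result)
```

With a disk-resident backend, each avoided re-evaluation also avoids the corresponding disk accesses, which is why caching pays off along both dimensions mentioned above.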
UViz is a fully Web-enabled visualization frontend for a variety of reasoning tasks in URDF. It is built in a client-server architecture using the Adobe Flex RIA framework and runs in most common Web browsers via a Flash plugin. The Adobe BlazeDS service provides fast binary client-server communication, and the Flare visualization toolkit supplies the sophisticated graph-rendering library that drives the frontend. UViz delivers an intuitive and highly dynamic visualization experience for accessing the URDF reasoner, supporting UI tasks such as exploring the knowledge base, rule-based and statistical reasoning, resolving inconsistencies, and explaining answers through their lineage. Timm's Master thesis gives an in-depth description of the UViz architecture.