Introduction

There is currently a huge amount of data available from the typical computer connected to the Internet. However, most of that data is currently designed to be read by humans and not by machines. The result is that people can readily find data but they cannot readily get their machines to automatically process it for them. This document presents a solution to that problem. In summary, a user will create a small new piece of machine readable data that describes how some (possibly large) existing data resource can be interpreted as relational data in a Virtual Database (VDB). The user will then be able to write VDB queries that access and combine data from all such data sources.

The "Semantic Web" as described in Scientific American[1] certainly has similar goals, and it might be reasonable to think of what we describe here as an implementation of the semantic web, but we see our approach as quite different from that described at w3.org[2]. (We intend to address that topic - on the queue. ***)

This approach has the following desirable properties.

The facility described here is built on the AP5 VDB. One important difference between the AP5 VDB and traditional relational databases (RDBs) is particularly relevant to the WWDB application. RDBs deal only with complete tables, i.e., it must always possible to enumerate all of the rows of a table. There are many cases where data available on WWW is most naturally modeled as a relation that does not have this property. For instance, airline sites typically allow queries that report flights with a given source and destination on a given date, but they typically do not support a query to report all flights. AP5 does support such relations. It also supports a much more general class of "computed" relations than RDBs. For instance, one might define a date relation that supports generation of dates between a start date and an end date. This could be combined with an airline schedule relation to write a query involving flights between two dates. In order to do this in an RDB one would be reduced to "programming", i.e. writing a loop that iterates over dates and does a schedule query for each one.

AP5 already provides interfaces adequate for accessing arbitrary data and treat it as a relation. The only additions described in this document are conveniences for loading the definitions from web pages. Other than that, the main contribution of this document is illustration of how to use the facilities already in AP5 to access external data.

How to use WWVDB

In case you have reached this point without realizing it, AP5 is implemented in lisp. In order to run it you need to get a copy of lisp running (I currently use clisp), download the AP5 source from ap5.com and build it following the directions in README. (I might try to put some binary distributions on ap5.com in order to make it more convenient to get this far.) Next you have to load the code for loading from url. (I might arrange in the future to include this in the build, which would eliminate another step.) This is a fairly trivial extension. It simply downloads a file from a url and loads it into the lisp image. In order to avoid conflicts among different url's defining the same relations, functions, variables, etc. the load is done in a package constructed just for that url. The intended use is that the file retrieved from the url will not do an in-package but simply define relations and whatever other data and functions are needed to support them. In some cases it will provide functions that can be used to define other relations. The user can either refer to those facilities in their original package or import them into his own package. The packages created for loading urls use the ap5 package, so the code should not have to deal much with other packages.

It should be obvious that running arbitrary code from WWW is inherently dangerous. Loading a file into lisp runs the code in that file. We generally expect users to examine the code before loading it. We expect the code to be relatively small and transparent (see examples below), so this should not be a big problem. Alternatively, one might try to "sandbox" the WWVDB, e.g., by altering the underlying lisp. However that topic is not further addressed in this document.

The code for loading a url is in loadurl.lisp. The current version relies on clisp to execute shell commands, a unix shell to execute them and the wget program to retrieve a url. This is not particularly portable, but it is small and we expect that those who want to use WWVDB on other platforms will have no trouble replacing this code. In fact, it is so trivial that I reproduce it right here.

;; code for loading the contents of a url into ap5
(in-package :ap5)

(defun wget (url file)
  (EXT:RUN-SHELL-COMMAND
   (format nil "wget -O ~s ~s" file url)))

(defun loadurl (url name)
  ;; name is both the name of the file and the package
  (wget url name)
  (let ((*package* (eval `(defpackage ,name (:use lisp ap5)
			    (:shadowing-import-from ap5 loop ++ --)))))
    (load name)))

In summary, the directions for running WWVDB:

At this point you are in a position to follow the examples below.

Some examples

We start from the Read-Eval-Print loop in an AP5 image.
(in-package :ap5)
This puts you in the AP5 package. You could, alternatively, create your own package. You probably want it to use the AP5 and LISP packages. We assume you have already downloaded loadurl.lisp. Now you load it.
(load "loadurl.lisp")
Now we load a url. The first argument is the url, the second is used both as the name of the downloaded file and the name of the package in which to load it. See http://ap5.com/inflation.wwvdb for the code about to be loaded.
(loadurl "http://ap5.com/inflation.wwvdb" "INFLATION")
 ... output from wget omitted
;; Loading file INFLATION ... 
[defrel INFLATION] 
;; Loaded file INFLATION 
T 
At this point you should probably take a look at http://data.bls.gov/cgi-bin/cpicalc.pl, the web page we're going to use for supplying data. Since the first input is just a multiplier, I define a simpler relation,
(inflation x y z), which is true iff $1 in year x is worth $z in year y.

Note that this is an example of a case where the web resource does not provide a complete set of tuples. We can only generate z given x and y.

(any x s.t. (inflation::inflation 1920 1930 x))
 ... output from wget omitted
0.83 
(any x s.t. (inflation::inflation 1920 x .8)) 
*** - THROW: there is no CATCHer for tag AP5-CANNOT-GENERATE
 ... rest of error output omitted
Nevertheless, this relation could be used in compound queries that don't require other forms of generation.

A more useful (and more efficient) version is defined by http://ap5.com/cpi.wwvdb.

(loadurl "http://ap5.com/cpi.wwvdb" "CPI")
 ... wget output omitted ...
;; Loading file CPI ...
In this case the web page providing the data is ftp://ftp.bls.gov/pub/special.requests/cpi/cpiai.txt. (I recommend taking a quick look at that web page.) Since CPI data changes slowly, the implementation choice was to simply retrieve the data when the file is loaded. This is done with the wget function defined in loadurl. The result is more wget output:
--14:34:06--  ftp://ftp.bls.gov/pub/special.requests/cpi/cpiai.txt
           => `ap5.com:cpiai.txt'
 ... more output omitted ...

[defining CPI]
;; Loaded file CPI
T
At this point we have the CPI relation defined in the CPI package:
(cpi year index) - consumer price index data from BLS
As demonstration of that fact:
 (loop for (x y) s.t. (cpi::cpi x y) count t)
94
(I won't try to teach you how to use AP5 here. See the manual.)

Note that I prefer to use upper case package names. The reason is that the lisp reader converts to upper case (outside of strings). If I had used "cpi" instead of "CPI" then instead of cpi::cpi I would have had to type |cpi|:cpi.

If you examine and think about the code in the CPI file you may be a little disappointed. All it does is read the data, interpret it as tuples in a relation, and store it in a real relation. That could be done with any database in combination with any programming language. Let's move on to something you would probably not approach with a normal RDB. PROCFS code

 (loadurl "http://ap5.com/procfs.wwvdb" "PROCFS")
--15:30:47--  http://ap5.com/procfs.wwvdb
           => `PROCFS'
 ... 8 lines omitted ...
;; Loading file PROCFS ...
[defining PROCESS]
[defining PROCESS-STAT]
[defining READ-STRING-ELT
WARNING: Estimated size of (AP5::X0 AP5::X1) s.t.
          (#,(DBO RELATION READ-STRING-ELT) AP5::C0 AP5::X0 AP5::X1)
         is infinity
]
[defining PROCESS-PARENT]
[defining PROCESS-UTIME]
[defining PROCESS-STIME]
[defining PROCESS-STARTTIME]
[defining PROCESS-VSIZE]
[defining PROCESS-RSS]
[defining PROCESS-CMDLINE]
;; Loaded file PROCFS
T
The process relation has one tuple (containing only one integer) for each process.
 (loop for x s.t. (procfs::process x) count t)
124
The process-parent relation contains a set of pairs, (x y), where y is the parent of x. For instance,
 (any (x y) s.t. (procfs::process-parent x y))
14921 ;
7363
process 7363 is the parent of process 14921. Utime and Stime are user and system cpu times used by the process (in 1/100 sec. units, I think), starttime is the clock time when the process started (seems to be 100'th secs since last reboot), etc. All of these can be combined now in queries such as the following.
 (listof (p ut cmd) s.t.
   (and (procfs::process-utime p ut)
        (> ut 10000)
        (procfs::process-cmdline p cmd)))
in order to find the commands of processes with > 100 sec of utime.

It should be clear by now that external data must be viewed as nonatomic, i.e., they change outside the AP5 transaction mechanism. Changes to that data therefore cannot be used to trigger rules. In fact, most extermal data cannot be updated from within AP5 at all.

As a final example I mention a facility that does not actually define any relations at all. MYSQL code

(loadurl "http://ap5.com/mysql.wwvdb" "MYSQL")
This provides code that makes it convenient to access tables in a MySQL database as AP5 relations. The file itself contains further documentation. However, here's a brief sample:
(loadurl "http://ap5.com/mysql.wwvdb" "MYSQL")
 ... output omitted
(defrelation T2 :representation 
   (mysql "BugReport" "mysql -N -B -e ~S --password=... --user=... --host=... test"
          ("id" "text") (number string) :indices (("id"))))
 ... more output omitted
(listof t2) 
((1 "NULL") (2 "hello 13 2 13") (3 "hello 14 3 4") 
 (4 "hello 15 hi 3 132007-07-19 14:31:06")) 

Last modified: Sat Jun 14 16:20:13 PDT 2008