Introduction to OOMs

Site Navigation


Contents

  1. Preface
  2. Stochastic Time-series and Sequences
  3. Models as Probability Distributions
  4. OOMs and Probability Distributions
  5. Basic Learning Algorithm
  6. Optimal and Efficient Estimation of OOMs
  7. Outlook

Introduction to OOMs

Tobias Oberstein, Feb. 2003


PRELIMINARY VERSION


1. Preface

Observable Operator Models (OOMs) are generative systems which can model stochastic time-series data and sequences. What follows is an informal introduction to OOMs and topics related to the field.

top


2. Stochastic Time-series and Sequences

Time-series capture time dependent, time varying aspects of reality in series of measurements. At each instance of time, the aspect to be captured is sampled on some scale yielding a number.

A stochastic time-series is supposed to be either generated by a process inherently non-deterministic, deterministic but scrambled with noise or at least to appear non-deterministic because of our lack or will of precise measurements.
Formally, the line of distinction can be drawn quite accurately: inherent stochastic time-series are realizations of stochastic processes, whereas pseudo-stochastic time-series are discretized trajectories of dynamical systems in a chaotic regime.
Examples are weather data like temperature patterns, financial market data like stock quotes, speech and voice data and distorted or very complex sensory signals.

On the other hand, one often speaks of (stochastic) sequences when the aspect of reality captured has no obvious scale or when ordering (of measurements) is important but cannot be interpreted as ``time'' in the common sense.
An example of the former is a natural language text, since the alphabet is finite and has no ``scale''. An example of the latter is a biosequence like DNA since the position of a symbol on a DNA strand has no direct interpretation as an instance in time.

top


3. Models as Probability Distributions

One may look at stochastic time-series and sequences in terms of the probabilities of their occurrence. From this perspective a model is given by a collection of probabilities for all possible realizations one could observe. Now, OOMs are models of stochastic time-series and sequences in the sense that they are formalized, compressed descriptions of the probability distributions of those series and sequences. OOMs are generators of stochastic time-series and sequences in the sense that they also constitute a stochastic mechanism for producing those series and sequences.

top


4. OOMs and Probability Distributions

OOMs build a bridge between linear algebra and the theory of stochastic processes. They do this by describing the change of our knowledge about the future of a stochastic system by means of linear operators selected on the basis of present observations. Our knowledge about the future of a stochastic system is captured by a conditional distribution which specifies the probabilities of possible future developments given a particular, already realized past. This knowledge about the future of the system is subject to change when we observe a new present outcome. Hence, the newly arrived information must be reflected in our knowledge about the process future. Updating our knowledge about the process future is done by applying a linear operator selected from a finite fixed set of operators on the basis of the present outcome.

Any stochastic process can be described using OOMs when we allow for OOMs of infinite dimension. On the other hand, OOMs with finite dimension can still describe a broad class of stochastic processes. Indeed, the class of processes thus describable is a strict superset of those which can be captured by Hidden Markov Models with finitely many hidden states.
OOMs with finite dimension have some nice properties. First, their defining linear operators can be written as matrices, which allows one to apply all the standard tools known from linear algebra. Second, we can decide when two OOMs define the same stochastic process, that is when they are equivalent and thereby see how equivalence classes of OOMs arise.

top


5. Basic Learning Algorithm

Model training in the context of OOMs means estimating linear operators from time-series or sequences sample data. The basic learning algorithm defines how to estimate the matrices representing operators of finite dimensional OOMs from sample data. The procedure makes use of so-called indicative events and characteristic events which can be thought of as rasters through which the sample is exploited. It is important to understand the following three facts about the basic learning algorithm for OOMs:


  1. it gives no clues about how to choose those indicative and characteristic events
  2. given infinite training data, one may choose indicative and characteristic events virtually arbitrarily and nevertheless almost surely arrives at a perfect model estimate
  3. given finite training data, the specific choice of indicative and characteristic events if of utmost importance for model quality, thus far from irrelevant

Fact 1 states that one is formally free (modulo a practically always fulfilled side-condition) to choose indicative and characteristic events with the basic learning algorithm.

Fact 2 states that the basic learning algorithm is asymptotically correct, that is given infinite training data in the limit, the OOM estimated from the sample is almost always right. This holds true irrespective of particular choices of indicative and characteristic events.

Fact 3 states that the situation with finite training data is fundamentally different. With finite training data, the particular choice of indicative and characteristic events is most important as it will determine the quality of the estimated model. The basic learning algorithm is asymptotically correct, but nothing is said about the speed of convergence or possible bounds of model quality for finite training data, both of which are dependent on the particular choice of indicative and characteristic events used in estimating an OOM. The OMK's learning algorithm gives an implicit and practical answer to the problem of choosing indicative and characteristic events when there is only finite training data available.

top


6. Optimal and Efficient Estimation of OOMs

First steps, methods and foundations towards a systematic treatment of optimal and efficient estimation of OOMs from finite data where developed in a recent master thesis (Efficient Training of Observable Operator Models using Context Graphs, Tobias Oberstein, 2002). We wanted our OOM learning method to be "optimal" since the amount of available training data is finite in practice and "efficient" since computational resources available in estimating the model are usually restricted.

Three important subgoals had been identified. First, a measure of quality for estimated OOMs was needed, otherwise methods of "optimal estimation" could not be developed or compared. Second, we felt that a data structure was needed which efficiently exploits the structure inherent in the training sample, allows to represent candidate choices of indicative and characteristic events and facilitates the computation of counting statistics needed for the basic learning algorithm. Third, ideally a theory would be needed that allows to derive results for choosing optimal indicative and characteristic events.

The first subgoal, defining a rigorous measure of model quality is essentially solved. Fundamental notions from information theory have been customized and applied to measure quality of estimated OOMs and experiments were done to verify the dependency of model quality on training sample size and the particular choice of indicative and characteristic events.

The second subgoal has been met by developing a new data structure, context graphs which build on suffix trees and are capable of representing and indexing all variable length contexts within a finite sequence. Further it was shown how to represent all candidate choices of indicative and characteristic events as special colorings of the context graph of the training sample. This is crucial to efficiency since it allows to represent the training sample and all candidates for indicative and characteristic events, that is rasters to exploit the sample within one unified data structure. Context graphs also provide means to control the complexity of the rasters used in exploiting the training sample. This is of value, since the third subgoal likely involves finding a balance between overfitting and underexploiting the sample data.

Regarding the third goal, the OMK's learning algorithm is a practical, working first answer, nevertheless with clearcut theoretical underpinnings. However, there can be no doubt that we are at the beginning of a general theory of optimal OOM estimation from finite data.

top


7. Outlook

One might suspect that under the constraint of finite sample data we must strive to fully exploit the sample achieving good model precision while not overfitting the sample maintaining good generalization performance. Further an elaboration of this idea when applied to OOMs could require research along lines similar to those outlined in statistical learning theory (-> Vapnik) or model selection using the principle of Minimum Description Lengeth (MDL) (-> Rissanen).
Statistical learning theory explicitely and deeply addresses the problem of optimal model estimation from finite data, but only for static patterns and not time-series.
Consequently, it might be interesting to apply and customize methods from statistical learning theory or model selection using MDL to optimal OOM estimation from finite data.

top