Authored By: Eric Sven Ristad, Robert G. Thomas

Hierarchical Non-Emitting Markov Models

Feb 28, 2024

The Markov model has long been the cornerstone of statistical language modeling. Over that time, many alternatives have appeared, but none has offered a better combination of predictive performance, computational efficiency, and ease of implementation. This article focuses on an augmentation of the Markov model: hierarchical non-emitting state transitions. Although each state in the model remains Markovian, the model as a whole is not, because it can represent unbounded dependencies in the state order distribution.
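
As a rough sketch of the transition structure (using assumed notation, with $\lambda(x)$ a hypothetical per-state continuation parameter rather than the article's exact parameterization): from a state $x = x_1 x_2 \cdots x_i$ of order $i$, the model either emits the next symbol $y$ and moves to the context extended by $y$ (truncated to the maximal order), or takes a silent, non-emitting transition that drops the oldest symbol of the context:

$$
x_1 x_2 \cdots x_i \;\xrightarrow{\;y\;}\; x_1 x_2 \cdots x_i\,y
\quad\text{with probability } \lambda(x)\,p(y \mid x),
\qquad
x_1 x_2 \cdots x_i \;\xrightarrow{\;\varepsilon\;}\; x_2 \cdots x_i
\quad\text{with probability } 1-\lambda(x).
$$

Because a non-emitting transition consumes no symbol, several of them can chain within a single time step, and the next symbol is then generated from whichever shorter context the model ends up in.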

Benefits of Non-Emitting Markov Models

As a result, the non-emitting Markov model is strictly more powerful than any Markov model, including the context model, the backoff model, and the interpolated Markov model. More importantly, the non-emitting model consistently outperforms the best Markov models on natural language text under a wide range of experimental conditions, while remaining comparable to the interpolated Markov model in computational efficiency and ease of implementation.

The article begins by motivating the underlying problem in time series prediction: how to combine the probabilities of events of different orders. It then reviews the interpolated Markov model and briefly demonstrates that interpolated models are equivalent to basic Markov models of the same model order. Next, it introduces the hierarchical non-emitting Markov model and proves that it is strictly more powerful than any Markov model. It also provides efficient algorithms to estimate the parameters of a non-emitting model from data.
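
To give a flavour of the equivalence argument (a sketch in standard interpolated-model notation, not necessarily the article's exact derivation), the interpolated prediction for an order-$n$ context $x$ has the recursive form

$$
p(y \mid x) \;=\; \lambda(x)\,\hat{p}(y \mid x) \;+\; \bigl(1-\lambda(x)\bigr)\,p(y \mid x'),
$$

where $\hat{p}$ is the order-$n$ estimate, $x'$ is $x$ with its oldest symbol removed, and the recursion bottoms out at the order-0 (unigram) distribution. Every quantity on the right-hand side is a fixed function of the order-$n$ context $x$, so the left-hand side is itself just one conditional distribution over $y$ for each such context. Tabulating it yields a basic order-$n$ Markov model that assigns exactly the same probabilities, which is why interpolation is a parameter estimation scheme rather than a more expressive model class.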

Analysing Performance of Non-Emitting Markov Models

The following sections report empirical results for the interpolated model and the non-emitting model on the Brown corpus and the Wall Street Journal corpus. Lastly, we conjecture that the non-emitting model excels empirically because it imposes a quasi-Bayesian form of regularization on maximum likelihood techniques.

In real-world time series problems, future events depend on the entire past, so ideally we would like to increase the model order as far as practical. However, we only have a finite amount of training data from which to estimate the model. This creates a tension between model complexity and data sparsity that is fundamental to time series modeling: a higher order captures longer dependencies, but it also spreads the training data across exponentially many contexts. For example, a character-level model over a 256-symbol alphabet has 256^n distinct contexts of order n, far more than any realistic corpus can cover once n exceeds a few symbols. An effective model must therefore be able to draw on events of both higher and lower orders.

Combining Events of Different Orders in Markov Models

The two main techniques for combining events of different orders are backoff and interpolation. In an interpolated model, the transition probabilities of lower- and higher-order states are combined using mixing parameters. In a backoff model, event probabilities are blended according to a partial order that typically prefers higher-order events to lower-order events. The article shows that backoff models and interpolated models are formally equivalent to basic Markov models; consequently, backoff and interpolation are merely parameter estimation schemes for basic Markov models.
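
A minimal sketch of the two combination schemes, assuming simple bigram and unigram count tables and hypothetical mixing and discount parameters (generic interpolation and absolute-discount backoff, not the exact estimators discussed in the article):

```python
from collections import Counter

# Toy corpus and count tables (order 1 = bigram, order 0 = unigram).
corpus = "the cat sat on the mat the cat ate".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
total = sum(unigrams.values())

LAMBDA = 0.75   # hypothetical mixing weight for the higher-order estimate
DISCOUNT = 0.5  # hypothetical absolute discount for the backoff scheme

def p_unigram(w):
    return unigrams[w] / total

def p_bigram_ml(w, prev):
    return bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0

def p_interpolated(w, prev):
    # Interpolation: always mix the higher- and lower-order estimates.
    return LAMBDA * p_bigram_ml(w, prev) + (1 - LAMBDA) * p_unigram(w)

def p_backoff(w, prev):
    # Backoff: trust the discounted higher-order estimate when the event was
    # observed; otherwise fall back to the lower-order estimate, scaled so
    # that the probabilities over all words still sum to one.
    if bigrams[(prev, w)] > 0:
        return (bigrams[(prev, w)] - DISCOUNT) / unigrams[prev]
    reserved = DISCOUNT * sum(1 for b in bigrams if b[0] == prev) / unigrams[prev]
    unseen = sum(p_unigram(v) for v in unigrams if bigrams[(prev, v)] == 0)
    return reserved * p_unigram(w) / unseen if unseen else 0.0

print(p_interpolated("cat", "the"), p_backoff("cat", "the"))
print(p_interpolated("mat", "on"), p_backoff("mat", "on"))
```

In both schemes the combined probability is a fixed function of the current context alone, which is exactly why they remain equivalent to basic Markov models.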

Finally, the article emphasizes that, unlike interpolation and backoff, non-emitting transitions are not merely an estimation method: they increase the expressive power of the model class. Because of this, non-emitting models are strictly more powerful than the class of basic Markov models.
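
To illustrate the mechanism, here is a toy sketch with hypothetical parameters (a two-level hierarchy over the alphabet {'a', 'b'}; it is not the article's model or estimator). Before each emission the model may take a chain of hidden non-emitting transitions, each dropping the oldest context symbol, and the probability of a string is the sum over all such hidden paths. Because the context can shrink by several levels in one step but grow back by only one symbol per emission, the distribution over the current state depends on the entire history of hidden backoff decisions, not just on the last two observed symbols.

```python
from collections import defaultdict

MAX_ORDER = 2
STAY = {2: 0.8, 1: 0.7}   # hypothetical prob. of emitting at order i;
                          # 1 - STAY[i] is the prob. of a non-emitting drop

def p_emit(context, y):
    """Hypothetical per-context emission distributions over {'a', 'b'}."""
    if not context:
        return 0.5                        # order-0 distribution is uniform
    if context.endswith('a'):
        return 0.9 if y == 'a' else 0.1
    return 0.3 if y == 'a' else 0.7

def sequence_probability(seq):
    """Forward algorithm summing over all hidden non-emitting paths."""
    forward = {'': 1.0}                   # start in the empty (order-0) state
    for y in seq:
        nxt = defaultdict(float)
        for ctx, mass in forward.items():
            c = ctx
            while True:
                i = len(c)
                emit_prob = STAY[i] if i > 0 else 1.0   # order 0 must emit
                # Emit y from context c; the next state is c plus y,
                # truncated to the maximal order.
                nxt[(c + y)[-MAX_ORDER:]] += mass * emit_prob * p_emit(c, y)
                if i == 0:
                    break
                mass *= 1.0 - STAY[i]     # hidden non-emitting transition
                c = c[1:]                 # drop the oldest context symbol
        forward = dict(nxt)
    return sum(forward.values())

print(sequence_probability("abba"))       # total probability of the string
```

Summing over the hidden non-emitting paths is what separates this from plain interpolation: in an interpolated model the backoff decision is forgotten as soon as the symbol is emitted, whereas here it is carried forward in the state.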