A Markov chain is a random process in which we assume the previous state (or states) holds enough information to predict the next state. Unlike a series of coin flips, these events are dependent. It's easiest to understand through an example. Imagine the weather can only be rainy or sunny; that is, the state space is {rainy, sunny}. We can represent our Markov model as a transition matrix, with each row corresponding to a current state and each column giving the probability of moving to another state. It's often easier to read as a state transition diagram: given that today is sunny, there is a 0.9 probability that tomorrow will be sunny, and a 0.1 probability that tomorrow will be rainy.

Text Generator

One cool application of this is a language model, in which we predict the next word based on the current word(s). If we predict based only on the last word, it is a first-order Markov model; if we use the last two words, it's a second-order Markov model. In my example I trained the model on Walden by Henry David Thoreau. I also included files of Thus Spoke Zarathustra by Nietzsche and some speeches by Obama to make it easy to experiment. The cool thing about this is that whatever text you train it on, the model spits out really similar text.

First we have to import NLTK, the best NLP library in Python. The natural language processing we're doing here is pretty mild, but NLTK's built-in functions save a lot of code. We then turn the string (read from the text file) into an array of words using the split() function.
import random

file = open('Text/Walden.txt', 'r')
walden = file.read()
walden = walden.split()
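Going back to the weather example from the introduction, the transition matrix can be sketched as a dictionary of dictionaries. This is an illustrative sketch, not code from the original post: the sunny row (0.9 / 0.1) comes from the example above, while the rainy row is an assumed placeholder, since the text only gives the sunny row.

```python
import random

# Transition matrix: transitions[current][next] = probability.
# The sunny row matches the example in the text; the rainy row
# is an assumption added only to make the chain runnable.
transitions = {
    'sunny': {'sunny': 0.9, 'rainy': 0.1},
    'rainy': {'sunny': 0.5, 'rainy': 0.5},
}

def next_state(current):
    """Sample tomorrow's weather given today's state."""
    states = list(transitions[current])
    weights = [transitions[current][s] for s in states]
    return random.choices(states, weights=weights)[0]

def simulate(start, days):
    """Walk the chain for a given number of days."""
    state = start
    history = [state]
    for _ in range(days):
        state = next_state(state)
        history.append(state)
    return history
```

Calling simulate('sunny', 5) returns a list of six states: today plus five sampled days. Notice the Markov property in next_state: the sample depends only on the current state, not on the history.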
These next two functions are the meat of the code. The Conditional Frequency Distribution from NLTK that we are eventually going to use has to take the array as pairs. So the phrase "Hi my name is Alex" becomes [("Hi", "my"), ("my", "name"), ("name", "is"), ("is", "Alex")]. The makePairs function takes in an array (a string split by word) and outputs an array in the above format. The generate function takes in a conditional frequency distribution. Think: how many times did each word appear after 'farm'? That is what a conditional frequency distribution records (for every word, not just 'farm'). The rest of the generate function outputs text based on the distribution observed in the training data.
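The two functions described above can be sketched as follows. This is a minimal, self-contained illustration, not the original post's code: it uses collections.Counter in place of NLTK's ConditionalFreqDist (which nltk.ConditionalFreqDist(pairs) would build directly), and the makeCFD helper is a name introduced here for the sketch.

```python
import random
from collections import defaultdict, Counter

def makePairs(words):
    """Turn a word list into (current, next) pairs,
    e.g. ['Hi', 'my', 'name'] -> [('Hi', 'my'), ('my', 'name')]."""
    return [(words[i], words[i + 1]) for i in range(len(words) - 1)]

def makeCFD(pairs):
    """Conditional frequency table: for each word, count how often
    every other word appeared immediately after it."""
    cfd = defaultdict(Counter)
    for current, nxt in pairs:
        cfd[current][nxt] += 1
    return cfd

def generate(cfd, start_word, length=20):
    """Walk the chain: repeatedly sample the next word in
    proportion to how often it followed the current one."""
    word = start_word
    output = [word]
    for _ in range(length - 1):
        followers = cfd.get(word)
        if not followers:  # dead end: word never appeared mid-text
            break
        choices, counts = zip(*followers.items())
        word = random.choices(choices, weights=counts)[0]
        output.append(word)
    return ' '.join(output)
```

For a second-order model, the same idea applies with pairs of words as the condition: the keys of the table become two-word tuples instead of single words.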