# How to fake a sophisticated knowledge of wine with Markov Chains


To the untrained (like me), wine criticism may seem like an exercise in pretentiousness. It may seem like anybody following a set of basic rules and knowing the proper descriptors can feign sophistication (at least when it comes to wine).

In this post, we will be exploiting the formulaic nature of wine reviews to automatically generate our own reviews that appear (at least to the untrained) to be legitimate.

## Markov Chains
A Markov chain is a system that transitions between states using a random, memoryless process. The transition from one state to another is determined by a single random sample from a (usually discrete) probability distribution. That distribution depends only on the current state, so the system wanders through the chain according to these random transitions with no regard to its previous states.

A roll-of-the-dice board game can be likened to a Markov chain; the dice determine how many squares you move, and each roll is in no way influenced by your previous rolls. Scores in basketball games appear to act in this way as well (in spite of the myth of the 'hot hand'), and a gambler's earnings almost certainly hold the Markov property (see the Monte Carlo Fallacy).

Many more phenomena can be appropriately modeled by a Markovian process... but language isn't one of them.

## Markov Chains and Text Generation
The image below shows a Markov chain that is built from the following lyrics:

[Image: Markov chain of The Smiths lyrics]

I am the son
and the heir
of a shyness that is criminally vulgar
I am the son and heir
of nothing in particular

Here, each word is a state, and the transitions are based on the number of times a word appears after another one. For example, "I" is always followed by "am" in the text, so the transition from "I" to "am" occurs with certainty (p=1). Following the word "the", however, "son" occurs twice and "heir" occurs once, so the probabilities of those transitions are roughly .67 and .33, respectively.
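As a rough illustration, these transition probabilities can be computed by counting which words follow each word. The sketch below is written for Python 3 (unlike the Python 2 listings in this post) and assumes the lyrics are treated as two separate sentences, so that "vulgar" and "particular" have no successors. The helper name `transition_probabilities` is made up for this example.

```python
from collections import Counter, defaultdict

# Assumption: the lyrics are split into two sentences, so the last
# word of each sentence has no outgoing transition.
sentences = [
    "I am the son and the heir of a shyness that is criminally vulgar",
    "I am the son and heir of nothing in particular",
]

def transition_probabilities(sentences):
    """Map each word to a dict of {next_word: probability}."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        # pair every word with the word that comes right after it
        for current, following in zip(words, words[1:]):
            counts[current][following] += 1
    return {
        word: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for word, followers in counts.items()
    }

probs = transition_probabilities(sentences)
print(probs["I"])    # "am" is the only word that follows "I"
print(probs["the"])  # "son" is twice as likely as "heir"
```

Words that never appear as keys in `probs` are the ones with no outgoing transitions.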

Text can be generated using this Markov chain by changing state until an "absorbing state" is reached, where there are no longer any transitions possible.

Given "I" as an initial state, two possible text generations are

• I am the heir of nothing in particular
• I am the son and the son and the son and the son and the son of a shyness that is criminally vulgar.
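A random walk like the ones above can be sketched in a few lines of Python 3. This is only an illustration: it assumes the lyrics are split into two sentences (so "vulgar" and "particular" end a walk), and the function name `generate` and its `rng` parameter are my own inventions.

```python
import random
from collections import defaultdict

# Assumption: two sentences, so each sentence's last word is a dead end.
sentences = [
    "I am the son and the heir of a shyness that is criminally vulgar",
    "I am the son and heir of nothing in particular",
]

# Store every observed successor (duplicates included) for each word.
chain = defaultdict(list)
for sentence in sentences:
    words = sentence.split()
    for current, following in zip(words, words[1:]):
        chain[current].append(following)

def generate(chain, start, rng=random):
    """Walk the chain from `start` until a word with no successors is reached."""
    words = [start]
    while words[-1] in chain:
        words.append(rng.choice(chain[words[-1]]))
    return " ".join(words)

print(generate(chain, "I"))
```

Because "the" can lead back to "son" and then "and" again, a walk can loop several times before it finally escapes to a word with no successors.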

Very often, the generated text violates basic rules of grammar; after all, the transitions are "dumb" stochastic processes without knowledge of grammar and semantics.

Instead of a memoryless chain, though, we can build a chain where the next state depends on the last n states. This can still satisfy the Markov property if we view each state as holding n words. When using these 'higher order' chains to generate text, something very interesting happens. Since the states are now made up of clauses and phrases (instead of words) the generated text seems to magically follow (some of) the rules of grammar, while still being devoid of semantic sense.
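To make the idea of a state "holding n words" concrete, here is a small Python 3 sketch in which each state is a tuple of `order` consecutive words. The function name `build_chain` is hypothetical, and the lyrics are again assumed to be split into two sentences.

```python
from collections import defaultdict

def build_chain(sentences, order=2):
    """Map each tuple of `order` consecutive words to the words seen after it."""
    chain = defaultdict(list)
    for sentence in sentences:
        words = sentence.split()
        for i in range(len(words) - order):
            state = tuple(words[i:i + order])
            chain[state].append(words[i + order])
    return chain

sentences = [
    "I am the son and the heir of a shyness that is criminally vulgar",
    "I am the son and heir of nothing in particular",
]

chain = build_chain(sentences, order=2)
print(chain[("the", "son")])  # only "and" ever follows "the son"
print(chain[("son", "and")])  # either "the" or "heir"
```

With `order=1` the same function reduces to the one-word chain described earlier, where the state `("the",)` has three recorded successors.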

The higher the order of the chain, the more text needs to be fed into it to achieve the same level of 'arbitrariness', but the more the generated text seems to conform to correct English. In order to fake our wine reviews, we are going to train an order-two Markov chain on a web-scraped corpus of almost 9,000 wine reviews.

## The scraping
The corpus of wine reviews I chose to use was from www.winespectator.com. If you go to this site, you'll see that there are 709 pages of reviews. I used SelectorGadget to determine the XPath selector for the content I wanted and wrote a few Python scripts along these lines:

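```
#!/usr/bin/env python -tt
# Python 2 script: scrape every page of reviews and print one review per line

import urllib2
from lxml.html import fromstring
import sys
import time

urlprefix = "http://www.winespectator.com/dailypicks/category/catid/1/page/"

# there are 709 pages of reviews
for page in xrange(1, 710):
    try:
        # progress lines start with "-" so they can be filtered out later
        out = "-> On page {} of {}....      {}%"
        print out.format(page, "709", str(round(float(page)/709*100, 2)))
        response = urllib2.urlopen(urlprefix + str(page))
        html = response.read()
        dom = fromstring(html)
        sels = dom.xpath('//*[(@id = "searchResults")]//p')
        for review in sels:
            if review.text:
                print review.text.rstrip()
        sys.stdout.flush()
        # be polite to the server between requests
        time.sleep(2)
    except Exception:
        # skip any page that fails to download or parse
        continue
```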

and grabbed/processed it with shell code like this:

```
# capture output of script
./get-reviews.py | tee prep1.txt

# remove all lines that indicate progress of script
grep -E -v '^-' prep1.txt > prep2.txt

# add the words "BEGIN NOW" to the beginning of each line
cat prep2.txt | sed 's/^/BEGIN NOW /' > prep3.txt

# add the word "END" to the end of each line
cat prep3.txt | sed 's/$/ END/' > wine-reviews.txt
```
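The same post-processing could also be done in Python. The following is only a Python 3 sketch of an equivalent, not what I actually ran: `prepare_lines` is a hypothetical helper, the sample input is made up, and it assumes (as the shell version does) that progress lines are exactly those starting with "-".

```python
def prepare_lines(raw_lines):
    """Drop progress lines and wrap each review in BEGIN NOW ... END tokens."""
    prepared = []
    for line in raw_lines:
        line = line.rstrip("\n")
        if line.startswith("-"):
            continue  # progress output from the scraper
        prepared.append("BEGIN NOW " + line + " END")
    return prepared

# Made-up input resembling the scraper's output.
raw = [
    "-> On page 1 of 709....      0.14%\n",
    "A balanced red, with a spiced finish.\n",
]
print(prepare_lines(raw))
```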

This is a sample of what our text file looks like at this point:

```
BEGIN NOW A balanced red, with black currant, ... lowed by a spiced finish. END
BEGIN NOW Fresh and balanced, with a stony ... pear and spice. END
```

The "BEGIN NOW" tokens at the beginning of each line will serve as the initial state of our generative Markov process, and the "END" token will denote a stopping point.

Now comes the construction of the Markov chain, which will be represented as a Python dictionary. We can get away with not calculating the transition probabilities by just storing every word that occurs after each bi-gram (two words) in a list, which can be accessed by using the bi-gram as a key into the chain dictionary. We will then 'pickle' (serialize) the dictionary for use in the script that generates the fake review. The code is very simple and reads thusly:

```
#!/usr/bin/env python -tt

import pickle

chain = {}

def generate_trigram(words):
    # yield every run of three consecutive words
    if len(words) < 3:
        return
    for i in xrange(len(words) - 2):
        yield (words[i], words[i+1], words[i+2])

# each line of the corpus is one review wrapped in BEGIN NOW ... END
with open("wine-reviews.txt", "r") as fh:
    for line in fh:
        words = line.split()
        for word1, word2, word3 in generate_trigram(words):
            key = (word1, word2)
            if key in chain:
                chain[key].append(word3)
            else:
                chain[key] = [word3]

with open("chain.p", "wb") as out:
    pickle.dump(chain, out)
```
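To make the structure of the dictionary concrete, here is a small Python 3 sketch of the same idea run on two made-up reviews instead of the real corpus. It uses `collections.defaultdict` in place of the `if`/`else` branch, which is a stylistic choice of this sketch rather than part of the script above.

```python
from collections import defaultdict

# Two made-up lines in the same format as wine-reviews.txt.
lines = [
    "BEGIN NOW Fresh and balanced, with pear and spice. END",
    "BEGIN NOW Fresh and juicy, with a spiced finish. END",
]

chain = defaultdict(list)
for line in lines:
    words = line.split()
    # zipping three staggered views of the list yields every trigram
    for word1, word2, word3 in zip(words, words[1:], words[2:]):
        chain[(word1, word2)].append(word3)

print(chain[("BEGIN", "NOW")])  # every word that starts a review
print(chain[("Fresh", "and")])  # every word seen after "Fresh and"
```

Note that a successor appears in the list once for every time it was observed, so repeated words are kept rather than collapsed.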

Finally, the Python script to generate the review from the pickled Markov chain dictionary looks like this:

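```
#!/usr/bin/env python -tt

import pickle
import random

# load the chain built by the previous script
with open("chain.p", "rb") as fh:
    chain = pickle.load(fh)

new_review = []
sword1 = "BEGIN"
sword2 = "NOW"

while True:
    # slide the two-word state forward by one randomly chosen successor
    sword1, sword2 = sword2, random.choice(chain[(sword1, sword2)])
    if sword2 == "END":
        break
    new_review.append(sword2)

print ' '.join(new_review)
```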

The random.choice() function allows us to skip the calculation of the transition probabilities because it will choose from the list of possible next states in accordance with the frequencies at which they occur.
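A quick experiment can make this plausible. The Python 3 sketch below draws many times from a list that contains "son" twice and "heir" once (the successors of "the" in the lyrics example) and tallies the results. The seed and the number of draws are arbitrary choices for this illustration.

```python
import random
from collections import Counter

# Successors of "the" in the lyrics example: "son" twice, "heir" once.
successors = ["son", "heir", "son"]

rng = random.Random(42)  # seeded only so the experiment is repeatable
n_draws = 30000
draws = Counter(rng.choice(successors) for _ in range(n_draws))

# Print the observed share of each word; duplicates in the list act as weights.
for word, count in draws.items():
    print(word, round(count / n_draws, 3))
```

The observed shares should land close to two thirds and one third, which is why keeping duplicates in the successor lists stands in for explicit transition probabilities.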

## The results
Obviously, some generated reviews come out better than others. After playing with the generator for a while, I compiled a list of "greatest hits" and "greatest misses".

### Greatest hits

• Quite rich, but stopping short of opulent, this white sports peach and apricot, yet a little in finesse.
• Dense and tightly wound, with taut dark berry, black cherry and red licorice. A touch of toast.
• Delicious red licorice, blood orange and ginger, with nicely rounded frame.
• This stylish Australian Cabernet is dark, deep and complex, ending with a polished mouthful of spicy fruit and plenty of personality.

### Greatest misses

• From South Africa.
• Tropical fruit notes of cream notes.
• Here's a bright structure. Dry and austere on the finish.
• This has good flesh.
• Really enticing nose, with orange peel and chamomile for the vintage, this touts black currant, plum and meat notes. Flavors linger enticingly.
• Blackberry, blueberry and blackberry fruit, with hints of cream. Crunchy and fresh fruit character to carry the finish.

## Possibilities for improvement
The results are amazing, but the algorithm needs a little work before it will be able to fool a sommelier.

One major giveaway is the inclusion of contradictory descriptors in the same review. I don't know anything about wine (I drink Pepsi) but even I know that a wine should never be described as both "dry" and "sweet". One possible solution to this would be to use association mining to infer a list of complementary and discordant descriptors.
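One possible reading of "association mining" here is to measure how often two descriptors show up in the same review compared with what chance would predict (the "lift" used in association rule mining). The Python 3 sketch below is purely illustrative: the mini corpus, the descriptor list, and the function name `pair_lift` are all made up, and a real version would need proper tokenization and far more data.

```python
from itertools import combinations
from collections import Counter

# Made-up mini corpus and descriptor list, purely for illustration.
reviews = [
    "dry and crisp with citrus",
    "dry and austere with a crisp finish",
    "sweet and lush with honey",
    "sweet with honey and apricot",
]
descriptors = {"dry", "crisp", "sweet", "honey"}

def pair_lift(reviews, descriptors):
    """Return lift = P(a and b) / (P(a) * P(b)) for each descriptor pair."""
    n = len(reviews)
    single = Counter()
    pairs = Counter()
    for review in reviews:
        present = sorted(descriptors & set(review.split()))
        single.update(present)
        pairs.update(combinations(present, 2))
    lift = {}
    for a, b in combinations(sorted(descriptors), 2):
        if single[a] and single[b]:
            lift[(a, b)] = (pairs[(a, b)] / n) / ((single[a] / n) * (single[b] / n))
    return lift

for pair, value in sorted(pair_lift(reviews, descriptors).items()):
    print(pair, value)
```

Pairs with a lift well above 1 would be candidates for "complementary" descriptors, while pairs with a lift near 0 (they never appear together) would be candidates for "discordant" ones that the generator should avoid combining.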

Another clue that these reviews are nonsense is the indiscriminate chaining of clauses that have nothing to do with each other. I'm not quite sure how to solve this yet, but I have a few ideas.

An additional hiccup is that there are still grammatically incorrect sentences that creep through. One solution would be to identify and remove them. Unfortunately, this is much easier said than done. In the absence of a formal English grammar, we have to rely on less-than-perfect techniques like context-based identification and simple pattern-matching.

The last obvious problem is that some of the generated reviews are just too long. This increases the likelihood of containing contradictory descriptors and committing grammar errors, as with this review (which also exemplifies all of the problems stated above):

Luscious, sleek and generous with its gorgeous blueberry, raspberry and blackberry flavors , with hints of herbs, cocoa and graphite. The long, briary edge lingering on the nose and palate. Medium-bodied, with a modest, lightly juicy and brambly flavors of milk chocolate. Full-bodied, with fine focus and its broad, intense and vivid, with a tangy, lip-smacking profile. Light-weight and intense, with a deft balance.

In future posts, I hope to explore some of these avenues of improvement. I also plan to use part-of-speech tagging to automate an unusual game of wine review mad-libs.

Sooner, though, I'll explain the process I took to set up the fake wine review Twitter bot (@HorseWineReview) that I will use to experiment with different text-generation techniques.

### 28 Responses

1. James Burt March 28, 2014 / 7:56 am

I found your site while preparing a talk on Markov chains in poetry. Great post!

• [email protected] March 28, 2014 / 9:26 am

Thank you very much! I hope your talk goes well

2. Terry Davis March 9, 2015 / 10:24 am

I don't use markov. God said "no weights".

3. John November 20, 2015 / 2:00 pm

Adore this. I tried to do a similar thing to create fake grandiloquent foodstuffs but without the same depth of understanding of programming. The Markov chain is inspired and leads to more serendipity as you have shown. Equally it's good to see how your thinking was similar around discordant combinations and ways in which they might be intelligently programmed out.

• [email protected] December 10, 2015 / 10:26 am

Thanks for the kind words!

4. Foster Clark July 15, 2016 / 12:45 am

Hi Tony,
Thanks for this. I tried using quadgrams i.e. a 3 word key.
This removes most of the obvious gaffes.

Using this 'Database', I wondered if, for a given review, it would be possible to decide which wine it was using Bayesian inference?

5. Taya August 3, 2016 / 3:18 am

"This has good flesh" - ha! I found this post while researching how to build my own markov chain for more serious (academic) purposes. It is an awesome tool with so many applications... Thanks for the python!

6. Garrett April 8, 2017 / 2:01 pm

There's actually recent research showing the hot hand is real.

• [email protected] April 8, 2017 / 9:31 pm

That's pretty interesting; I'll have to look into it

7. Trenton Condrin April 9, 2017 / 1:08 am

I'm still learning and a bit too junior in my knowledge to implement this. However, thank you for the inspiration and easy to digest article.

• [email protected] April 10, 2017 / 10:25 am

Of course! Glad I can help :)

8. Geordie Martinez April 10, 2017 / 1:01 am

this is a great way to teach simple python too. I was having my students parse a text file for the most common words from Einstein's Credo, but this is way more entertaining.
Thank you!

• [email protected] April 10, 2017 / 10:25 am

9. Joe April 10, 2017 / 10:48 am

I'm currently studying NLP in school (independent study). This is interesting stuff!

• [email protected] April 10, 2017 / 11:00 am

10. Andrey April 11, 2017 / 3:14 am

11. Ian Mahuron April 11, 2017 / 7:12 am

Fun stuff. I like the iterative approach as it helps highlight Python's use as a general-purpose tool. A few constructive suggestions from a fellow nerd:

- Make use of stderr to separate your scraper's progress from its output. Redirect curl's output to a file to see this technique in action.

- A single, short awk command would replace your entire post-processing command chain. It's a powerful tool and well worth learning.

- Post-processing of reviews could be avoided completely by fixing up lines as you read them into the chain construction utility. I suspect you know this and the steps are simply a side effect of your workflow.

- collections.defaultdict could eliminate the branch in your chain construction routine. This is a performance versus readability tradeoff. How much? The timeit package would tell.

• [email protected] April 11, 2017 / 9:21 am

Your feedback is great! Thanks :)
Your suspicions were right; my workflow is very... step-by-step, you could say. I would have used points #2 and #3 if I thought about it before doing it.
Awk *is* really powerful! I use it in some of my other posts.
The STDERR is a great point! That would have been much better.