How to make an absurd twitter bot in python

In my last post, I outlined the steps I took to programmatically mimic the wine reviews of a dilettante sommelier. In this post, I'll explain the steps I took to create the twitter bot @HorseWineReview which combines a random wine with a random computer-generated review. I'll keep it short and sweet–the steps are as follows:

  • get a list of wines (from Freebase)
  • create a twitter account and application
  • write script to create and post the tweet
  • automate it with a cron job

Get a list of wines (from Freebase)
Freebase is a collaborative knowledge base that uses a graph database to store semantic information. Information can be retrieved by running MQL queries on their web interface or through a Google-powered API. We'll test a query first in their online query editor, then move to accessing the results from the API once we've verified that the query is constructed properly.
The query is very simple; it looks like this:

[{
  "name": null,
  "type": "/wine/wine"
}]

This will return a list of names of all entities of type "/wine/wine" in JSON format. This is an excerpt:

{
  "result": [
    {
      "type": "/wine/wine",
      "name": "1999 Domaine Romanee Conti La Tache"
    },
     .............

Now that we’ve confirmed that this query works and the syntax is correct, let's get the results by using the Google Freebase API. If you don't have one already, you need to get a Google Developers account. From the Google Developers Console, you need to grab a Freebase API key. After that, we can write and run a python script to retrieve and dump out all the wine names. You need the python module "freebase" which can be installed via pip. The code goes thusly:

#!/usr/bin/python
import freebase
import json
import urllib

api_key = "YOUR KEY HERE"
service_url = 'https://www.googleapis.com/freebase/v1/mqlread'

# MQL query: ask for the name of every entity of type /wine/wine
# (the huge limit is a crude way of requesting "all of them")
freebase_query = [{'name': None,
                   'limit': 999999999999999,
                   "type": "/wine/wine"}]

params = {"query": json.dumps(freebase_query),
          "key": api_key}

url = service_url + "?" + urllib.urlencode(params)
response = json.loads(urllib.urlopen(url).read())

for item in response['result']:
    print item['name'].encode('utf-8')

Run this script and redirect its output to a file.
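
For example, assuming the script is saved as get_wines.py (both file names here are just placeholders):

python get_wines.py > wine-list.txt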

This stores a wine name on each line of the file.

Create a twitter account for your bot and register an application
After following the standard procedure of setting up a twitter account, head over to dev.twitter.com and set up a developer account on behalf of your bot. Then head over to apps.twitter.com to create a new application; this will be the conduit through which we programmatically update your twitter bot's status. Make sure you read the Terms of Service and don't violate them, and make sure you give your application both read and write access. After this, navigate to the "API Keys" tab and record the following information: the API key, API secret, access token, and access token secret. We'll need these to authenticate from our auto-posting python script.

Write script to tweet
We'll use the Twython package to authenticate and serve as an interface to twitter.

A bare bones script to perform this task looks like this:

#!/usr/bin/python

import subprocess
import random
import re
import os
from twython import Twython

twitter = Twython("YOUR API KEY",
                  "YOUR API SECRET",
                  "YOUR ACCESS TOKEN",
                  "YOUR ACCESS TOKEN SECRET")

def output_tweet(text):
    twitter.update_status(status=text)
    os._exit(0)

# start above the 140-character limit so the loop below runs at least once
lengthoftweet = 999
# string to build tweet
tweet = ""

# keep drawing wines and reviews until the tweet fits in 140 characters
while lengthoftweet > 140:
    tweet = ""
    all_names = [wine.rstrip() for wine
                  in open("ABSOLUTE PATH TO WINE LIST FILE").read().split("\n")]
    wine_name = random.choice(all_names)
    tweet += wine_name + ": "
    rev = subprocess.check_output("ABSOLUTE PATH TO REVIEW GENERATOR")
    # we don't want a review to imply that the wine is either
    # red or white
    if "red" in rev:
        continue
    if "white" in rev:
        continue
    if len(tweet + rev.rstrip()) < 141:
        output_tweet(tweet + rev.rstrip())
    # if it is too long, try to use the first
    # sentence only (using regex)
    rev = re.search("(.+?\.).*", rev).group(1)
    tweet += rev.rstrip()
    lengthoftweet = len(tweet)
    
output_tweet(tweet)
os._exit(0)

A lot of the code is to ensure that the tweet doesn't exceed 140 characters. You might want to add a logging feature to the script–it will help with debugging the cron job.
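
A minimal sketch of what that could look like, using Python's standard logging module (the log file path is just a placeholder):

import logging

# log to an absolute path so it still works when the script is run from cron
logging.basicConfig(filename="/ABSOLUTE/PATH/TO/bot.log",
                    level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")

# then, wherever something interesting happens in the script:
logging.info("chose wine: %s", wine_name)
logging.info("candidate tweet (%d chars): %s", len(tweet), tweet)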

Automate it with a cron job
If you're on a Unix system, you can set this script up to run automatically at specified times using a cron job. Windows users can use the Windows Task Scheduler, but I've never used it, so you're on your own.

Cron jobs are notoriously hard to debug. The #1 problem encountered is not using absolute paths. When cron calls your script, the working directory is not where the script resides (unlike when you call the script on the command line from the same directory). Because of this, any file I/O in the script that uses relative paths will fail (quietly, if you don't add logging to the script).
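
One way to sidestep this (a sketch; the file name is just a placeholder) is to build any paths off of the script's own location:

import os

# directory the script itself lives in, regardless of cron's working directory
script_dir = os.path.dirname(os.path.abspath(__file__))
wine_list_path = os.path.join(script_dir, "wine-list.txt")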

The #2 problem you may encounter with cron jobs is that, because cron runs your job as a detached process outside the login environment, the shell it executes from may not be the one you normally use. Furthermore, it may not have the directories you need in its PATH, or any other environment variables that you depend on.
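
If you run into this, one common fix is to declare the variables you need explicitly at the top of your crontab (the values below are only examples):

SHELL=/bin/bash
PATH=/usr/local/bin:/usr/bin:/bin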

The #3 problem that often occurs in botched cron jobs is bad permissions. Make sure you have all the correct permissions (e.g., chmod +x YOUR_SCRIPT.py).

One final problem that I encountered was not putting an empty line after the cron entry–learn from my mistakes!

To add a cron entry, execute

crontab -e

in the shell.

In the editor, I wrote an entry along these lines (with my actual paths in place of the placeholders):
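
00 9,14,21 * * * /usr/bin/python /ABSOLUTE/PATH/TO/TWEET_SCRIPT.py >> /ABSOLUTE/PATH/TO/CRON.log 2>&1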

Your paths will obviously depend on what you are running and from where. Make sure you have an empty line after the entry!

To briefly explain this entry…

The first two numbers (00) specify the minute (0-59) on which to run the job. I chose to run it on the first (zeroth) minute of the hour. The second section (9,14,21) specifies the hours and tells cron to run the job at 9:00 am, 2:00 pm, and 9:00 pm. The next three sections (the asterisks) specify the "day of the month" (1-31), "month of the year" (1-12), and "day of the week" (0-7), respectively. The asterisks indicate that the job is to run every day of the month, every month of the year, and every day of the week.

The next two strings instruct cron to run the tweeting script (explained above), and the rest of the line redirects any output from logging to a text file called CRON.log.

First endnote: Why @HorseWineReviews?
The name pays homage to the infamous twitter bot @Horse_ebooks that (used to) post (unintentionally hilarious) context-free excerpts and Markov chain clumps from books about horses in a (successful) effort to avoid looking like a spam account whilst occasionally tweeting links to promote the sales of e-books.

Final endnote
While the task of tweeting fake wine reviews has already been taken, if you are looking for ideas for a twitter bot of your own, dear reader, you might want to explore the following ideas that I think would be a hit:

  • Train a Markov chain on equal parts famous philosophical works and vacuous, decidedly un-philosophical ramblings (a la KimKierkegaardashian and Kantye West). Philosophical corpora can be grabbed from Project Gutenberg. Vacuous babble can probably be obtained from choice subreddits or most of the trending hashtags on twitter.
  • Web-scrape a huge corpus of episode descriptions from various television shows, train a Markov chain on them, and let it loose on Twitter. You can get episode descriptions from tvrage.com.
  • Train a Markov chain on the abstracts of academic papers.

You’re welcome :)


How to fake a sophisticated knowledge of wine with Markov Chains

To the untrained (like me), wine criticism may seem like an exercise in pretentiousness. It may seem like anybody following a set of basic rules and knowing the proper descriptors can feign sophistication (at least when it comes to wine).

In this post, we will be exploiting the formulaic nature of wine reviews to automatically generate our own reviews that appear (at least to the untrained) to be legitimate.

Markov Chains
A Markov chain is a system that transitions between states using a random, memoryless process. The transition from one state to another is determined by a single random sample from a (usually discrete) probability distribution. Additionally, the current state wanders aimlessly through the chain according to these random transitions with no regard to its previous states.

A roll-of-the-dice board game can be likened to a Markov chain; the dice determine how many squares you move, and are in no way influenced by your previous rolls. Scores in basketball games appear to act this way as well (in spite of the myth of the 'hot hand'), and a gambler's earnings almost certainly hold the Markov property (see the Monte Carlo fallacy).

Many more phenomena can be appropriately modeled by a Markovian process... but language isn't one of them.

Markov Chains and Text Generation
The image below shows a Markov chain that is built from the following lyrics:

[Image: Markov chain of The Smiths' "How Soon is Now" lyrics]

I am the son
and the heir
of a shyness that is criminally vulgar
I am the son and heir
of nothing in particular

Here, each word is a state, and the transitions are based on the number of times one word appears after another. For example, "I" always precedes "am" in the text, so the transition from "I" to "am" occurs with certainty (p=1). Following the word "the", however, "son" occurs twice and "heir" occurs once, so the probabilities of those transitions are 2/3 and 1/3, respectively.

Text can be generated using this Markov chain by changing state until an "absorbing state" is reached, where there are no longer any transitions possible.
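
To make this concrete, here's a minimal sketch (not code from the bot) that builds the word-level chain from the lyrics above and wanders through it until it hits a word with no outgoing transitions:

import random

lyrics = ("I am the son and the heir of a shyness that is criminally vulgar "
          "I am the son and heir of nothing in particular").split()

# build the first-order chain: each word maps to the list of words observed
# to follow it (repeats in the list stand in for the transition probabilities)
chain = {}
for current_word, next_word in zip(lyrics, lyrics[1:]):
    chain.setdefault(current_word, []).append(next_word)

# e.g. chain["the"] == ['son', 'heir', 'son'], so "the" -> "son" with p = 2/3

def generate(start="I"):
    words = [start]
    while words[-1] in chain:   # stop once we hit an absorbing state
        words.append(random.choice(chain[words[-1]]))
    return " ".join(words)

print generate()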

Given "I" as an initial state, two possible text generations are

  • I am the heir of nothing in particular
  • I am the son and the son and the son and the son and the heir of a shyness that is criminally vulgar.

Very often, the generated text violates basic rules of grammar; after all, the transitions are "dumb" stochastic processes without knowledge of grammar and semantics.

Instead of a memoryless chain, though, we can build a chain where the next state depends on the last n states. This can still satisfy the Markov property if we view each state as holding n words. When using these 'higher order' chains to generate text, something very interesting happens. Since the states are now made up of clauses and phrases (instead of words) the generated text seems to magically follow (some of) the rules of grammar, while still being devoid of semantic sense.

The higher the order of the chain, the more text needs to be fed into it to achieve the same level of 'arbitrariness'–but the more the generated text seems to conform to actual correct English. In order to fake our wine reviews, we are going to train an order-two Markov chain on a web-scraped corpus of almost 9,000 wine reviews.

The scraping
The corpus of wine reviews I chose to use was from www.winespectator.com. If you go to this site, you'll see that there are 709 pages of reviews. I used SelectorGadget to determine the XPath selector for the content I wanted and wrote a few python scripts along these lines:

#!/usr/bin/env python -tt
 
import urllib2
from lxml.html import fromstring
import sys
import time
 
urlprefix = "http://www.winespectator.com/dailypicks/category/catid/1/page/"
 
# loop over all 709 pages of reviews
for page in xrange(1, 710):
    try:
        out = "-> On page {} of {}....      {}%"
        print out.format(page, "709", str(round(float(page)/709*100, 2)))
        response = urllib2.urlopen(urlprefix + str(page))
        html = response.read()
        dom = fromstring(html)
        sels = dom.xpath('//*[(@id = "searchResults")]//p')
        for review in sels:
            if review.text:
                print review.text.rstrip()
        sys.stdout.flush()
        time.sleep(2)
    except:
        # skip any page that fails to download or parse
        continue

and grabbed/processed it with shell code like this:

# capture output of script
./get-reviews.py | tee prep1.txt

# remove all lines that indicate progress of script
grep -E -v '^-' prep1.txt > prep2.txt

# add the words "BEGIN NOW" to the beginning of each line
cat prep2.txt | sed 's/^/BEGIN NOW /' > prep3.txt

# add the word "END" to the end of each line
cat prep3.txt | sed 's/$/ END/' > wine-reviews.txt

This is a sample of what our text file looks like at this point:

BEGIN NOW A balanced red, with black currant, ... lowed by a spiced finish. END
BEGIN NOW Fresh and balanced, with a stony ... pear and spice. END

The "BEGIN NOW" tokens at the beginning of each line will serve as the initial state of our generative Markov process, and the "END" token will denote a stopping point.

Now comes the construction of the Markov chain, which will be represented as a python dictionary. We can get away with not calculating the probabilities of the transitions by storing every word that occurs after each bi-gram (pair of words) in a list that can be accessed using the bi-gram as the key to the chain dictionary. We will then 'pickle' (serialize) the dictionary for use in the script that generates the fake review. The code is very simple and reads thusly:

#!/usr/bin/env python -tt
 
import pickle
 
fh = open("wine-reviews.txt", "r")
 
chain = {}
 
def generate_trigram(words):
    # yield every consecutive three-word window in the review
    if len(words) < 3:
        return
    for i in xrange(len(words) - 2):
        yield (words[i], words[i+1], words[i+2])
 
for line in fh.readlines():
    words = line.split()
    for word1, word2, word3 in generate_trigram(words):
        key = (word1, word2)
        if key in chain:
            chain[key].append(word3)
        else:
            chain[key] = [word3]
 
pickle.dump(chain, open("chain.p", "wb" ))

Finally, the python script to generate the review from the pickled Markov chain dictionary looks like this:

#!/usr/bin/env python -tt
 
import pickle
import random
 
chain = pickle.load(open("chain.p", "rb"))
 
new_review = []
sword1 = "BEGIN"
sword2 = "NOW"
 
while True:
    sword1, sword2 = sword2, random.choice(chain[(sword1, sword2)])
    if sword2 == "END":
        break
    new_review.append(sword2)
 
print ' '.join(new_review)

The random.choice() function allows us to skip the calculation of the transition probabilities because it will choose from the list of possible next states in accordance with the frequencies at which they occur.

The results
Obviously, some generated reviews come out better than others. After playing with the generator for a while, I compiled a list of "greatest hits" and "greatest misses".

Greatest hits

  • Quite rich, but stopping short of opulent, this white sports peach and apricot, yet a little in finesse.
  • Dense and tightly wound, with taut dark berry, black cherry and red licorice. A touch of toast.
  • Delicious red licorice, blood orange and ginger, with nicely rounded frame.
  • This stylish Australian Cabernet is dark, deep and complex, ending with a polished mouthful of spicy fruit and plenty of personality.

Greatest misses

  • From South Africa.
  • Tropical fruit notes of cream notes.
  • Here's a bright structure. Dry and austere on the finish.
  • This has good flesh.
  • Really enticing nose, with orange peel and chamomile for the vintage, this touts black currant, plum and meat notes. Flavors linger enticingly.
  • Blackberry, blueberry and blackberry fruit, with hints of cream. Crunchy and fresh fruit character to carry the finish.

Possibilities for improvement
The results are amazing, but the algorithm needs a little work before it will be able to fool a sommelier.

One major giveaway is the inclusion of contradictory descriptors in the same review. I don't know anything about wine (I drink Pepsi) but even I know that a wine should never be described as both "dry" and "sweet". One possible solution to this would be to use association mining to infer a list of complementary and discordant descriptors.

Another clue that these reviews are nonsense is the indiscriminate chaining of clauses that have nothing to do with each other. I'm not quite sure how to solve this, yet, but I have a few ideas.

An additional hiccup is that there are still grammatically incorrect sentences that creep through. One solution would be to identify and remove them. Unfortunately, this is much easier said than done. In the absence of a formal English grammar, we have to rely on less-than-perfect techniques like context-based identification and simple pattern-matching.

The last obvious problem is that some of the generated reviews are just too long. This increases the likelihood of containing contradictory descriptors and committing grammar errors, as with this review (which also exemplifies all of the problems stated above):

Luscious, sleek and generous with its gorgeous blueberry, raspberry and blackberry flavors , with hints of herbs, cocoa and graphite. The long, briary edge lingering on the nose and palate. Medium-bodied, with a modest, lightly juicy and brambly flavors of milk chocolate. Full-bodied, with fine focus and its broad, intense and vivid, with a tangy, lip-smacking profile. Light-weight and intense, with a deft balance.

In future posts, I hope to explore some of these avenues of improvement. I also plan to use part-of-speech tagging to automate an unusual game of wine review mad-libs.

Sooner, though, I'll explain the process I took to set up the fake wine review twitter bot (@HorseWineReview) that I will use to experiment with different text-generation techniques.
