Getting band and artist names with NLTK

  • By John
  • February 28th, 2011

Using the Natural Language Toolkit for Python to extract band and artist names

In the newest version of Gatekrash, listings are sometimes accompanied by descriptions. These often contain a variety of information, such as the dress code, minimum age, drinks offers, parking information, and so on. Quite often they also contain band and artist names, telling you what kind of music you can expect to hear, or who you can expect to hear.

Extracting that information reliably is difficult. There are several ways of approaching the problem, but I opted to use named entity recognition, a process which extracts 'named entities' (people, places, organisations, etc.) from text. Building my own system would be very difficult and very pointless, especially when there are already tried and tested pieces of software that do the job well.

That's where the NLTK steps in. The Natural Language Toolkit is a collection of open source Python modules which assist in text analytics - including named entity recognition. Downloading and installing the NLTK is a simple process and only an apt-get away.

The way it works is like this:

  • Tokenise the text string - splitting the text into words, sentences and so on
  • Tag the words in the text string - assign word types (parts of speech) to the tokenised words, such as noun, pronoun, adverb, adjective, etc.
  • Chunk the tagged text string - organise the text string into distinct elements of a sentence (using a parse tree), for example S for Sentence, NP for Noun Phrase, Det for Determiner, etc.
  • Analyse the parse tree to extract information - this is the part where named entities are extracted from the chunked text (parse tree)
  • Filter out everything except 'PERSON' entities (i.e. ignore ORGANIZATION, LOCATION, etc. entities)
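
As a rough sketch of how those steps map onto NLTK calls, here is one way to write them out one at a time. It assumes NLTK 3 with its tokeniser, tagger and chunker data already installed, and the names extract_person_names and find_people are made up for illustration rather than taken from Gatekrash.

```python
def extract_person_names(chunked_sentence):
    """Walk one chunked sentence and keep only the PERSON entities."""
    names = []
    for chunk in chunked_sentence:
        # Named entities are subtrees with a label; ordinary tokens are
        # plain (word, tag) tuples, which have no label() method.
        if hasattr(chunk, "label") and chunk.label() == "PERSON":
            names.append(" ".join(word for word, tag in chunk.leaves()))
    return names


def find_people(text):
    """Run the full pipeline over a block of text (hypothetical helper)."""
    import nltk  # imported here so the filter above can be used on its own

    people = []
    for sentence in nltk.sent_tokenize(text):      # tokenise into sentences
        tokens = nltk.word_tokenize(sentence)      # tokenise into words
        tagged = nltk.pos_tag(tokens)              # tag each word with a part of speech
        tree = nltk.ne_chunk(tagged)               # chunk into a tree with entity subtrees
        people.extend(extract_person_names(tree))  # analyse the tree, keep PERSON only
    return people
```

Keeping the filter separate from the NLTK calls means it can be tried on a hand-built sentence without loading any models.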

The NLTK provides all these features, and relies only on training corpora to interpret text properly. In fact, it's so simple to get the NLTK to do this, I'll show you the actual code I used in Gatekrash:

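import re

import nltk

def _getPerformances(self, text):
    performers_list = []
    # Strip any text matching this pattern before tokenising
    text = re.sub(r'\W+\d+\s+.,\'"&', '', text)
    try:
        for sent in nltk.sent_tokenize(text):
            # Tokenise, tag and chunk each sentence into a tree
            for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
                # Entities are labelled subtrees (NLTK 3 uses label(); older releases used .node)
                if hasattr(chunk, 'label') and chunk.label() == "PERSON":
                    performer = ' '.join(c[0] for c in chunk.leaves())
                    performers_list.append(performer)
    except Exception:
        print("      ERROR: Couldn't perform named entity recognition on this text")
    return performers_list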

I actually slightly expanded on some readily available NLTK examples by Tim McNamara (I added the 'PERSON' filter and the try-except handling). Named entity recognition is fairly processor-intensive, with the cost depending mostly on the size of the text you're analysing. Importing the NLTK and running the first named entity recognition pass takes significantly longer than any repeat pass if you're running it in a loop (as I am in this case).
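
If you want to see that warm-up cost for yourself, a small timing helper is enough. This is only a sketch: time_calls is a hypothetical helper, not part of the NLTK, and it will time any function you hand it.

```python
import time


def time_calls(func, inputs):
    """Return the wall-clock duration, in seconds, of func(item) for each item."""
    durations = []
    for item in inputs:
        start = time.perf_counter()
        func(item)  # the result is discarded; only the elapsed time is kept
        durations.append(time.perf_counter() - start)
    return durations
```

Passing in your own named entity recognition function and a list of descriptions should show the first duration standing well apart from the rest, since that is when the models get loaded.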