There was a deep yellow field in the woods, and I felt the sudden shift of color. I couldn’t help but notice that as soon as they left the field, I had no idea what to do with her. It was the most beautiful place I had ever seen in my entire life, and it seemed to be the most important thing in the world. In spite of the fact that he was walking away from her, he gave me a final nod. The wind beat against the forest floor, Violet’s blood flowing through her veins.
You have read articles that were written by machine learning algorithms. The short story leading this article was generated by a machine learning model that was trained to caption images; it learned to caption using the language of romance novels. Stock market evaluations, weather forecasts, school report cards, and real estate listings are just a few examples of text that is being entrusted to algorithms. More and more, these models will replace copywriters. It might not be too long before they are used to generate full novels in their entirety.
This isn’t the near future; this is the now. This kind of work belongs to a subfield of data science called natural language processing (NLP). Within the realm of NLP you will find applications ranging from text sentiment analysis to natural language generation, with a whole range of applications in between. Here’s a flavor of the kind of work being undertaken in the world of natural language processing:
- Search and autocomplete: How can Google guess the next word you are going to type into that search bar? It has compiled data on millions (hundreds of millions?) of searches and can probabilistically estimate what word you are likely to type after “best tuna”.
- Topic analysis: Determining the major themes running through a segment of text. This can be thought of as clustering the major ideas, common words, and patterns in the chosen piece of text.
- Sentiment analysis: Determination of the emotion or opinion behind the text. Does a tweet that reads: “The President’s plan on infrastructure is just the GREATEST the world has ever seen!” express positive emotion or is there some sarcasm there that might reveal a negative attitude? Sentiment analysis attempts to uncover those expressions.
- Dialog (chatbots): These interfaces are likely to become much more common soon. Receiving text from users and responding to it coherently, with the answer the customer needs, is a field that could be lucrative if done correctly. If you are interested in Python programming and development, there are some open-source chatbots available to explore.
- Text Summarization: Is what it sounds like — taking a length of text and condensing it down into a few words. There are some existing open-source repositories that are very good at these tasks already. Many are easy to use too: for example, I just grabbed the Wikipedia page for Natural Language Processing and had summa summarize it in 250 words:
In recent years, there has been a flurry of results showing deep learning techniques achieving state-of-the-art results in many natural-language tasks, for example in language modeling, parsing, and many others. Formerly, many language-processing tasks typically involved the direct hand coding of rules, which is not in general robust to natural-language variation. The machine-learning paradigm calls instead for using statistical inference to automatically learn such rules through the analysis of large corpora of typical real-world examples (a corpus (plural, “corpora”) is a set of documents, possibly with human or computer annotations). Automatic learning procedures can make use of statistical inference algorithms to produce models that are robust to unfamiliar input (e.g. containing words or structures that have not been seen before) and to erroneous input (e.g. with misspelled words or words accidentally omitted). The difficulty of this task depends greatly on the complexity of the morphology (i.e. the structure of words) of the language being considered. Given a sentence, determine the part of speech for each word. However, some written languages like Chinese, Japanese and Thai do not mark word boundaries in such a fashion, and in those languages text segmentation is a significant task requiring knowledge of the vocabulary and morphology of words in the language. Given a sentence or larger chunk of text, determine which words (“mentions”) refer to the same objects (“entities”). One task is identifying the discourse structure of connected text, i.e. the nature of the discourse relationships between sentences (e.g. elaboration, explanation, contrast).
- Natural Language Generation: A very complicated subject area. Once you start thinking about how you write, and about what makes writing good, sensible, and complete, you start to get an impression of how difficult it is to generate natural language. Briefly, one major problem is retaining a coherent thought across words, then across sentences, paragraphs, and chapters. Teaching the model to remember the thought being expressed, along with the interweaving plots, is daunting.
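The autocomplete idea mentioned above can be sketched with a toy bigram model: count which word follows each word in a corpus of queries, then suggest the most frequent continuations. The query data here is made up purely for illustration; it is not how Google actually ranks suggestions.

```python
from collections import Counter, defaultdict

# A toy stand-in for a search-query log (hypothetical data).
searches = [
    "best tuna salad recipe",
    "best tuna melt",
    "best tuna salad sandwich",
    "best tuna casserole",
]

# Count which word follows each word across all queries.
following = defaultdict(Counter)
for query in searches:
    words = query.split()
    for prev, nxt in zip(words, words[1:]):
        following[prev][nxt] += 1

def suggest(prefix, k=3):
    """Return the k most frequent next words after the last word typed."""
    last = prefix.split()[-1]
    return [word for word, _ in following[last].most_common(k)]

print(suggest("best tuna"))  # "salad" ranks first (it follows "tuna" twice)
```

Real systems layer far more on top of this (personalization, spelling correction, whole-query models), but frequency counts over observed text are the core of the idea.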
The chart on the left, created to explore how long until an artificially intelligent bot takes your job, estimates that within 20–40 years these algorithms will write a novel that makes the New York Times best-seller list. Before writing that novel, though, they could be generating the next cheesy, somewhat incoherent teen romance novel that has become so common.
In fact, each November a group of individuals composed not only of data scientists but also devoted writers and authors participates in an international writing sprint. The challenge in NaNoWriMo is to finish a 50,000-word novel in the month. It’s a challenge that forces aspiring writers to stop procrastinating and start doing. If a machine learning algorithm has successfully participated in this contest, I don’t know of it (it’s not that difficult to get an algorithm to crank out something; the trick is to make it not an entire mess). However, there are data scientists who have taken baby steps in the month — Janelle Shane trained a neural network to generate the first sentence of a novel.
I pulled onto the river at the exact same moment. It was a vast expanse of land, and most of the time, I wondered what it would be like to walk away from my father, but he hadn’t seen her in a very long time. The sun had begun to rise above the lake and the forest thinned out, making it difficult for him to follow. But it was also true that she had no idea what she was capable of. Her words burned in her mind, and that was a great deal of power. She could barely walk off the cliffs, much longer to mark it.
It is difficult to estimate the degree to which organizations or individuals are using these algorithms in their commercial applications. Many aren’t divulging it publicly. And who can really blame them for wanting anonymity — knowing that automation is behind the words you’re reading is unsettling. As if the machines have already taken over. As if humanity is ceding ground to the inevitable tide of non-humanity, or semi-humanity.
But what are these machines writing?
As of today, the applications in commercial use are a kind of mad-lib-style fill-in-the-blank. As long as the text you’re writing follows some formula, and you have the inputs to that formula, you can convincingly write an article. The Robo-Realtor from Fast Forward Labs is a good example of what I mean:
Fast Forward Labs, a venture started by acclaimed data scientist Hilary Mason, demonstrates that if you know the location, the number of bedrooms, the number of bathrooms, the square footage, and other standard attributes, you can devise a system that automatically generates real estate listings for human consumption. It isn’t difficult to understand conceptually. The only limiting factor these days seems to be access to clean data to model your area of interest, but that limit on formatted data may be lessening.
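To make the mad-lib idea concrete, here is a minimal sketch of template-based listing generation. It is not Fast Forward Labs' actual system, and the attribute names and template wording are invented for illustration; the point is only that structured fields plus a fill-in-the-blank template yield readable text.

```python
# Hypothetical structured data for one property.
listing = {
    "neighborhood": "Park Slope",
    "bedrooms": 3,
    "bathrooms": 2,
    "sqft": 1450,
}

# A fill-in-the-blank template; production systems would choose among
# many templates and vary the phrasing.
TEMPLATE = (
    "Charming {bedrooms}-bedroom, {bathrooms}-bath home in {neighborhood} "
    "offering {sqft} square feet of living space."
)

def generate_listing(data):
    """Fill the template's blanks from the structured record."""
    return TEMPLATE.format(**data)

print(generate_listing(listing))
```

Swap in a table of thousands of such records and you have an automatic copywriter for this one narrow format — which is exactly why formulaic text was automated first.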
Using structured data to generate text
The concepts in these papers have been codified into open-source software available via GitHub. Now anyone familiar with these concepts can grab the code and data to experiment on their own. Think about what other data might be locked up within tables. Unlocking this information could help open new areas of natural language generation applications.
So what’s going on with these captioned images that are interspersed throughout this post?
In the spirit of Valentine’s Day, I played around with neural-storyteller. The storyteller idea, implementation, and code were developed by computer scientist Jamie Kiros. The stories in this article were generated using her code. It wasn’t hard for me to grab her repository here, launch an Amazon instance with a GPU and machine learning utilities, load new data, and generate these stories.
It does take a bit of knowledge and skill to set up — so if you would like to see a technical walk-through of the process, let me know!
Think about what this neural network must be doing to generate these stories. First, it needs to recognize key items within the picture itself. The following image has very recognizable features: the fireplace, dining room, and mantel. After identifying those features, it can associate them with sentences it has been trained on and insert some of them into the paragraph. The neural networks in this example were trained on romantic text and thus produce it when shown these images. If we were to train the network on adventure books, for example, you might see results like the following.
A view of the near future
The individuals behind Botnik Studios are at the forefront of efforts to use machine learning to generate content. Billing themselves as a community of writers, artists, and developers, Botnik has gained press coverage by building models that help generate Seinfeld episodes, Jeff Bezos quotes, and Christmas songs, among many others. They received a significant amount of press for generating a new chapter in the Harry Potter series. The infrastructure they have developed, predictive keyboards, suggests words to add to your story. Like Google search, Botnik helps finish your sentence. What’s also great about Botnik is that you can join them, right now. They have a Slack channel that, last time I checked, was open to new members.
But what about real story generation from scratch? How bad is it? The answer is that these story generators still need an input prompt. Researchers and journalists are continually interested in seeing just how well these algorithms can write, and someday they are likely to be pretty good at it. For now, mad-lib-style and assisted writing will dominate. With the vast amount of data on the human experience available via the internet, though, these algorithms will eventually figure out how to write a good story.
Maybe they already have 😉
Above the living room, I noted with fireplace and dining room. I expected to be honest, but I didn’t want anything to do with my own. It was one of the most beautiful furnishings I had ever met. In that moment, it seemed as if he had no idea what was going on inside and out. The room was full of pillows, pillows, and pillows that filled the living room. She kept her gaze fixed on Liam ‘s fireplace, which she could very well appreciate.
Technical Bonus material
If you are interested in learning more about NLP and are familiar with the Python programming language, you won’t have much trouble diving into the following libraries.
- Natural Language Toolkit (NLTK): A foundational library for natural language processing. It comes equipped with sample texts that can be used as example input.
- Gensim: A library useful for topic modeling and other statistical analyses of text, including TF-IDF analysis.
- ChatterBot: Just one example of a library built for chatbots. It describes itself as “a machine-learning based conversational dialog engine built in Python which makes it possible to generate responses based on collections of known conversations.”
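As a taste of the TF-IDF analysis Gensim handles for you, here is a plain-Python sketch of the score itself. The toy corpus is invented for illustration; in practice you would tokenize with NLTK and use Gensim's `TfidfModel` rather than computing this by hand.

```python
import math

# A tiny corpus: each document is a list of tokens.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs are pets".split(),
]

def tf_idf(term, doc, corpus):
    """Term frequency in this document, weighted down by how many
    documents in the corpus contain the term at all."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / df)
    return tf * idf

# "cat" appears only in the first document, so it scores higher there
# than the common word "the", which shows up in two documents.
print(tf_idf("cat", docs[0], docs))
print(tf_idf("the", docs[0], docs))
```

Words that are frequent in one document but rare across the corpus get the highest scores, which is why TF-IDF is a workhorse for topic and keyword analysis.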
Have other modules you commonly use for your NLP research? Let me know!