wordwatching

Reddit Ngram Viewer review

Recently, Randy Olson put together a tool that analyzes language data from the Reddit site, which attracts millions of visitors every month. Based on the same principles as the Google Ngram Viewer, but with a very different body of data at its disposal, the Reddit Ngram Viewer allows users to analyze a dynamic collection of electronic data comprising a wide range of linguistic variation. Like the Google Ngram Viewer, which I used in an earlier post on moonshine variants in the Rocky Mountains, the tool is free and relatively easy to use. Some of the more interesting examples of what people have found in the Reddit corpus using the tool can be found here.

In this post, I present an exercise that I conducted using the tool to point out some of its features. Specifically, I looked at references in the Reddit corpus to some of the major American holidays, as seen in the figure below:

Figure 1: References to major American holidays in the Reddit corpus (Lamont Antieau, wordwatching.org)

Christmas is quite clearly the most frequently mentioned holiday in the Reddit corpus; in fact, its relatively high frequency obscures how the other holidays fare in the corpus. To highlight these less frequently used holiday names, I omitted Christmas as a search term. I also omitted Eve, so that the search term New Year's would account for references to both December 31st and January 1st. The results are below:

Figure 2: References to major American holidays (except Christmas) in the Reddit corpus (Lamont Antieau, wordwatching.org)

As the figure shows, Halloween, Thanksgiving, and Easter are the most frequently mentioned holidays in the corpus once Christmas is omitted as a search term. Even on the surface, then, this exercise shows how useful the tool can be for anyone interested in seeing a snapshot of language use in the corpus.

Now, some of the particulars. The number of terms that can be searched at once seems to depend on the length of the terms (each term can be one to three words long), but the most I was able to search for at one time was 10. That is plenty, given how crowded the graph becomes when too many terms are plotted. Also, if the terms differ greatly in how frequently they appear in the corpus, the least-used ones get lost near the bottom of the graph. In this round, that meant leaving out less common holiday names such as MLK Day, Memorial Day, Juneteenth, Independence Day, and so on. Again, this isn't a real problem in general: if one orders the terms by frequency, high-ranking terms can be omitted to make room for lower-ranking ones until the list of terms to be searched is exhausted.
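The ordering approach described above can be sketched in a few lines of Python. This is only an illustration: the function name `batches` is made up here, the limit of 10 is simply the most terms I managed to search at once, and the ranking in the usage example is an assumed order, not one measured from the corpus.

```python
def batches(ranked_terms, size=10):
    """Split terms, ordered from most to least frequent, into groups
    of at most `size` terms, so that each group can be graphed on its
    own with terms of roughly similar frequency."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [ranked_terms[i:i + size] for i in range(0, len(ranked_terms), size)]


# Assumed ranking, for illustration only (not measured from the corpus).
ranked = [
    "Christmas", "Halloween", "Thanksgiving", "Easter",
    "MLK Day", "Memorial Day", "Juneteenth", "Independence Day",
]

# A small batch size keeps the example short.
for group in batches(ranked, size=4):
    print(", ".join(group))
```

Each group would then be entered as a separate search, so that the rarer holiday names are not flattened by the most common ones.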

Olson points out the demographic limitations of the Reddit corpus, but these aren't really a problem as long as one is aware of them and doesn't try to generalize too much from the data. Beyond those, the only real issue with the tool is the inability (or at least the apparent inability) to use a wildcard to collapse related terms that are similar but not identical in form, as in this case:

Figure 3: References to Valentine's Day and Valentines Day in the Reddit corpus (Lamont Antieau, wordwatching.org)

Although it is uncertain whether adding the two variants together would allow a combined Valentine* Day to overtake Easter in Figure 2 above, it would give users a better idea of how variants that clearly refer to the same thing match up against terms for which such differences are not an issue, such as Christmas or Easter.
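If the per-period frequencies for each term could be obtained outside the tool (an assumption; I have not established that the viewer offers an export), the wildcard behavior described above could be approximated by hand. The sketch below uses Python's standard `fnmatch` module; the function name `collapse` and the numbers are placeholders, not values from the Reddit corpus.

```python
from fnmatch import fnmatchcase


def collapse(series_by_term, pattern):
    """Sum, period by period, the frequencies of every term that
    matches a shell-style wildcard pattern (case-sensitive)."""
    matching = [
        series for term, series in series_by_term.items()
        if fnmatchcase(term, pattern)
    ]
    if not matching:
        return []
    # zip(*matching) lines up the same period across all matching terms.
    return [sum(values) for values in zip(*matching)]


# Placeholder frequencies for three periods; NOT real Reddit data.
series_by_term = {
    "Valentine's Day": [3.0, 5.0, 4.0],
    "Valentines Day": [1.0, 2.0, 1.5],
    "Easter": [4.5, 6.0, 5.0],
}

combined = collapse(series_by_term, "Valentine* Day")
print(combined)
```

The combined series could then be compared against Easter directly, which is the comparison the tool itself does not seem to allow.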

Despite this shortcoming, I'm looking forward to playing more with the tool and seeing what it can teach me about language variation, and perhaps even language change, given how quickly linguistic signs can evolve in the digital age.
