Visualization Strategies: Text & Documents
Whether it’s a campaign speech by a presidential contender or a 300-page bestselling novel, large bodies of text are among the most requested subjects for condensing into an infographic.
The purpose can vary from highlighting specific relations to contrasting points or use of language, but all of the following methods focus on distilling a volume of text down to a visualization.
Volumetric Comparisons:
Tag Clouds & Wordles:
Among the most common visualizations is the so-called ‘Tag Cloud’, which is just a list of words or descriptions sized by some relevance measure, usually the number of repetitions in the data set.
Tag clouds are often used to highlight and compare themes within a document, for example stripping a U.S. State of the Union speech down to its major keywords (Iraq, Budget, etc.).
Closely related to tag clouds are wordles, which are more artistically arranged (and often vibrantly colored) versions of a text. They tend to be less directly insightful as an infographic, but often give a more personal feel to a document.
A number of free tools exist online to create Tag Clouds and Wordles. I’ve personally found the tag cloud tools at Many Eyes to be excellent, as well as the versions at Swivel. Wordles originated at Wordle.net.
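For anyone who would rather roll their own, the sizing step behind a tag cloud can be sketched in a few lines of Python. This is only a minimal illustration: the function name tag_cloud_sizes, the linear scaling and the 10 to 48 point range are my own assumptions, not how Many Eyes or Wordle actually do it.

```python
import re
from collections import Counter

def tag_cloud_sizes(text, top_n=10, min_pt=10, max_pt=48):
    """Map the top_n most frequent words to font sizes in points."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words).most_common(top_n)
    if not counts:
        return {}
    hi = counts[0][1]    # count of the most frequent word
    lo = counts[-1][1]   # count of the least frequent word we keep
    span = (hi - lo) or 1  # avoid dividing by zero when all counts match
    # Linear scaling (an assumption): lowest count -> min_pt, highest -> max_pt
    return {
        word: round(min_pt + (n - lo) / span * (max_pt - min_pt))
        for word, n in counts
    }

sample = "budget iraq budget economy iraq budget"
print(tag_cloud_sizes(sample))
```

Linear scaling is the simplest choice; with real text a logarithmic scale often works better, since a handful of very common words otherwise dwarf everything else.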
Word Spectrum Diagrams
Chris Harrison introduced these Word Spectrum Diagrams with a project meant to show related word bi-grams, or common word pairs.
The technique could easily be adapted to show relations between other sets of words, such as common idiomatic phrases used by speakers, or the word preferences of an author.
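The raw data behind a diagram like this is just a count of adjacent word pairs. A rough sketch of that counting step follows; the function name top_bigrams is hypothetical, and this is not a description of Chris Harrison's actual pipeline.

```python
import re
from collections import Counter

def top_bigrams(text, n=5):
    """Return the n most common adjacent word pairs with their counts."""
    words = re.findall(r"[a-z]+", text.lower())
    # zip(words, words[1:]) pairs each word with the word that follows it
    pairs = zip(words, words[1:])
    return Counter(pairs).most_common(n)

print(top_bigrams("new york times and new york city", n=1))
```

Note that this simple version also pairs the last word of one sentence with the first word of the next; splitting into sentences first would avoid that.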
Structure and Document Flow
Document Contrast Diagrams
Document contrast diagrams use the familiar bubble technique and effective use of color to contrast topic usage in two bodies of text.
Unlike many of the other infographic techniques featured here, DCDs help highlight the key differences between texts as well as the similarities.
They’re discussed in more depth at Neoformix.
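The numbers that would feed such a diagram can be approximated as follows. This is a sketch under my own assumptions (bubble size taken from the combined count, color taken from a balance score between -1 and +1), not the method Neoformix uses; the function name contrast is hypothetical.

```python
import re
from collections import Counter

def contrast(text_a, text_b, top_n=10):
    """Return (word, combined_count, balance) rows for the top_n words."""
    count_a = Counter(re.findall(r"[a-z]+", text_a.lower()))
    count_b = Counter(re.findall(r"[a-z]+", text_b.lower()))
    totals = count_a + count_b
    rows = []
    for word, total in totals.most_common(top_n):
        a, b = count_a[word], count_b[word]
        # balance: +1.0 = only in text A, -1.0 = only in text B, 0.0 = even
        balance = (a - b) / total
        rows.append((word, total, balance))
    return rows

for row in contrast("iraq iraq economy", "economy health"):
    print(row)
```

If the two texts differ a lot in length, the raw counts should be divided by each document's word count first, otherwise the longer text will dominate every bubble.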
Literary Organism Maps
This next visualization technique comes from Stefanie Posavec, and while a bit less obvious than the others, offers some intriguing possibilities.
It purports to offer a scaled ‘map’ of where each chapter goes within a textual context. While not as intuitive as tag clouds or word maps, variations of this style could be used to track ‘threaded’ text like conversation transcripts.
Her project page has even more amazing work in the same vein, although most of it is more artistic than analytical.
Word Trees
Word Trees are another document flow visualization from the folks at IBM Many Eyes.
They help provide context to unstructured text, showing the relations between major words and phrases and their follow-ups at a glance.
Font scaling helps to show importance and relative frequency among the parts, and easy searching lets a user follow a path from one concept to another throughout the text body.
By treating follow up phrases as links (almost like a web hyperlink), a user can easily navigate a speech by subject, which makes following a large body of text by theme a breeze.
An obvious application of this technique would be highlighting correspondence between two parties, whether by letters or telephone/email transcripts.
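The underlying structure is a tree of "what follows this word". A minimal sketch of building one is below; the nested-dict layout and the name build_word_tree are my own choices for illustration, not the Many Eyes implementation.

```python
import re

def build_word_tree(text, root, depth=3):
    """Nested dict of the words that follow `root`, up to `depth` levels."""
    words = re.findall(r"[a-z']+", text.lower())
    tree = {}
    for i, word in enumerate(words):
        if word != root:
            continue
        node = tree
        for follower in words[i + 1 : i + 1 + depth]:
            # Each node keeps a visit count (for font scaling) and its children
            child = node.setdefault(follower, {"count": 0, "children": {}})
            child["count"] += 1
            node = child["children"]
    return tree

tree = build_word_tree("We will win. We will fight. We can win.", "we", depth=2)
print(tree["will"]["count"], sorted(tree["will"]["children"]))
```

The count on each node is what would drive the font scaling, and following the children from node to node is the "path" a user clicks through. This version ignores sentence boundaries, so deeper branches can run on into the next sentence.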
Document Arc Diagrams
Another winner from the Neoformix team is the Document Arc Diagram.
Hovering over a text fragment highlights all the related text fragments, or ‘document arcs’. This makes the relations visible at a glance when one sentence is linked to many contexts or related words.
The site includes a handy generator so that anyone can build a custom interactive diagram.
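To get a feel for the data involved, here is a rough sketch that links sentences sharing at least one non-trivial word. The linking rule, the tiny stop word list and the name document_arcs are all assumptions made for illustration; the Neoformix generator may relate fragments quite differently.

```python
import re
from itertools import combinations

# A deliberately tiny stop word list, for illustration only
STOP = {"the", "a", "an", "of", "and", "to", "in", "is", "it"}

def document_arcs(text):
    """Return (sentences, arcs); each arc is (index_i, index_j, shared_words)."""
    sentences = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]
    keywords = [set(re.findall(r"[a-z]+", s.lower())) - STOP for s in sentences]
    arcs = []
    for i, j in combinations(range(len(sentences)), 2):
        shared = keywords[i] & keywords[j]
        if shared:
            arcs.append((i, j, sorted(shared)))
    return sentences, arcs

text = "The budget is tight. Iraq dominates the news. The budget covers Iraq."
sentences, arcs = document_arcs(text)
print(arcs)
```

Each arc could then be drawn as a curve between the two sentence positions, and a hover handler only needs to look up the arcs that contain the hovered index.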
Large Corpus Techniques
Transcript Analysis
The New York Times offered this visualization after one of the 2008 Democratic primary debates.
Annotated speech blocks showed the frequency of custom words, and highlighted occurrences and context.
Although this required a highly annotated text, it allowed an unparalleled depth of searching and comparison within the text.
The biggest challenge to recreating this for other purposes would be the large amount of document tagging required (presumably by hand), although certain transcript type data could probably be automatically tagged and fed into a similar system.
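As an example of the kind of automatic tagging that might work, a transcript in which each line starts with an upper-case speaker label can be split and counted per speaker. The "NAME: text" format and the function name word_use_by_speaker are assumptions on my part, and this says nothing about how the New York Times built theirs.

```python
import re
from collections import defaultdict

def word_use_by_speaker(transcript, term):
    """Count how often each speaker uses `term` (case-insensitive)."""
    counts = defaultdict(int)
    pattern = re.compile(rf"\b{re.escape(term)}\b", re.IGNORECASE)
    for line in transcript.splitlines():
        # Assumed format: an upper-case speaker label, a colon, then the speech
        match = re.match(r"^([A-Z][A-Z .'-]*):\s*(.*)$", line)
        if not match:
            continue  # lines without a speaker label are skipped
        speaker, speech = match.groups()
        counts[speaker] += len(pattern.findall(speech))
    return dict(counts)

transcript = """CANDIDATE A: Health care matters. Health care is key.
CANDIDATE B: We need change and health care.
CANDIDATE A: Thank you."""
print(word_use_by_speaker(transcript, "health care"))
```

Transcripts where a speech wraps onto unlabeled continuation lines would need those lines attributed to the previous speaker instead of being skipped.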
Directed Sentence Diagrams
The final technique is a bit unconventional, but Directed Sentence Diagrams (again from Neoformix) are designed to show the topic ‘flow’ in a body of work via color and cartesian length.
While sentence drawings aren’t new, the idea of ‘directing’ them as well as color coding them to show sentence length and topic imparts much more information into the space.
The sparse filling and line lengths give a great overall picture of the percentage of the document (or speaking time) given over to each topic, and a careful analysis allows the reader to follow the outline of an argument’s points or the meandering of a story’s subjects.
Perhaps even more so than the other diagrams, this one requires a well-annotated text, since every sentence must have metadata on its subject, but for certain sample sets (particularly persuasive speeches or editorial-style arguments) the technique is particularly effective.
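Given sentences that have already been tagged with a topic, the geometry is easy to sketch. In the version below each sentence becomes a line segment whose length is its word count and whose direction is fixed per topic; that mapping, the angles dictionary and both function names are my own assumptions rather than a description of the Neoformix drawings.

```python
import math
from collections import Counter

def sentence_path(tagged_sentences, angles):
    """Return line segments (x0, y0, x1, y1, topic), one per sentence."""
    x = y = 0.0
    segments = []
    for sentence, topic in tagged_sentences:
        length = len(sentence.split())       # segment length = word count
        theta = math.radians(angles[topic])  # direction is assumed per topic
        nx = x + length * math.cos(theta)
        ny = y + length * math.sin(theta)
        segments.append((x, y, nx, ny, topic))
        x, y = nx, ny
    return segments

def topic_share(tagged_sentences):
    """Fraction of all words spent on each topic."""
    words = Counter()
    for sentence, topic in tagged_sentences:
        words[topic] += len(sentence.split())
    total = sum(words.values()) or 1  # guard against an empty input
    return {topic: n / total for topic, n in words.items()}

tagged = [
    ("We must cut taxes now", "economy"),
    ("Bring the troops home", "war"),
    ("Jobs matter", "economy"),
]
print(sentence_path(tagged, {"economy": 0, "war": 90}))
print(topic_share(tagged))
```

The segments can be handed to any drawing library and colored by topic, while topic_share gives the "percentage of the document" figure directly.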
Formatting the Data
One of the things in common with all text visualization techniques is effective use of sizing and color. Gradients and gradual font sizes can show relative importance, while opposing colors and sharp contrast can highlight points of contention.
Other primary challenges for the designer include trimming the text down to its vital elements, and pre-processing the data to remove stop words or apply some form of stemming to ensure a clean final product. At bare minimum, removal of common function words such as ‘He’ and ‘The’ is generally necessary for any word prevalence visualization.
Automated text processing is often critical with larger bodies of text, whether through simple techniques such as applying PHP’s str_replace function to stop words, or more advanced methods such as stemming with Python’s Natural Language Toolkit.
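A bare-bones version of that clean-up step might look like the following. The stop word list is intentionally tiny, and naive_stem is a crude suffix stripper I wrote as a stand-in so the example has no dependencies; for real work a proper stemmer such as NLTK's PorterStemmer would be a better fit.

```python
import re

# Illustrative only: real stop word lists run to a few hundred entries
STOP_WORDS = {"a", "an", "and", "he", "she", "it", "of", "the", "to", "in", "is"}
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word):
    """Strip one common English suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lower-case, drop stop words, and stem what remains."""
    words = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(w) for w in words if w not in STOP_WORDS]

print(preprocess("He visited the markets and is visiting a market"))
```

The output of preprocess can be fed straight into a word counter, so that "visited" and "visiting" land in the same bucket instead of splitting their weight.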
















August 21st, 2008 at 10:19 am
A nice review of the state of the art — I’m a big fan of the Neoformix visualizations. And, well, I’m working for the summer with the IBM VCL (the Many Eyes group) so I’m a big fan of them too.
One thing I noticed in your post is that you call a single presidential debate transcript a “large corpus”. To me, this is just another long, stand-alone document, like a long book. The techniques they employed could easily be used on something like a play, where the speakers are already clearly noted. What we haven’t seen a good visualization for yet are really large corpora: think all the patents ever filed, or every NY Times article for a 100-year period. There have been interesting attempts (my personal favourite is MIT’s Gist Icons http://www.media.mit.edu/cogmac/publications/IEEEIcons.pdf) but this is still an exciting area with lots of opportunity for contributions.
August 21st, 2008 at 12:23 pm
[...] am a huge fan of data visualization. Here is a nice overview of some interesting textual visualization techniques. Other than tag clouds this [...]
August 21st, 2008 at 2:48 pm
[...] August 21, 2008, 1:48 pm Filed under: 1 Yesterday, Tim Showers posted an interesting piece called “Visualization Strategies: Text & Documents” which features numerous techniques for visualizing text from [...]
August 21st, 2008 at 4:08 pm
There is so much exciting work in visualization going on right now, thanks for this great collection.
August 21st, 2008 at 7:11 pm
Thanks Tim! You may also be interested to check out our TouchGraph Navigator, http://www.touchgraph.com/navigator.html The page links to a demo that lets you browse a network of word associations. The Navigator is not specifically a text analysis tool, but it can be used to visualize textual data, as well as relations between tags.
August 21st, 2008 at 8:30 pm
You’re absolutely right about ‘large corpus’ being a relative term Chris!
I think that in the information visualization space we’ve seen a fair amount of work towards mapping relative change (Things like heat maps and time series flows) and also a focus on visualizing network connections (possibly because of the ease of representing them as node-graphs), but in many other areas, the state of the art isn’t much past Tufte and what you can do with Excel.
I’m really looking forward to where the space can go with
1) More generalized data, especially textual.
Things like your examples of patents or massive numbers of internet message board threads. I feel like this has been bogged down by the massive amount of metadata required to process these things, but with the open API/machine readability movement, this is changing fast.
2) Processing and dealing with multi-level data. I haven’t seen much innovation outside of zoomable treemaps and multilevel pie/bar charts, but I think that the state of the art will catch up quickly here too.
The response to this article has been so great that I’m planning to do a whole series examining the visualization space for various verticals in the near future, so keep your eyes peeled!
August 21st, 2008 at 8:51 pm
There have been many attempts at visualizing message board threads but I think the research community has moved on a bit thinking about web 2.0 data issues as threaded conversations (like usenet) become less common. Some of the older examples are great though, like Newsgroup Crowds and AuthorLines (http://alumni.media.mit.edu/~fviegas/authorlines/index.html) (yes, I’m biased, see my previous disclosure re where I work). There is also a wealth of work for multi-level data. We (the InfoVis community) have a technique called “semantic zoom” which varies the layout and even the type of vis based on the zoom level in the data. I look forward to seeing what you find for future articles, I’m sure there will be some surprises.
Those who are interested could also browse the excellent archives at infosthetics.com and visualcomplexity.com
August 22nd, 2008 at 8:07 am
[...] a site that displays various types of text visualizations. My favorite is the literary organism map, at least in terms of abstract beauty. I’d be [...]
August 22nd, 2008 at 9:03 am
[...] Visualizing text is a difficult thing to do. More so when you have 2 texts and need to compare. Document contrast diagrams seem like an interesting way to explore two large pieces of text to see how they space out. The state of union address for 2007 and 2008 are shown above, 2007 is on left. [via Tim Showers] [...]
August 22nd, 2008 at 11:37 pm
These are all really cool. I’m a big fan of presenting interesting aggregate information about data visually - it can help identify patterns in a large block of noise. For example, here’s a word cloud of Paul Graham’s essays: http://modos.org/pg.png You can see the prominence of each of the words in the Grahamism “make something people want”, indicating his commitment to that philosophy.
October 3rd, 2008 at 10:04 pm
I never realized the variety of visualization techniques that existed. Thanks for the article!
October 29th, 2008 at 1:18 pm
You may have come across http://vizlab.nytimes.com/
October 31st, 2008 at 2:11 pm
[...] Tim Showers did a blog post not too long ago called “Visualization Strategies: Text & Documents” where he introduced me to a Wordle. A Wordle is also embedded above from the NY Times [...]