
Thursday, December 22, 2011

Forecast: Spring Clouds and SOPA.


I've been thinking a lot about how to marry the visual appeal and density of representation you get with Word Clouds such as the Wordle and Eclipse Zest Cloudio with the relational information conveyed by graphs. I discussed this a bit in a Recent Post showing Clouds created using Butterflyzer depicting thousands of Tweets from the Arab Spring.

I thought that effort was pretty cool, but the limitations were really obvious. Not coincidentally -- clouds are in the air after all -- Jacob Harris, the guy responsible for many of the New York Times' brilliant infographics (hate the word, but it is descriptive) posted a scathing but quite legitimate critique of word clouds, ending with the line "Every time I see a word cloud presented as insight, I die a little inside." I like this guy's style! Though I think Jacob doth over-harsh a bit. Word clouds have their place. My take is that they are great for depicting the madness of crowds but lousy at deciphering it. As I said in my original post:

"One of the downsides of this approach is that unlike in our graph visualizations, you lose concept proximity. So for example, looking at the cloud you can instantly see that Tunisia and Lebanon are frequently mentioned, but what you can't see is how frequently they are used together. That's what our graph visualization tools do now, and I think there are some neat possibilities for marrying the two approaches."
After I posted this, a couple of my Eclipse compatriots -- Ed MacKerrow and Fabian Steeg -- shared that they'd been thinking along similar lines. I've been playing with this over the last few weeks (probably too much, given that no one has given any indication that it's a feature they're willing to pay for) and I've come up with an initial implementation in Butterflyzer. As always, there are trade-offs and no visualization is perfect, but I'm pretty happy with the results.


A SOPA Opera 


Let's use the latest attempt (editorial alert) by the greedy and small-minded to destroy the Internet as our subject. A search on Tweets containing "SOPA" returns a sample of 10,000 or so tweets from the last week. This is what a Butterflyzer-produced "Spring Cloud" looks like:





You might notice something right away. See the group of Spanish words to the bottom left? They probably don't have anything to do with the kind of SOPA we're interested in. (Though reading the related Tweets made me hungry.) In a Word Cloud we'd never be able to separate them out of the rest of the clutter, but using the Spring Clouds and Butterflyzer we can eliminate those irrelevant nodes -- along with all of their tweets -- with a few mouse clicks. Hey, we could call that "Cluster Analysis for the Rest of Us"™. (Lame Apple reference numero uno.) Now we can see just the stuff we care about:





Right away we can see how things are associated with each other. Sometimes the associations are obvious; for example when we see "vote" and "house" (bottom middle, in white) together. That's useful, and sometimes you can even pick the content up just by looking at the tags, as in "unconstitutional", "stanford" and "review". But many of the relationships are much more interesting. For example, notice that NDAA comes up quite a bit. When it does come up, it seems to be associated with "@barrackobama", "obama" and also, interestingly, "#OWS". That gives real insight into the meaning and relevance of the tweet terms, not simply their existence.

We can infer more subtle information than that from the diagram; for example, that the relationship between Obama, NDAA and SOPA in the Twitter discourse is organic -- that is, probably not reflecting an orchestrated Tweet campaign, a few viral Tweets or, say, Fox News. (If you're looking for those guys, you'll like the video below.) Instead, the use of multiple terms for Obama might indicate that people are associating the president and these two legislative (editorial alert) attacks on the Bill of Rights without any help from, say, the mainstream media. And in fact, if we look at the individual tweets, that's exactly what we do see. In the following list we've eliminated the RTs, but the RT ratio to original content is also quite light. If we Focus on those Terms, this is what we see:




Now, if you're someone from the White House or DNC evaluating the potential impact of SOPA on a particular community, it could be useful to be able to get to that kind of information quickly. That's the kind of thing that Butterflyzer was built to do.

The Details


So, how are we creating these Spring Clouds? No secrets or bogus patent claims here. (Perhaps you or someone you know has discovered this solution already; if so let me know and I'd love to add a reference to your work here.) I'll outline the basic approach below...it's really pretty simple. If you're a visualization geek, try to figure out how this was done before reading further. (The name is already a pretty big hint.) If you're not a visualization geek, then good news -- you don't have to figure out any of this. Butterflyzer handles all of the details, and you can set it all up with a few mouse clicks to customize the graph view. See the section below for how.

Preparing the data:
  1. All of the Tweets are collected using Butterflyzer's automated search tools.
  2. Butterflyzer identifies all Terms (words) that appear in any Tweet and indexes the Tweets against those.
  3. We then drop: some common English terms; terms with occurrences below some threshold (say 250); and the original search words and any other really common connections.
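
For the code-minded, here is a minimal Python sketch of those three preparation steps. It is not Butterflyzer's code: the tokenizer regex, the tiny `STOP_WORDS` set, and the function names are all made up for illustration, and I'm assuming an "occurrence" means one Tweet containing the term.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list; the real list of dropped common terms is not shown here.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "on", "for"}

def index_terms(tweets):
    """Map each term to the set of tweet ids (list positions) containing it."""
    index = defaultdict(set)
    for tweet_id, text in enumerate(tweets):
        # Keep a leading # or @ so hashtags and mentions stay distinct terms.
        for term in re.findall(r"[#@]?\w+", text.lower()):
            index[term].add(tweet_id)
    return index

def filter_terms(index, search_words, min_count):
    """Drop stop words, the original search words, and rare terms."""
    excluded = STOP_WORDS | {word.lower() for word in search_words}
    return {
        term: ids
        for term, ids in index.items()
        if term not in excluded and len(ids) >= min_count
    }

tweets = [
    "SOPA vote in the House",
    "House vote on SOPA delayed",
    "sopa de tortilla",
]
kept = filter_terms(index_terms(tweets), search_words=["SOPA"], min_count=2)
print(sorted(kept))  # ['house', 'vote']
```

With a real sample you would raise `min_count` to something like the 250 mentioned above; the toy threshold of 2 just keeps the example small.
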
The visualization algorithm:
  1. Create nodes representing a) Terms and b) Tweet Groups. Tweets are grouped together whenever they have exactly the same set of edges. Edges connect the terms to any Tweets containing those terms.  Butterflyzer does all of this automatically based on your specifications.
  2. Lay out the graph using a Spring Layout Algorithm. We use the Eclipse Zest project's Spring Layout Algorithm written by Ian Bull and Casey Best, heavily hacked to provide things like continuous updating.
  3. The Trick: We then hide the Tweet Groups and their connecting edges.
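
The grouping and hiding steps above can be sketched roughly as follows. Again, this is an illustrative Python sketch and not the Butterflyzer implementation (which is Java on top of Zest); `build_spring_cloud_graph` and the `group-N` node names are hypothetical, and the input is assumed to be a term-to-tweet-ids index.

```python
from collections import defaultdict

def build_spring_cloud_graph(term_index):
    """Return (term_nodes, group_sizes, edges) for a term/tweet-group graph.

    term_index maps each kept term to the set of tweet ids containing it.
    """
    # Invert the index: for every tweet, which kept terms does it contain?
    terms_by_tweet = defaultdict(set)
    for term, tweet_ids in term_index.items():
        for tweet_id in tweet_ids:
            terms_by_tweet[tweet_id].add(term)

    # Tweets with exactly the same term set (the same edges) share one group node.
    groups = defaultdict(list)
    for tweet_id, terms in terms_by_tweet.items():
        groups[frozenset(terms)].append(tweet_id)

    term_nodes = sorted(term_index)
    group_sizes = {}
    edges = []
    # Sort groups by their term lists so the group numbering is deterministic.
    ordered = sorted(groups.items(), key=lambda item: sorted(item[0]))
    for number, (terms, members) in enumerate(ordered):
        group = f"group-{number}"
        group_sizes[group] = len(members)
        edges.extend((term, group) for term in sorted(terms))
    return term_nodes, group_sizes, edges

term_index = {"vote": {0, 1, 2}, "house": {0, 1}, "ndaa": {3}, "obama": {3}}
term_nodes, group_sizes, edges = build_spring_cloud_graph(term_index)

layout_nodes = term_nodes + list(group_sizes)  # what the layout algorithm "sees"
visible_nodes = term_nodes                     # the trick: draw only the terms
print(group_sizes)  # {'group-0': 2, 'group-1': 1, 'group-2': 1}
```

The point of keeping `layout_nodes` and `visible_nodes` separate is that the hidden Tweet Groups still pull related terms together during layout, even though they are never drawn.
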

So, below we can see the graph as it "looks" to the layout algorithm.


And again, here's the same thing with just the Terms:


Now, let's remove the Terms and edges instead of the Tweets:



Each circle represents a group of tweets, and the area of the circle is proportional to the number of Tweets in the group, just as you'd expect. As you can see, we're using a lot of data and a lot of computation for what looks like a pretty simple word cloud.
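
One small gotcha with "area proportional to count": it's the area, not the radius, that scales with the number of Tweets, so the radius has to grow with the square root. A quick Python sketch (the `area_per_tweet` scale factor is a made-up parameter, not a Butterflyzer setting):

```python
import math

def circle_radius(tweet_count, area_per_tweet=1.0):
    """Radius of a circle whose area is tweet_count * area_per_tweet."""
    return math.sqrt(tweet_count * area_per_tweet / math.pi)

small = circle_radius(100)
large = circle_radius(400)
# Four times the tweets gives four times the area, but only twice the radius.
print(large / small)
```
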

Curious about what those Tweets actually say? Again, that's easy to find out. Just mouse over one of the nodes (not here, silly -- you'll have to download Butterflyzer first, see below) and you can read them, in all their Twittish glory.




So again, this thing might actually be useful, right? Suppose you're a Legislative Assistant for one of the senators who will be deciding whether to vote this (editorial alert) heinous bill out of committee. You might want to dig into what folks that are interested in a particular aspect of the legislation -- such as copyright -- are actually talking about (and perhaps whether your boss or any of his or her campaign donors come up in any of them). You can quickly scan through the cloud and associated terms, and even identify interest clusters.

Limitations


As I said above, all visualizations have limitations. They're on a tradeoff landscape like everything else. Here, the tradeoff vs. Clouds is information density. Because a Cloud diagram doesn't care where words go it can pack things wherever they fit best. We can improve the Spring Cloud layout from where it is now, but there is a natural limit. Also, the Spring Layout algorithm we're using isn't very good at handling cases of overlapping nodes. Sometimes that's cool, but sometimes it's just messy. And since SWT doesn't support alpha for text (though it does a great job for shapes, as you can see above), the text can look crowded and be difficult to read.

Another issue is that the underlying spring algorithm is O(n^2). That's geek speak for scary and ugly. As the number of nodes increases -- and as you can see from the examples above, it definitely will increase -- the cost of laying out the graph grows with the square of the node count, so doubling the nodes roughly quadruples the work. There are techniques to mitigate that, but computational costs are a limiting factor, especially for a tool like Butterflyzer that needs to do things like respond to user input in a timely way. We've spent a lot of time trying to tune the UI to be responsive while still putting enough cycles toward the layout, but there is definite room for improvement there.
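
To make the quadratic cost concrete, here's a naive spring layout pass in Python that counts how many node pairs it has to visit. This is a textbook-style sketch, not the Zest algorithm we actually use; the force constants and the `spring_step` name are assumptions chosen only for illustration.

```python
import math
import random

def spring_step(positions, edges, repulsion=0.01, attraction=0.1, step=0.05):
    """Advance a naive spring layout by one pass; return the pair count visited."""
    forces = {node: [0.0, 0.0] for node in positions}
    nodes = list(positions)
    pairs = 0
    # Repulsion: every node pushes every other node away. This is the O(n^2) part.
    for i, a in enumerate(nodes):
        for b in nodes[i + 1:]:
            pairs += 1
            dx = positions[a][0] - positions[b][0]
            dy = positions[a][1] - positions[b][1]
            dist = math.hypot(dx, dy) or 1e-9  # avoid dividing by zero
            push = repulsion / (dist * dist)
            forces[a][0] += push * dx / dist
            forces[a][1] += push * dy / dist
            forces[b][0] -= push * dx / dist
            forces[b][1] -= push * dy / dist
    # Attraction: each edge pulls its two endpoints together like a spring.
    for a, b in edges:
        dx = positions[b][0] - positions[a][0]
        dy = positions[b][1] - positions[a][1]
        forces[a][0] += attraction * dx
        forces[a][1] += attraction * dy
        forces[b][0] -= attraction * dx
        forces[b][1] -= attraction * dy
    # Move every node a small step along its net force.
    for node, (fx, fy) in forces.items():
        positions[node][0] += step * fx
        positions[node][1] += step * fy
    return pairs

rng = random.Random(0)
for n in (100, 200):
    positions = {i: [rng.random(), rng.random()] for i in range(n)}
    print(n, spring_step(positions, edges=[]))
# prints "100 4950" then "200 19900"
```

Doubling the nodes from 100 to 200 takes the pair count from 4,950 to 19,900 per pass, and a layout needs many passes to settle. That's the cost the UI tuning has to work around.
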

Finally, compared to Word Clouds, Spring Clouds just aren't that sexy looking. Because we can't just put words wherever we want, things are unbalanced and lack symmetry. We can improve that, but again, only within limits. If truth and beauty have a love-hate relationship, sometimes you're just going to have to pick sides.

Try It!


Want to try it for yourself? Here's what you need to do:

  1. Download and install the Butterflyzer Beta.
  2. To do it yourself:
    1. Do a search on SOPA (or whatever you like) by typing it into the search field.
    2. Click the "Tweet Terms" node under "Collections".
    3. You might want to do some clean up as described above and set up your views to show only the graphs and outline. See the extensive Butterflyzer docs for more on that.
    4. From the Filter Menu, select "Cliques".
    5. From the "Draw Options" menu, Unselect (by clicking) "Show Items", "Show Items Text", "Show Icon", "Scale Count" and "Scale Topsy". Select "Scale Related".
  3. Or just download SOPA Example and open it from within Butterflyzer.
Have fun, and I look forward to seeing your own creations!

Sex, Lies and Spring Cloud Video


Oh, and one more thing... (Lame Apple reference number two, if you're counting.) A nice aspect of the Spring Layout in general and the Spring Cloud in particular is that it is a continuous algorithm, which means that the visualization improves over time, but also that you can add and remove information from the graph without forcing a new layout. This means that we can actually animate the Spring Cloud over time, as I've done in the movie below. Here we're looking at the evolution of Tweets about the Republican Primaries over a span of a week.





There aren't docs on how to do this part yet, but if you want to experiment with this new feature, just select "Select Time Span" from the "Filter" Toolbar Menu or the "View" Menu.
