December 30, 2012

An introduction to data visualization

I've recently discovered something called Infographics, which could be described as the art of explaining something difficult in a picture. This something difficult could be a lot of numbers, or an event such as the sinking of the Titanic. The goal of an Infographic is that someone without any knowledge of the subject should immediately get what the picture is about. If you need an example, you should visit which is an Internet service where you can show the finished Infographics you've made.

So, how do you create an Infographic? The reason I'm asking is that there's a competition going on at Kaggle where the goal is to visualize the school system in Colorado, and I'm participating in that competition. I've never made an Infographic before, so this will be a summary of how to make one. To learn how, I've decided to read some books and watch some online tutorials on YouTube.

The best book I've found on the subject is Envisioning Information by Edward Tufte, and it was recommended to me by Tim Ferriss in one of the Random Show episodes. Edward Tufte is a professor at Yale University and has written several books on the same subject. Envisioning Information is quite a small book, around 130 pages with many pictures, and it explains how to represent a rich visual world on "flatland", where flatland is something flat such as paper, but could also be a memorial or a similar structure. One example is how to represent a subway system on a map so that it is fast and easy to understand even if you have never entered the subway before. A tourist in London should immediately understand, without any confusion, how to travel from the hotel to Madame Tussauds.
Edward Tufte explains that to envision information, you should work in the intersection of image, word, number, and art, using visual principles that tell us how to put the right mark in the right place. Here are some of the visual principles from the book:
  • Avoid "chartjunk." Chartjunk is the art of decorating a chart with "fluff", such as unneeded pictures or dark grid lines, to make it look more interesting, but the chart will also be less credible to the spectators. Decorations are never needed, and if the numbers are boring, then you've got the wrong numbers. You can still use techniques such as colors, typography, layout, and similar, as long as you avoid unneeded junk.  
  • Respect the audience. Consumers of graphics are often more intelligent about the information at hand than those who fabricate the data decoration. The audience may be busy, but they are alert and caring - not stupid. 
  • To clarify, add detail. Thin data may lead to suspicions: "What are they leaving out? What are they hiding?"
  • Clutter and confusion are failures of design. It is not how much information there is, but rather how effectively it is arranged. 
  • Use a panorama. A panorama delivers to viewers the freedom of choice that derives from an overview, a capacity to compare and sort through detail. When appropriate, you can combine a panorama with a more 2-dimensional picture.
  • 1 + 1 = 3. White space is something. Add more shapes, and thus spaces between the shapes, and the amount of noise will increase exponentially. On white backgrounds, a varying range of lighter colors on the shapes will minimize the clutter.   
  • Avoid color damage. Pure, bright or very strong colors should be used sparingly on or between dull background tones. Light, bright colors mixed with white should not be placed next to each other. Color spots against a light gray or muted field highlight and italicize data, and also help to achieve an overall harmony. Use colors found in nature to represent and illuminate information since these colors are familiar to the human eye. Gray is regarded in painting as one of the prettiest, most important and most versatile of colors.

To create an Infographic, you will need some kind of software, and I believe that Illustrator is the software most commonly used by designers of Infographics. One good, and free, replacement for Illustrator is Inkscape.
The YouTube tutorials I've found on the subject are:

December 29, 2012

Books I read in 2012

This year, I've read the following books:
  1. 100 things every designer needs to know about people
  2. A long way gone
  3. Aldrig fucka upp
  4. Coders at work
  5. Crossing the chasm
  6. Den som dödar draken
  7. Don't make me think
  8. Envisioning information
  9. Escape from Camp 14
  10. From Beirut to Jerusalem
  11. From dictatorship to democracy
  12. Glädjedödarna
  13. Great by choice
  14. Handelsmännen
  15. High performance web sites
  16. Historien om IKEA
  17. How to think like a computer scientist
  18. Insanely simple
  19. Jag vill förändra världen
  20. Kon-Tiki
  21. Krigare
  22. Lägg ut
  23. Mindhunter
  24. Minecraft: block, pixlar och att göra sig en hacka
  25. Mining of massive datasets
  26. Moments of truth
  27. No easy day
  28. On writing well
  29. PHP in action
  30. Programming collective intelligence
  31. Pulitzer
  32. Selling in a new marketspace
  33. Skjut inte på journalisten
  34. Steve Jobs
  35. Stenbeck
  36. Tesla: man out of time
  37. The design of everyday things
  38. The lean startup
  39. The miracle of mindfulness
  40. The numbers behind Numb3rs
  41. The publisher
  42. The thank you economy
  43. Total recall
  44. Trust me, I'm lying
Let's hope that I didn't waste any time!

December 27, 2012

The Traveling Santa Problem

I've participated in the "Traveling Santa Problem" contest over at Kaggle, and the goal of that contest was closely connected to the Traveling Salesman Problem (TSP). The idea behind the TSP is to help a salesman find the shortest route through a number of cities, where the salesman may visit each city only once. Solutions to the TSP are used in real life as well, often in different areas of logistics, but also when manufacturing circuit boards.

This is a plot showing the 150,000 different cities (or chimneys in this case):
The easiest way to find a solution is to begin at one random point, find the closest unvisited point to it using Euclidean distance, move there, and repeat until every point has been visited. This is the plot when using that method, using only 5000 chimneys:
As you can see in the plot above, the method using the least distance will miss some points, and in the end, Santa will have to travel back to chimneys very close to chimneys that he has already visited. To solve this problem, you can use a Hilbert curve combined with the least distance method. A Hilbert curve will "snake" its way through all the chimneys, and at each Hilbert-point you can calculate the shortest path through that box. The plot will then look like this using all of the 150,000 chimneys:
It took about 30 minutes to generate this path and the distance traveled was 7,646,647. A random solution would generate a distance of 1,290,678,097. I then tried a Simulated Annealing algorithm to improve the path, without much success. The problem is that we have 150,000 chimneys, and it takes too long to generate a better solution.
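
To make the idea of combining a Hilbert curve with the least distance method more concrete, here is a minimal sketch in Python. It is not the code I used in the contest, just an illustration under some assumptions: the chimneys are a list of (x, y) tuples, the area is split into a grid of 2**order by 2**order boxes, and the names hilbert_index, hilbert_greedy_tour and path_length are made up for this example. The boxes are visited in Hilbert order, and inside each box the path always jumps to the closest unvisited chimney.

```python
import math
import random
from collections import defaultdict


def dist(a, b):
    """Euclidean distance between two (x, y) points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def path_length(points, tour):
    """Length of the open path that visits points in the order given by tour."""
    return sum(dist(points[a], points[b]) for a, b in zip(tour, tour[1:]))


def hilbert_index(order, x, y):
    """Position of grid cell (x, y) along a Hilbert curve on a 2**order grid."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the right orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def hilbert_greedy_tour(points, order=3):
    """Visit boxes in Hilbert order; inside a box, hop to the nearest unvisited point.

    With order=0 there is a single box, so this falls back to the plain
    least distance method, starting at points[0].
    """
    n = 1 << order
    min_x = min(p[0] for p in points)
    min_y = min(p[1] for p in points)
    span_x = (max(p[0] for p in points) - min_x) or 1.0
    span_y = (max(p[1] for p in points) - min_y) or 1.0

    # Put every point index into the grid box that contains the point.
    boxes = defaultdict(list)
    for i, p in enumerate(points):
        cx = min(n - 1, int((p[0] - min_x) / span_x * n))
        cy = min(n - 1, int((p[1] - min_y) / span_y * n))
        boxes[(cx, cy)].append(i)

    tour = []
    for box in sorted(boxes, key=lambda c: hilbert_index(order, c[0], c[1])):
        remaining = boxes[box]
        while remaining:
            if tour:
                last = points[tour[-1]]
                nxt = min(remaining, key=lambda i: dist(points[i], last))
            else:
                nxt = remaining[0]
            remaining.remove(nxt)
            tour.append(nxt)
    return tour


if __name__ == "__main__":
    rng = random.Random(42)
    chimneys = [(rng.random(), rng.random()) for _ in range(2000)]
    for order in (0, 2, 4):
        tour = hilbert_greedy_tour(chimneys, order)
        print(order, round(path_length(chimneys, tour), 2))
```

The point of the boxes is that the search for the closest chimney only has to look inside one small box at a time, which should be much cheaper than searching among all remaining chimneys, and it keeps the path from leaving stray chimneys behind in areas it has already passed.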

If you are interested in the Python code used, you can find it here: github
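
For readers who haven't seen Simulated Annealing before, here is a small, self-contained sketch of how it can be applied to a path like the one above. This is not the code from the repository, only an illustration with assumed details: the move is a reversal of a random segment of the path, the temperature falls geometrically from t_start to t_end, and the function name anneal and all parameter values are made up for this example.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two (x, y) points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def path_length(points, tour):
    """Length of the open path that visits points in the order given by tour."""
    return sum(dist(points[a], points[b]) for a, b in zip(tour, tour[1:]))


def anneal(points, tour, steps=20000, t_start=1.0, t_end=1e-3, seed=0):
    """Try to shorten an open path by reversing random segments.

    Worse paths are sometimes accepted, with a probability that shrinks
    as the temperature falls. Returns (best_tour, best_length).
    """
    rng = random.Random(seed)
    tour = list(tour)
    n = len(tour)
    current = path_length(points, tour)
    best, best_tour = current, list(tour)
    for step in range(steps):
        # Geometric cooling schedule from t_start down towards t_end.
        t = t_start * (t_end / t_start) ** (step / steps)
        i, j = sorted(rng.sample(range(n), 2))
        # Reversing tour[i..j] only changes the two edges at the segment ends.
        delta = 0.0
        if i > 0:
            delta += dist(points[tour[i - 1]], points[tour[j]])
            delta -= dist(points[tour[i - 1]], points[tour[i]])
        if j < n - 1:
            delta += dist(points[tour[i]], points[tour[j + 1]])
            delta -= dist(points[tour[j]], points[tour[j + 1]])
        if delta < 0 or rng.random() < math.exp(-delta / t):
            tour[i:j + 1] = reversed(tour[i:j + 1])
            current += delta
            if current < best:
                best, best_tour = current, list(tour)
    return best_tour, best


if __name__ == "__main__":
    rng = random.Random(7)
    chimneys = [(rng.random(), rng.random()) for _ in range(200)]
    start = list(range(len(chimneys)))
    improved, length = anneal(chimneys, start)
    print(round(path_length(chimneys, start), 2), round(length, 2))
```

Each step touches only one random segment, so with 150,000 chimneys a very large number of steps would be needed before the path as a whole gets noticeably shorter, which fits with what I saw in the contest.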