January 30, 2013

Programming Collective Intelligence Chapter 8 Optimization

I've been reading Programming Collective Intelligence by Toby Segaran. The book gives you a good introduction to the art of machine learning and data mining. But if you are reading Chapter 8 on optimization, you may have found an error in the code in the section labelled Simulated Annealing (pages 95-96), where the probability is calculated.

This is the code in the book:
p = pow(math.e,(-eb-ea)/T)

But according to the unconfirmed errata and Wikipedia, the code should be:
p = pow(math.e,-(eb-ea)/T)
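To see how much the misplaced minus sign matters, here is a small sketch that evaluates both versions side by side. The function names and the values for ea, eb and T are made up for illustration; they are not from the book.

```python
import math

def book_probability(ea, eb, T):
    # Version printed in the book: the minus sign applies to eb only
    return pow(math.e, (-eb - ea) / T)

def corrected_probability(ea, eb, T):
    # Version from the errata: the minus sign applies to the cost difference
    return pow(math.e, -(eb - ea) / T)

# Illustrative values only: the new cost eb is slightly worse than ea
ea, eb, T = 100.0, 110.0, 10.0
p_book = book_probability(ea, eb, T)        # exponent is (-110 - 100) / 10 = -21
p_fixed = corrected_probability(ea, eb, T)  # exponent is -(110 - 100) / 10 = -1
print(p_book, p_fixed)
```

With the book's version the probability depends on the sum of the two costs, so a slightly worse solution is almost never accepted. With the corrected version it depends only on how much worse the new solution is.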

If you change the code and run the optimization program, you will soon discover that the new code gives you a nice "OverflowError: math range error". When the new cost eb is much lower than the current cost ea and the temperature T has become small, -(eb-ea)/T turns into a large positive number, and the result no longer fits in a float. To solve this, you can change the code to this:

import decimal
import math

# Decimal has a much larger exponent range than float, so this does not overflow
exponent = -(eb-ea)/T
e = decimal.Decimal(math.e)
p = e**decimal.Decimal(exponent)
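Another way to avoid the overflow, sketched here as an alternative and not taken from the book or the errata, is to skip the exponential whenever the new solution is at least as good as the current one. In that case the solution is accepted anyway, and a positive exponent is the only case that can overflow. The function name acceptance_probability is my own, and the sketch assumes T is greater than zero.

```python
import math

def acceptance_probability(ea, eb, T):
    """Probability of accepting a solution with cost eb when the current cost is ea.

    Assumes T > 0. This is an alternative sketch, not the code from the book.
    """
    if eb <= ea:
        # A solution that is at least as good is always accepted,
        # so the exponent is never positive and cannot overflow
        return 1.0
    # Here the exponent is negative; math.exp underflows to 0.0 instead of raising
    return math.exp(-(eb - ea) / T)

print(acceptance_probability(10.0, 11.0, 1.0))
```

In the annealing loop, the new solution would then be accepted when random.random() is below this value.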

January 18, 2013

How to import a large csv file into MySQL

I've been fooling around with big data, which in this case is defined as a 188 MB CSV file. You can't read the file each time you use it in your favorite programming language, since it takes too long to load, so what you can do is insert the data from the file into a MySQL database. You can't import it into MySQL the way you would with a smaller file, since your computer will probably run out of memory, but what you can do is write a short SQL statement, and the file will be imported in a matter of seconds. It took me some time to Google the answer, and hopefully this post will be faster to find. Anyway, here's the magic code:


IGNORE 1 LINES - skips the first line of the CSV file, which holds the column descriptions.
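For illustration, a statement of this kind can be put together in Python as in the sketch below. It assumes MySQL's LOAD DATA INFILE statement, which is where the IGNORE 1 LINES clause belongs. The function name, the file path and the table name my_table are hypothetical, and the separator and quoting clauses are assumptions about a typical comma-separated file. The table has to exist already, with columns that match the file.

```python
def build_load_statement(csv_path, table_name):
    """Return a MySQL LOAD DATA statement for a comma-separated file with a header row.

    csv_path and table_name are inserted as-is, so only pass trusted values.
    """
    return (
        "LOAD DATA LOCAL INFILE '{path}'\n"
        "INTO TABLE {table}\n"
        "FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'\n"
        "LINES TERMINATED BY '\\n'\n"
        "IGNORE 1 LINES;"
    ).format(path=csv_path, table=table_name)

# Hypothetical path and table name, for illustration only
statement = build_load_statement("/path/to/data.csv", "my_table")
print(statement)
```

The LOCAL keyword reads the file from the client machine and only works if loading local files is allowed on both the client and the server. Without LOCAL, the file is read on the server host.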

January 13, 2013

Visualize the State of Public Education in Colorado

I've just finished my entry for the "Visualize the State of Public Education in Colorado"-competition at Kaggle. The point of this competition is not to create an algorithm or predict something from some data, as in the Traveling Santa Problem, but to make the numbers easy to understand with the help of an Infographic.
The background to the competition is as follows. The US State of Colorado wants to improve its schools and has begun to grade each school with a letter ranging from A+ to F. They have now done this for three years, and would like to make some sense of all of these numbers, which are currently hidden in many spreadsheets. The easiest way to make numbers easy to understand is to make an Infographic.
To create this Infographic, I've been using Inkscape, which is a good, and free, replacement for the more expensive Illustrator. Most of the colors in the Infographic come from the State of Colorado flag. The code is written in Python and I've used matplotlib to make the graphs.
According to the book Freakonomics, you can find eight factors that are strongly correlated with school test scores:
  1. The child has highly educated parents
  2. The child's parents have high socioeconomic status
  3. The child's mother was thirty or older at the time of her first child's birth
  4. The child had low birthweight
  5. The child's parents speak English in the home
  6. The child is adopted
  7. The child's parents are involved in the PTA
  8. The child has many books in his home
Here it is possible to check number 2, "The child's parents have high socioeconomic status", by looking at the segment in the Infographic with the label "What about the poor students". You can clearly see that the percentage of poor students is low in schools with good grades. In the future, ColoradoSchoolGrades.com should maybe add all of the eight points above to their research.
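A comparison like the one in that segment can be sketched as follows: group the schools by letter grade and take the average share of poor students in each group. The field names grade and pct_poor are hypothetical, and the rows are made-up placeholders that only show the shape of the input. They are not competition data.

```python
from collections import defaultdict

def mean_poor_share_by_grade(schools):
    """Average percentage of poor students per letter grade."""
    totals = defaultdict(lambda: [0.0, 0])  # grade -> [sum of percentages, school count]
    for school in schools:
        bucket = totals[school["grade"]]
        bucket[0] += school["pct_poor"]
        bucket[1] += 1
    return {grade: total / count for grade, (total, count) in totals.items()}

# Made-up placeholder rows, NOT real data from the competition
example = [
    {"grade": "A", "pct_poor": 10.0},
    {"grade": "A", "pct_poor": 20.0},
    {"grade": "F", "pct_poor": 80.0},
]
result = mean_poor_share_by_grade(example)
print(result)
```

The resulting averages per grade could then be drawn as a bar chart.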