January 30, 2013

Programming Collective Intelligence Chapter 8 Optimization

I've been reading the book Programming Collective Intelligence by Toby Segaran. The book gives you a good introduction to the art of Machine Learning and Data Mining. But if you are reading Chapter 8 on optimization, you may have found an error in the code in the section labelled Simulated Annealing (pages 95-96), where the probability is calculated.

This is the code in the book:
p = pow(math.e,(-eb-ea)/T)

But according to the unconfirmed errata and Wikipedia, the code should be:
p = pow(math.e,-(eb-ea)/T)

If you change the code and run the optimization program, you will soon discover that the new code gives you a nice "OverflowError: math range error", because -(eb-ea)/T eventually becomes a very large number as the temperature T drops. To solve this, you can change the code to this:

import decimal
import math

# Decimal can hold results far too large for a float, so no OverflowError is raised
exponent = -(eb-ea)/T
e = decimal.Decimal(math.e)
p = e**decimal.Decimal(exponent)
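
Another way around the overflow, sketched here under my own assumptions rather than taken from the book: in simulated annealing a better solution is normally accepted outright, so the probability only has to be computed when the new cost eb is worse than the current cost ea. The exponent is then never positive, and math.exp cannot overflow. The function names acceptance_probability and accept are made up for this example.

```python
import math
import random


def acceptance_probability(ea, eb, T):
    """Probability of moving from cost ea to cost eb at temperature T."""
    if eb <= ea:
        # The new solution is at least as good: always accept it.
        return 1.0
    # Here eb > ea, so the exponent is negative and exp() stays in (0, 1).
    # For a very negative exponent, exp() simply returns 0.0.
    return math.exp(-(eb - ea) / T)


def accept(ea, eb, T):
    """Return True if the candidate solution should replace the current one."""
    return random.random() < acceptance_probability(ea, eb, T)
```

With this guard in place the result is an ordinary float between 0 and 1, so it can be compared with random.random() directly.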

January 27, 2013

Dow Jones Industrial Average 1897-2012

I'm currently working on a more detailed Infographic on the behavior of the stock market index Dow Jones Industrial Average (DJIA) during World War 2. It seems that Yahoo Finance has removed the option to download the data from their website. I googled it and found a strange explanation for why they removed it, but that explanation is not important here. So, to find the data, I went here: FRED Economic Data, and downloaded it. You can find daily closes from the middle of 1896 until today. I hadn't seen data this old on the index before, so here are some basic calculations (I've excluded 1896 and 2013 from the data):

Figure 1. Dow Jones Industrial Average 1897-2012 with a logarithmic scale

Figure 2. Dow Jones Industrial Average 1897-2012 - yearly changes

On average, the daily gain during this period was 0.0265 percent. This wasn't a big surprise, since the stock market tends to rise over the long term.

The 10 best days were (the format is year-month-day):
  1. 1933-03-15: 15.34%
  2. 1931-10-06: 14.87%
  3. 1929-10-30: 12.34%
  4. 1931-06-22: 11.90%
  5. 1932-09-21: 11.36%
  6. 2008-10-13: 11.08%
  7. 2008-10-28: 10.87%
  8. 1987-10-21: 10.14% - two days after Black Monday
  9. 1932-08-03: 9.52%
  10. 1939-09-05: 9.52% - Germany invaded Poland on September 1, and on September 5 the US announced its neutrality in the European conflict
The 10 worst days were:
  1. 1987-10-19: -22.61% - Black Monday
  2. 1914-12-14: -20.53% - This was not a true daily drop, since the exchange closed in July because of World War 1 and remained closed until December. Since World War 1, the exchange has remained open through wars, natural disasters, and economic crises.
  3. 1929-10-28: -13.47%
  4. 1899-12-18: -11.99%
  5. 1929-10-29: -11.73%
  6. 1931-10-05: -10.73%
  7. 1929-11-06: -9.92%
  8. 1932-08-12: -8.40%
  9. 1907-03-14: -8.29%
  10. 1932-01-04: -8.10%
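
The numbers above come from plain percentage changes between consecutive closes. Here is a minimal sketch of that calculation, assuming the closes are available as a date-sorted list of (date, close) pairs. The function name daily_changes and the sample values are made up for illustration and are not real DJIA data.

```python
def daily_changes(closes):
    """closes: list of (date, close) pairs sorted by date.

    Returns a list of (date, percent change from the previous close).
    """
    changes = []
    for (_, prev), (date, cur) in zip(closes, closes[1:]):
        changes.append((date, (cur - prev) / prev * 100.0))
    return changes


# Toy data, NOT real index values.
sample = [
    ("2000-01-03", 100.0),
    ("2000-01-04", 110.0),
    ("2000-01-05", 99.0),
    ("2000-01-06", 99.0),
]

changes = daily_changes(sample)
average = sum(pct for _, pct in changes) / len(changes)

# Sort by the percentage change to get the best and worst days.
best = sorted(changes, key=lambda c: c[1], reverse=True)[:10]
worst = sorted(changes, key=lambda c: c[1])[:10]
```

Note that a gap in the data, like the months the exchange was closed in 1914, shows up as one large "daily" change with this approach.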

January 18, 2013

How to import a large csv file into MySQL

I've been fooling around with big data, which in this case means a 188 MB CSV file. You can't read the file each time you use it in your favorite programming language, since it takes too long to load, so what you can do instead is insert the data from the file into a MySQL database. You can't import it into MySQL the way you would a smaller file, since your computer will probably run out of memory, but with a short SQL statement the file is imported in a matter of seconds. It took me some time to Google the answer, and hopefully this post will be faster to find. Anyway, here's the magic code:

LOAD DATA LOCAL INFILE 'c:/Projekt/Big Data/events.csv' INTO TABLE database.table FIELDS TERMINATED BY ',' ENCLOSED BY '"' LINES TERMINATED BY '\n' IGNORE 1 LINES;

IGNORE 1 LINES skips the first line of the CSV file, which holds the column descriptions.
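
To get a feel for what the clauses in that statement mean, here is a small sketch that reads the same kind of file with Python's csv module: comma-separated fields, optionally enclosed in double quotes, with the first line skipped. It only mirrors the idea; MySQL's own parsing rules (escaping, for example) are not identical. The function name read_rows and the sample text are made up for this example.

```python
import csv
import io


def read_rows(text):
    """Parse CSV text and return the data rows without the header line."""
    reader = csv.reader(
        io.StringIO(text),
        delimiter=",",   # like FIELDS TERMINATED BY ','
        quotechar='"',   # like ENCLOSED BY '"'
    )
    next(reader, None)   # like IGNORE 1 LINES: drop the description line
    return [row for row in reader]


# Made-up sample; the quoted fields contain commas that must not split the field.
sample = 'id,name\n1,"Smith, John"\n2,"Doe, Jane"\n'
rows = read_rows(sample)
```

The quoting is what keeps a comma inside a field from being read as a field separator.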

January 13, 2013

Visualize the State of Public Education in Colorado

I've just finished my entry for the "Visualize the State of Public Education in Colorado" competition at Kaggle. The point of this competition is not to create an algorithm or predict something from some data, like the Traveling Santa Problem, but to make the numbers easy to understand with the help of an Infographic.
The background to the competition is as follows. The US State of Colorado wants to improve its schools and has begun to grade each school with a letter ranging from A+ to F. They have now done this for 3 years, and would like to make some sense of all these numbers, which are currently hidden in many spreadsheets. The easiest way to make numbers easy to understand is to make an Infographic.
To create this Infographic, I've been using Inkscape, which is a good, free replacement for the more expensive Illustrator. Most of the colors in the Infographic come from the State of Colorado flag. The code is written in Python and I've used matplotlib to make the graphs.
According to the book Freakonomics, you can find eight factors that are strongly correlated with school test scores:
  1. The child has highly educated parents
  2. The child's parents have high socioeconomic status
  3. The child's mother was thirty or older at the time of her first child's birth
  4. The child had low birthweight
  5. The child's parents speak English in the home
  6. The child is adopted
  7. The child's parents are involved in the PTA
  8. The child has many books in his home
Here it is possible to check whether number 2, "The child's parents have high socioeconomic status", is correct by looking at the segment of the Infographic labelled "What about the poor students". You can clearly see that the percentage of poor students is low in schools with good grades. In the future, ColoradoSchoolGrades.com should maybe add all eight of the points above to their research.


January 9, 2013

Infographic: The Story of a Swedish Empire of Fashion

This is an Infographic covering the Swedish retail chain H&M ($HM). An Infographic could pretty much be described as the art of explaining something difficult in one larger picture. This something difficult could be a lot of numbers, or an event such as the sinking of the Titanic. The goal of the Infographic is that someone without any knowledge of the subject should immediately get what the picture is about. You can read more about it here: An introduction to data visualization.

I believe that the last part of the Infographic was the most interesting one. You can clearly see that both competitors, GAP and Inditex, have more stores, but their profit is still smaller than or similar to H&M's profit. The map is also interesting. You can clearly see that H&M can expand with more stores, and will expand in 2013 with stores in South America, probably in Peru and Brazil.