The Eurovision Song Contest Analyzed
Post - Wednesday, May 14th, 2014
The results of the 2014 Eurovision Song Contest may be of interest from a number of perspectives, e.g. artistic, political and cultural. Here I will focus on the last of these and show how simple pattern recognition tools can be used to analyze the cultural similarities between the participating countries. At the end, the question of whether Austria is the natural winner will be discussed.
In the final round of the contest, 26 songs were judged by the peoples of 37 countries. A weighted voting system was used in which each country had to distribute 1, 2, 3, 4, 5, 6, 7, 8, 10 and 12 points over the 10 best songs of its choice. The remaining 16 songs received 0 points. This yields a set of 37 vectors in a 26-dimensional vector space. Every vector characterizes the preferences of the people of one of the voting European countries. Our assumption is that two countries are culturally similar if their voting vectors are similar. Countries with very different cultural backgrounds are expected to be represented by very distant vectors.
A complicating aspect is that countries are not allowed to give points to their own song. This restriction only applies to the 26 countries with a song in the final round. A culturally very similar country with no song in this round does not have this restriction. I decided to correct for this to some extent by changing the obligatory 0 points for the home country to 12 points, the maximum, assuming that its submission perfectly matches the cultural preference of the nation. This changes the lengths of only some of the vectors. To correct for this, all vectors have been normalized to unit length.
As a distance measure, the Hellinger distance is used, as it performs well in many applications with only non-negative vector components (e.g. for spectra and histograms). The resulting distance matrix is shown on the right. The distance between a country and itself is zero, of course. Small distances are dark, large ones are light.
The matrix as shown in this way is a mess and needs further analysis. Even so, some observations can already be made. Some countries, like Austria, Latvia and Sweden, show many small distances, visible as dark lines. They are similar to a lot of other countries. On the other hand, countries like Armenia and Italy show many larger distances and seem to be (very) different from most others.
If we average all numbers in a row, we obtain the average distance of a country to all other countries. When these averages are sorted in increasing order, countries in the center of the cloud of vectors will be on top and outliers, countries far away from everybody, will be at the bottom. The top of this ranking is given by:
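The construction described so far can be sketched in a few lines of Python. This is only an illustration, not the Matlab code behind this post: the function names and the input format are hypothetical, "unit length" is assumed here to mean unit sum (the usual setting for the Hellinger distance), and the factor 1/sqrt(2) is one common convention that may differ from the one used in PRTools.

```python
import math

# Points awarded to a voter's ten favourite songs, best first.
POINTS = [12, 10, 8, 7, 6, 5, 4, 3, 2, 1]


def vote_vector(ranking, finalists, voter):
    """Build one country's vote vector over the finalist songs.

    ranking:   the voter's favourite finalists, best first (hypothetical format)
    finalists: the countries with a song in the final, in a fixed order
    voter:     the voting country
    """
    points = dict(zip(ranking, POINTS))
    vec = [float(points.get(song, 0)) for song in finalists]
    if voter in finalists:
        # Home-country correction from the text: the forced 0 becomes 12.
        vec[finalists.index(voter)] = 12.0
    return vec


def normalize(vec):
    """Scale a vector to unit sum (assumption, see the lead-in)."""
    total = sum(vec)
    return [v / total for v in vec]


def hellinger(p, q):
    """Hellinger distance between two non-negative vectors with unit sum."""
    s = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(s / 2.0)
```

With this convention the distance lies between 0 (identical voting) and 1 (no song in common), so a full distance matrix is obtained by calling `hellinger` on every pair of normalized vote vectors.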
Austria Sweden Estonia Portugal Latvia Netherlands Finland ….
and the end by:
… Malta Italy San-Marino Belarus Montenegro Albania Armenia Azerbaijan.
This ranking should be interpreted as follows: the first countries have many similar ones, whereas the countries at the end of the list have only a few.
An interesting table can be derived by finding the nearest neighbors of every country in the space. These are the countries that come first when each row of the distance matrix above is sorted:
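As a minimal sketch of this ranking step, assuming the distances are available as a square list of lists `dist` with a matching list `names` (both hypothetical names), the ordering could be computed as follows. The zero on the diagonal is left out of the average here; including it would rescale all averages by the same factor and leave the order unchanged.

```python
def rank_by_mean_distance(names, dist):
    """Sort countries by their mean distance to all other countries.

    The most central country (smallest mean distance) comes first.
    """
    n = len(names)
    means = {}
    for i, name in enumerate(names):
        others = [dist[i][j] for j in range(n) if j != i]
        means[name] = sum(others) / len(others)
    return sorted(names, key=lambda name: means[name])
```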
|Country||neighbor 1||neighbor 2||neighbor 3||neighbor 4||neighbor 5|
This table shows for every country the 5 countries that have voted in the most similar way. Some of the results are as expected, e.g. see Armenia, Finland or Russia. Others, like San-Marino, are rather surprising. This might be caused by a small population and thereby a high noise level. Of particular interest are the mutual nearest neighbors that can be observed. These are the countries that are the nearest neighbor of their own nearest neighbor, thereby forming clear twins. Some of them can be directly understood, but it is still striking that they show up, e.g. Sweden-Iceland, Russia-Belarus, Latvia-Estonia and Hungary-Austria. The connection Spain-Belgium, on the other hand, is far from obvious, at least for me.
A drawback of the above analysis is that it defines the local neighborhood of every vector by a fixed number of 5 neighbors. For high-density areas this will be too small, for others too large. It might be better to create a picture. The vector space under consideration, however, is 26-dimensional. 2-dimensional scatterplots can be made, but they obviously have strong limitations. It is therefore necessary to use several procedures. Below two results are shown, one based on principal component analysis (PCA, left) and one on multi-dimensional scaling (MDS, right).
Globally they are similar. The two plots had to be aligned by rotation, as the MDS result (right) is rotation invariant, so its orientation is arbitrary. Still there are some differences, e.g. the locations of Russia, Belarus and Greece. Both 2D projections reflect the multi-dimensional configuration only partially. The PCA (left) is a linear projection that minimizes the total error, roughly the sum of the differences between the visible and the true distances. So, some distances might be correct and others very wrong. The MDS is a non-linear procedure that minimizes the error expressed in the ratios between the visible and true distances. It thereby distributes the error more uniformly over all distances, avoiding very large ones.
Another way to visualize the original configuration is hierarchical clustering. It results in a dendrogram, a binary tree graph in which two clusters are merged at each node, starting from the individual vectors and finally merging everything into one big cluster. In each step the two clusters having the smallest distance are merged. In the complete linkage procedure that is presented below, the distance between two clusters is defined as the distance between their two most remote vectors.
If the merging of clusters is stopped when just three clusters are left, there are two smaller ones (pink and dark blue) and one big one. In order to have more evenly sized clusters, the largest cluster is split again into three clusters (light blue, green and yellow) by considering its merging at a somewhat lower level. In this way five clusters are found:
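The neighbor table and the mutual nearest neighbors can be derived directly from the distance matrix. The following is a small sketch with hypothetical function names, again assuming a square list of lists `dist` and a matching list `names`; it is not the code used to produce the table above.

```python
def nearest_neighbors(names, dist, k=5):
    """For every country, list its k nearest neighbors, closest first."""
    n = len(names)
    table = {}
    for i, name in enumerate(names):
        # Sort all other countries by their distance to country i.
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        table[name] = [names[j] for j in order[:k]]
    return table


def mutual_pairs(table):
    """Find pairs of countries that are each other's nearest neighbor."""
    pairs = set()
    for a, neighbors in table.items():
        b = neighbors[0]
        if table[b][0] == a:
            pairs.add(tuple(sorted((a, b))))
    return sorted(pairs)
```

Ties between equal distances are broken by the order of `names`, which matters little for real-valued distances but is worth keeping in mind for noisy data.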
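The difference between the two kinds of error can be made concrete with a small sketch. The two functions below are illustrative criteria only, written to mirror the distinction made above between errors in differences and errors in ratios; they are assumptions and not necessarily the exact criteria optimized by the PCA and MDS routines used for the plots.

```python
def absolute_error(true_d, shown_d):
    """Sum of squared differences between true and displayed distances."""
    n = len(true_d)
    return sum(
        (true_d[i][j] - shown_d[i][j]) ** 2
        for i in range(n)
        for j in range(i + 1, n)
    )


def relative_error(true_d, shown_d):
    """Sum of squared relative differences: each pair is scaled by its true distance."""
    n = len(true_d)
    return sum(
        ((true_d[i][j] - shown_d[i][j]) / true_d[i][j]) ** 2
        for i in range(n)
        for j in range(i + 1, n)
    )
```

An error of 1 on a true distance of 1 and an error of 1 on a true distance of 10 count equally in the first criterion, whereas in the second the error on the small distance weighs 100 times more. A procedure minimizing the second criterion is therefore pushed to keep small distances accurate as well.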
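To make the merging rule explicit, here is a deliberately simple sketch of complete linkage clustering on a distance matrix. It is written for clarity rather than speed, the function name is hypothetical, and it stops at a requested number of clusters instead of building the full dendrogram.

```python
def complete_linkage(dist, n_clusters):
    """Agglomerative clustering with complete linkage.

    dist is a symmetric matrix of pairwise distances. The result is a
    list of clusters, each a sorted list of item indices.
    """
    clusters = [[i] for i in range(len(dist))]

    def linkage(a, b):
        # Complete linkage: distance between the two most remote members.
        return max(dist[i][j] for i in a for j in b)

    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        _, x, y = min(
            (linkage(clusters[x], clusters[y]), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        merged = sorted(clusters[x] + clusters[y])
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return sorted(clusters)
```

For 37 countries the cubic cost of this naive loop is irrelevant; for larger data sets a library implementation would be the better choice.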
- San-Marino, Ukraine, Moldova, Israel, Georgia, Russia, Belarus, Azerbaijan, Poland, Armenia
- Germany, France, United-Kingdom, Netherlands, Finland, Sweden, Iceland, Denmark
- Switzerland, Slovenia, Lithuania, Norway, Ireland, Romania, Portugal, Spain, Belgium
- Greece, Latvia, Estonia, Hungary, Austria
- Malta, Italy, Montenegro, Macedonia, Albania
Cluster 1 clearly has an East-European orientation, while cluster 2 contains many countries from Western and Northern Europe. Cluster 5 is closely related to the Adriatic Sea. Cluster 3 is really a mixture of everything. I leave it to the sociologists and cultural anthropologists to draw further conclusions, if any.
The result of the clustering has been used to reorder the list of countries. Reshuffling the rows and columns of the original distance matrix accordingly results in the distance matrix presented in the image on the right. Here the structure in the data is much more visible than in the original matrix above.
From this one-day project a few questions arise. First, how stable and significant are the relations and clusters that have been found? Results of other years need to be analyzed to answer this question. In addition, the tools and measures that have been used were chosen rather arbitrarily: the Hellinger distance, the cluster analysis, the data visualization tools. Other ones might be considered.
Finally, is it by accident that the winner of the contest, Austria, pops up as the most central country, because its average distance to all other participants is minimal? It is likely that the central person in a group, being best able to communicate with all others, will also be the most popular one. From that perspective Austria has won the competition because its culture, as defined here, is the most central in Europe. However, this conclusion has to be rejected, as it would imply that Austria will win (or end up close to that) every year, which is not true. Here we have a typical example of overtraining. As we define culture by the preferences in the contest, the conclusions cannot be used for predicting results of the same contest.
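Reordering amounts to applying the same permutation to the rows and the columns of the matrix. A minimal sketch, with hypothetical function names and assuming the clusters are given as lists of country indices:

```python
def cluster_order(clusters):
    """Concatenate clusters (lists of indices) into one ordering."""
    return [i for cluster in clusters for i in cluster]


def reorder(dist, order):
    """Apply the same permutation to the rows and columns of a matrix."""
    return [[dist[i][j] for j in order] for i in order]
```

Because rows and columns are permuted together, the diagonal stays zero and the matrix stays symmetric; only the position of each country changes, so that members of the same cluster end up in adjacent rows and columns.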
The Matlab m-file used in preparing this blog can be downloaded from here. It uses PRTools version 5.1.1.