Mathemagics

3 07 2008

Came across this TED talk by Arthur Benjamin demonstrating phenomenal mathematical calculations. He starts with squaring two digit numbers and then progresses to three, four and even five digits. Another trick demostrated is guessing (correctly of course) the day of the week, given a date. He has been featured in Readers Digest as well.





Clustering large datasets

2 07 2008

I was reading about the BIRCH clustering algorithm which is used to cluster large data sets. Its quite popular in the database community. I will try to summarize the paper here.

The algorithm tries to address the problem of clustering large data sets, where the entire dataset cannot fit into memory. The popular clustering algorithms and some flaws with them in the context of large datasets are:

Probability based approaches: These assume that probability distributions on separate attributes are statistically independent of each other.

Distance based approaches: These assume that all data points are available at once and can be scanned frequently.

The primary problem is that algorithms dont consider the fact that the dataset can be too large to fit into the main memory. BIRCH is specially suited for large datasets.  BIRCH can typically find a good cluster with a single scan of the data and improve the cluster quality with additional scans. Its I/O is linear in the size of the data.

The keystones of BIRCH are the concepts of Clustering Features(CF vector) to describe a cluster and the CF Tree. A triplet is used to summarize each cluster as a CF vector, comprising of a) Number of points in cluster b) Linear sum of vectors in cluster c) Square sum of vectors in cluster. The CF vecor is efficient because it stores much less than all the datapoints in the cluster. The representation is optimal because it allows for all the calculations needed to make clustering decisions in BIRCH.

A CF Tree is a balanced tree used to store the clusters in a hierachial manner. The algorithm for insertion of points into the tree ensure that we end up with a good cluster at the end of a first scan of the data. I refer the reader to the paper for details of this data structure.

Clustering is now performed on the CF tree, which being a much reduced representation of the dataset, does fit into the main memory. There might be flaws in the clustering because of skews in the input order of the data. For example, the same data point, if inserted twice at different times might end up in differnt leaves of the tree. These anomalies are taken care of in the post processing steps.

An overview of the clustering is found in the figure below:





Muhammad Yunus on social entrepreneurship

17 06 2008

I came across a very moving talk by Muhammad Yunus at Stanford GSB.  He begins the talk with details about the inspiration for the microcredit revolution. Yunus was a professor of Economics at Chittagong university which has several villages around it. He did a survey in a village and found that a total of 42 people had borrowed 27$ from the moneylender (at extraordinary rates)!! Since this seemed to be a problem which he could personally solve, he started by giving loans to these people for investing into income generating activities. Initially he tried borrowing money from banks, offering himself as a guarantor because the banks did not consider the poor as credit worthy. However haggling with the banks was a very tedious process and he decided to start a bank for this purpose. Grameen Bank currently has 3.5 million borrowers, 95% of them being women. The average loan size is about 200$.

Yunus then moves on to talk about other social changes he tried to bring in.  He encouraged the “grameen families” to educate their children. This was a big step forward because most of the borrowers were illiterate. Student loans and scholarships were instituted to help students pursue higher education.

When the Bangladesh government was liberalizing the telecom sector, Grameen phone was created. It is the largest telecom company in Bangladesh now. A drive to bring the telephone to the villages was started. Currently there are 55 thousand “telephone ladies” in the villages of Bangladesh. They get loans to purchase the mobile equipment, and then offer the service of communication to villagers for a cost. Since some of the villages dont have electricity, solar panels have been installed to charge these mobile phones. This has turned out to be a very successful business for the borrowers.
Another social initiative was to try and bring beggars into mainstream society. Grameen Bank introduced beggars to local merchants, guaranteed upto 2000 Takas per beggar so that they could borrow goods from the merchants. These beggars became a link between the shopkeeper and the household. According to Yunus, it was a big hit in Bangladesh because the women are not encouraged to step out of their houses. Thus the beggars could start earning by providing the service as a middle-man.

This concept of addressing poverty through business is mind blowing. Philantrophy money is always limited, and the effort is to make the poor self sufficient. Yunus’ ideas are really simple, sometimes downright obvious, yet the kind of change they have brought about it profound.





J K Rowling:Harvard commencement speech

12 06 2008

A really nice speech by J K Rowling at the Harvard commencement. Video and full text of the speech is presented in the link. She begins in a mirthful note before talking about the importance of failures and that of imagination.





Barack Obama speech

10 06 2008

The first speech by Barack Obama that I have come across. Given at the graduation ceremony of Southern New Hampshire University (in 2007), the speech has some lessons to learn about leadership and making decisions in life.

The key messages as summarized here are:

  • The world doesn’t just revolve around you” — you have to learn to see things through other people’s eyes.”
  • “Challenge yourself. Take some risks in your life.”
  • “Persevere”.




Google Trends for exit polls!!

9 06 2008

Came across this nice study on Slashdot. It tries to analyse the correlation between the US presidential caucauses and the Google trends for the candidates during this period. A nice effort at using sometime widely available to draw demographic conclusions.





Face recognition for advertising

5 06 2008

Bill boards have hitherto lacked the capability to provide accurate metrics which is available along with internet advertising these days. However that might soon change, if some companies are to have their way.

An article on NYTimes talks about two companies, TruMedia and Quividi, which are providing solutions to provide metrics for people who look at a billboard. They install billboards with cameras, and then use face recognition to provide age and gender information of the people who looked at the billboard. There is talk of providing the racial information of viewers as well, to further target ads. The goal would be “to show one advertisement to a middle-aged white woman, for example, and a different one to a teenage Asian boy“. Such solutions will provide much needed desired metrics which define the effectiveness of a billboard advertisement.

The ad is equipped with a camera that gathers details on passers-by.

The ad is equipped with a camera that gathers details on passers-by.





A supercomputer on your desk!!

1 06 2008

A research group at the University of Antwerp has developed a PC consisting of gaming hardware (NVIDIA GeForce 9800 GX2) . It uses 4 dual GPU graphics cards, each consisting of 128 subprocessors and the total setup costs about 4000 Euros. It uses Nvidia’s CUDA technology to perform computations in parallel (this is the key to exploiting the maximum mileage out of the system). This machine is called FASTRA, and can deliver the same performance as a supercomputer consisting of hundreds of PCs. The group does research on computational tomography which is extremely computation intensive. Its a nice breakthrough and would hopefully be useful for research labs short on cash. Here is a video from the group detailing the FASTRA





New image recognition algorithm

27 05 2008

A paper from MIT to appear in CVPR 2008 tries to push the frontiers of object recognition. One of the lessons from modern search engines is that very simple algos give remarkable performance by using data on an internet scale. However, to apply such an approach to object recognition is computationally intractable..right from the task of even downloading 80M images to experimenting with this huge dataset.

This is the motivation to find shorter numerical representations that can be derived from an image that will provide a useful indication of its content. A short representation for an image will allow for real time solutions for object recognition tasks, which are otherwise extremely computationally intensive. As described in this news article, it has been shown than representing images as a compact binary code (as few as 256 bits per image) captures the essential contents of an image. With these codes, a database of 12.9 million images takes up less than 600MB of memory. This can easily fit on the memory of a PC. Thus such a representation of images would enable real time searches over millions of images on a single PC. The object recognition results obtained in the paper (with the short codes used to represent an image) are comparable to the results obtained using full descriptors. Because of the information lost by reducing the bits representing an image, complex or unusual images are less likely to be correctly matched. But for the most common objects in pictures–people, cars, flowers, buildings–the results are quite impressive.

Che Guevara

The same project also created a cool app called the visual dictionary. A list of all nouns in the English language was obtained from Wordnet. Images for each word were obtained by Google image search. Each tile in the image above is the average of 140 images. This average represents the dominant visual characteristics of each word. The average could be a recognizable image (as Che Guevara above) or a colored blob. The tiles are arranged such that the proximity of two tiles is indicative of their semantic distance. Thus the poster explores the relationship between visual and semantic similarity.





SIGGRAPH animation festival

25 05 2008

The nominees for the ACM SIGGRAPH computer animation festival awards have been announced. Here is a list of the nominees. The winning shorts from 2007 can be viewed online here. Most of these clips are mind blowing.