Wikipedia makes computers smarter
Researchers Use Wikipedia To Make Computers Smarter
The idea here is to be able to cluster keywords based on their meanings. The example given in the article: let’s say you’re trying to block those annoying vitamin supplement spam e-mails. You might set your client to flag e-mails containing the word “vitamin” as spam. However, let’s say an e-mail comes in with the term “B12” in it. A human would easily recognize that there is a strong possibility the e-mail is referring to vitamin B12, but the spam filter, having no instructions for “B12” and no ability to correlate “B12” with “vitamin”, would allow the e-mail through.
This type of clustering is not new. It has been done many times in the past, including on the Web. However, these technologies have needed to process millions if not billions of web pages to be able to perform such keyword clustering across a wide range of topics. For example, let’s say you have a crawler which has processed 500 million web pages. There is a good chance that in those 500 million web pages, the terms “vitamin” and “B12” were found together (most likely adjacent to each other, as “vitamin B12”). Examples of such pages would be vitamin supplement merchants or health information websites. Having observed co-occurrences of these two terms, the crawler can develop a correlation factor. Maybe 70% of the time the term “B12” was found, the term “vitamin” also occurred. (The other 30% of the time, maybe B12 referred to an apartment number, the name of a rocket, who knows.) So a spam filter which can perform this kind of analysis would be able to reasonably infer that an e-mail with the term “B12” is likely to be related to the term “vitamin” and thus should be flagged as spam.
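To make the co-occurrence idea concrete, here is a minimal Python sketch. It is not how any real crawler or the researchers' system works: the four-page corpus, the function names `cooccurrence_rate` and `looks_like_spam`, and the 0.6 threshold are all made up for illustration.

```python
def cooccurrence_rate(docs, term, other):
    """Fraction of documents containing `term` that also contain `other`."""
    with_term = [d for d in docs if term in d]
    if not with_term:
        return 0.0  # term never seen, so no evidence either way
    return sum(1 for d in with_term if other in d) / len(with_term)


def looks_like_spam(message, blocked_terms, docs, threshold=0.6):
    """Flag a message if any token is a blocked term, or co-occurs with
    a blocked term often enough in the corpus (threshold is an assumption)."""
    tokens = set(message.lower().split())
    for tok in tokens:
        for blocked in blocked_terms:
            if tok == blocked:
                return True
            if cooccurrence_rate(docs, tok, blocked) >= threshold:
                return True
    return False


# Toy stand-in for the crawled pages: each page becomes a set of tokens.
pages = [
    "vitamin b12 supplements on sale",
    "why vitamin b12 matters for your health",
    "apartment b12 for rent downtown",
    "vitamin c and the common cold",
]
docs = [set(p.split()) for p in pages]

rate = cooccurrence_rate(docs, "b12", "vitamin")  # 2 of the 3 "b12" pages
flagged = looks_like_spam("Cheap B12 today", {"vitamin"}, docs)
```

With a corpus this small, common words like “for” also pick up accidental correlations, which is exactly why this approach needs so much data to work well.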
Again, this type of analysis would only be possible after processing vast amounts of training data, probably billions of web pages. And since web content is uncontrolled, there will be more noise in the recorded correlations (e.g. let’s say one of the web pages processed is the key to a crossword puzzle: two completely unrelated terms like “Shakespeare” and “Brett Favre” may be found together).
Using Wikipedia is essentially a massive shortcut. Wikipedia is controlled, its content is consistently high quality, and it is very dense with keywords (there are probably better industry terms for these concepts but I don’t know them). Also, Wikipedia has a beautiful internal link network: articles are connected to one another in many ways. By using Wikipedia as a training set, the amount of computational effort is reduced by orders of magnitude. There is little wasted time and little overlap. Every Wikipedia page is (or trends toward) comprehensive knowledge for a unique topic.
Artificial intelligence, of any kind, relies on humans to train it. As I said, one form of training data is the billions of web pages that humans have created. Others are human-intensive efforts like the MIT OpenMind CommonSense Project. In a sense, Wikipedia is the richest training set yet. Even though it was created for the purpose of helping humans, it will help computers (help us) as well. As mentioned in the article, the uses of this are many: search, spam detection, natural language processing, etc. Very, very exciting.
On a side note: Bayesian spam filters, or so-called “learning” spam filters, which are now very common, operate on a similar principle. Their training set is generally created by each user. As you receive e-mails and mark them as legitimate or spam, the Bayesian spam filter learns which terms lead to a high probability of spam. These probabilities are refined over time as the user corrects false positives and false negatives. While these spam filters are generally very effective, they have no ability to deal with e-mails which contain terms they have not seen before. These filters have no knowledge of what terms mean; they just store simple probabilities for terms they have seen before.
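Here is a rough Python sketch of that kind of filter, mostly to show the unseen-term blind spot. It is a simplified naive Bayes with add-one smoothing and an even prior, not the algorithm of any particular mail client; the class name `TinyBayesFilter` and the training messages are invented for the example.

```python
import math
from collections import Counter


class TinyBayesFilter:
    """Toy per-user spam filter: counts which terms appear in spam vs. ham."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    def train(self, text, label):
        # Called whenever the user marks a message as "spam" or "ham".
        self.messages[label] += 1
        self.counts[label].update(set(text.lower().split()))

    def spam_probability(self, text):
        log_odds = 0.0  # assumption: spam and ham are equally likely up front
        for tok in set(text.lower().split()):
            s = self.counts["spam"][tok]
            h = self.counts["ham"][tok]
            if s + h == 0:
                continue  # unseen term: the filter knows nothing about it
            # Add-one smoothing so a count of zero never gives log(0).
            p_spam = (s + 1) / (self.messages["spam"] + 2)
            p_ham = (h + 1) / (self.messages["ham"] + 2)
            log_odds += math.log(p_spam / p_ham)
        return 1 / (1 + math.exp(-log_odds))


f = TinyBayesFilter()
f.train("buy vitamin pills now", "spam")
f.train("cheap vitamin supplements", "spam")
f.train("lunch meeting at noon", "ham")
f.train("notes from the meeting", "ham")

seen = f.spam_probability("vitamin sale")  # leans toward spam
unseen = f.spam_probability("b12 sale")    # no known terms: stays at 0.5
```

The second message gets a coin-flip score because the filter has never seen “b12” and has no way to connect it to “vitamin”.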
Are you bored? Make yourself useful!
MIT OpenMind Commonsense Project
Learner (you don’t even need to register for this one)
Peekaboom – It’s a fun game and you can see the fruits of your labor using Peekasearch!
Note: If you’re really bored and want something geek-funny, read the history of the real Mechanical Turk
