Archive for the ‘knowledge’ tag
Is Knowledge Overrated?
Last week I was channel surfing while eating dinner in front of the TV. After failing to find anything interesting, I gave up and settled on Jeopardy. I used to watch Jeopardy sometimes as a kid and I recall being amazed at how smart the better contestants were. Not only did they have vast amounts of knowledge, but more importantly could recall it in just one or two seconds. While watching the show last week, I found myself equally amazed as before. One particular contestant fired off answers to 15th century European history questions (or more correctly “questions for answers”) so quickly I thought he must have just finished authoring a book on the subject.
By the time the Final Jeopardy round came, I was done with dinner and in front of my laptop. The question (err answer, whatever..) was given and (of course) I had not the faintest clue. For the heck of it, I typed some relevant keywords into Google and bingo! I solved Final Jeopardy. It turned out that two of the three contestants also solved it correctly. Sure, they couldn’t use Google to find the answer, but at the end of the day, in the real world, I would be just as effective as these two contestants. This realization got me thinking about the value of knowledge.
What’s the purpose of knowledge? My answer would be that knowledge allows one to perform a task more efficiently. Any task. The task could be conquering the latest shoot-em-up video game, cooking dinner, or solving a complex mathematics proof. Each of these tasks can be performed faster if the individual performing it has relevant knowledge in the respective domain. The problem, though, is that for knowledge to be useful, it’s not sufficient to have once gained it. You must also be able to recall it both accurately and in a timely manner. Without speed and accuracy in recollection, having knowledge is useless.
How useless? The test is this: Your brain or an electronic/digital means, which is faster? Which will give me the info I need faster, more accurately, and more consistently? Nine times out of ten, my answer to this question is the latter. Honestly, I wonder why I even bother remembering anything. Phone numbers? Got my cell. Favorite restaurants? Got Yelp. CS stuff? Google works just fine. Forgot who my friends are? Got my AIM buddy list, MySpace and Facebook. How to get home from work? Got a navigation system. How to spell my name? Outlook auto-corrects it. Ok, I think you get my point…
The thing you need to remember is that all the electronic information sources I just mentioned came very recently. You think finding this information is easy now? Trust me, it’ll get easier. I’m just waiting for Google to announce a search plugin for your brain. Sounds ridiculous but is it really that crazy to imagine such a device might be available in my lifetime?
Now, you might argue that humans are capable creatures because of our intelligence, not simply our knowledge. Intelligence implies not just semantic knowledge but the ability to combine building-blocks of knowledge into composite forms of knowledge and, ultimately, to innovate. Innovation, after all, is a hallmark of human civilization. Innovation implies a certain higher level of thought which only a human can perform. You could say that freeing our minds of the burden of knowledge management will allow our mind to focus on innovation and other forms of higher-level thought.
But this implies that computers cannot perform high-level thought. Computers can be given “intelligence”. It is very common today to program computers to make sophisticated decisions based on input data. Due to complexity or other limitations, many such decisions were once thought impossible for a computer to make. A classic example is chess. A couple hundred years ago, the thought of a chess-playing machine was just a big joke. Ten years ago, IBM’s Deep Blue computer beat the greatest chess player of our time.
But can computers innovate? If you look up the word “innovation”, the word “new” is mentioned repeatedly: new ideas, new dimensions, something new, etc. Convention has it that computers cannot think “outside of the box”. While computers can perform sophisticated logic and are able to “learn” patterns, they can’t really form new thought. A recent example comes from an article I was reading about Monitor110. They have developed some proprietary technology that allows their software to scour niche information sources on the Web (blogs, message boards etc.) and pick out potentially market-moving news before it hits the mainstream. So, their software can pick out the bits of signal from the noise, but it cannot determine if and how to act on that information to bring financial reward and, moreover, outperform the rest of the market (the common term is “generate alpha” in the alternative-investment world). The formation of a unique investing strategy can only be performed by the human investor staring at the computer screen. The investor may use computer-based modeling tools to aid in developing the strategy, but the high-level strategy is still up to him to devise.
Will computers one day be able to innovate? Maybe. If and when scientists are better able to model the human brain, it may turn out that deep-down, it is, in fact, a deterministic system. If that is the case, it may be possible to model the human brain electronically.
Until that day, though, I do think, on the basis of the test I put forth earlier, that much of the knowledge in people’s brains is truly useless. Instead of just giving students knowledge, it is more important to teach them how to efficiently find knowledge when the situation demands it.
“Give a man a piece of knowledge and you feed him for a day. Teach a man to locate knowledge and you feed him for a lifetime.” Yeah I know..I’m a dork.
Ok, it’s way past my bedtime again. Maybe I’ll continue this thought in a later post…
Wikipedia makes computers smarter
Researchers Use Wikipedia To Make Computers Smarter
The idea here is to be able to cluster keywords based on their meanings. The example given in the article: let’s say you’re trying to block those annoying vitamin supplement spam emails. You might set your client to flag emails containing the word “vitamin” as spam. However, let’s say an email comes in with the term “B12” in it. A human would easily recognize that there is a strong possibility the e-mail is referring to the vitamin B12, but the spam filter, having no instructions for B12 nor the ability to correlate “B12” to “vitamin”, would allow the e-mail through.
This type of clustering is not new. It has been done many times in the past, including on the Web. However, these technologies have needed to process millions if not billions of web pages to be able to perform such keyword clustering across a wide range of topics. For example, let’s say you have a crawler which has processed 500 million web pages. There is a good chance that in those 500 million web pages, the terms “vitamin” and “B12” were found together (most likely adjacent to each other, “vitamin B12”). Examples of such pages would be vitamin supplement merchants or health information websites. Having observed co-occurrences of these two terms, the crawler would develop a correlation factor. Maybe 70% of the time the term “B12” was found, the term “vitamin” also occurred. (The other 30% of the time maybe B12 referred to an apartment number, the name of a rocket, who knows..) So, a spam filter which can perform this kind of analysis would be able to reasonably infer that this e-mail with the term “B12” is likely to be related to the term “vitamin” and thus should be flagged as spam.
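To make the co-occurrence idea a bit more concrete, here is a rough sketch in Python. It is my own toy illustration, not the researchers' method: the four-page "corpus", the function names `cooccurrence_rate` and `is_spam`, and the 0.5 threshold are all made up for the example. Each page is reduced to a set of lowercase terms, and the "correlation factor" is simply the fraction of pages containing one term that also contain the other.

```python
def cooccurrence_rate(pages, term, other):
    """Fraction of pages containing `term` that also contain `other`."""
    with_term = [page for page in pages if term in page]
    if not with_term:
        return 0.0  # term never seen: no evidence of any relationship
    return sum(1 for page in with_term if other in page) / len(with_term)


def is_spam(email_terms, pages, blocked="vitamin", threshold=0.5):
    """Flag an e-mail if it contains the blocked term, or any term that
    co-occurs with the blocked term at least `threshold` of the time."""
    if blocked in email_terms:
        return True
    return any(
        cooccurrence_rate(pages, term, blocked) >= threshold
        for term in email_terms
    )


# Hypothetical toy corpus; a real crawler would see millions of pages.
pages = [
    {"vitamin", "b12", "supplement"},
    {"vitamin", "b12", "health"},
    {"apartment", "b12", "rent"},
    {"vitamin", "c"},
]

rate = cooccurrence_rate(pages, "b12", "vitamin")  # 2 of the 3 "b12" pages
flagged = is_spam({"b12", "cheap"}, pages)         # caught via "b12"
```

With only four pages the numbers mean nothing, of course. The point is just that the filter never needed an explicit rule for “B12”; the association falls out of the counts.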
Again, this type of analysis would only be possible after processing vast amounts of training data – billions of web pages probably. And since web content is uncontrolled, there will be a higher level of chaos in the recorded correlations (e.g. let’s say one of the web pages processed is the key to a crossword puzzle: two completely unrelated terms like “Shakespeare” and “Brett Favre” may be found together).
Using Wikipedia is essentially a massive shortcut. Wikipedia is controlled, it’s 100% high-quality knowledge and is very dense with keywords (there are probably better industry terms for these concepts but I don’t know them). Also, Wikipedia has a beautiful internal link network – articles are connected to one another in many ways. By using Wikipedia as a training set, the amount of computational effort is diminished by orders of magnitude. There is no wasted time and no overlap. Every Wikipedia page is (or trends toward) comprehensive knowledge for a unique topic.
Artificial intelligence, of any kind, relies on humans to train it. As I said, one form of training data is the billions of web pages that humans have created. Others are human-intensive efforts like the MIT OpenMind CommonSense Project. In a sense, Wikipedia is the richest training set yet. Even though it was created for the purpose of helping humans, it will help computers (help us) as well. As mentioned in the article, the uses of this are many: search, spam detection, natural language processing, etc. Very, very exciting.
On a side note: Bayesian spam filters, or so-called “learning” spam filters, which are now very common, operate on a similar principle. Their training set is generally created by each user. As you receive e-mails and mark them as legitimate or spam, the Bayesian spam filter learns which terms lead to a high probability of spam. These probabilities are refined over time as the user corrects false positives and false negatives. While these spam filters are generally very effective, they have no ability to deal with e-mails which contain terms they have not seen before. These filters have no knowledge of what terms mean; they just store simple probabilities for terms they have seen before.
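For the curious, a bare-bones version of such a filter might look something like the sketch below. This is a simplified naive Bayes toy under my own assumptions (equal prior odds for spam and legitimate mail, add-one smoothing, each term counted once per e-mail), and the class name `TinyBayesFilter` and the training e-mails are invented for illustration. Real filters are considerably more refined.

```python
import math
from collections import Counter


class TinyBayesFilter:
    """Toy naive Bayes spam filter trained on user-labelled e-mails."""

    def __init__(self):
        # Number of e-mails of each label in which a term appeared.
        self.counts = {"spam": Counter(), "ham": Counter()}
        # Number of e-mails seen for each label.
        self.totals = {"spam": 0, "ham": 0}

    def train(self, terms, label):
        """Record one e-mail the user marked as 'spam' or 'ham'."""
        self.counts[label].update(set(terms))
        self.totals[label] += 1

    def spam_probability(self, terms):
        """Combine per-term evidence as log-odds, assuming equal priors."""
        log_odds = 0.0
        for term in set(terms):
            in_spam = self.counts["spam"][term]
            in_ham = self.counts["ham"][term]
            if in_spam + in_ham == 0:
                continue  # never-seen term: the filter has nothing to say
            # Add-one smoothing keeps the ratio finite when a count is zero.
            p_spam = (in_spam + 1) / (self.totals["spam"] + 2)
            p_ham = (in_ham + 1) / (self.totals["ham"] + 2)
            log_odds += math.log(p_spam / p_ham)
        return 1 / (1 + math.exp(-log_odds))


spam_filter = TinyBayesFilter()
spam_filter.train(["vitamin", "cheap", "pills"], "spam")
spam_filter.train(["vitamin", "discount"], "spam")
spam_filter.train(["meeting", "tomorrow"], "ham")
spam_filter.train(["lunch", "tomorrow"], "ham")

seen = spam_filter.spam_probability(["vitamin"])  # leans strongly toward spam
unseen = spam_filter.spam_probability(["b12"])    # stuck at the neutral prior
```

Notice what happens with “b12”: the filter has never seen the term, so it contributes no evidence and the score stays at the starting point. That is exactly the blind spot described above.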
A question for you all
I’ve been doing some Kurzweil-inspired thinking lately and I have a question for you all:
What percentage of the knowledge in your brain can be found on the Web? In other words, let’s say you were able to express all the knowledge in your brain as statements of fact. What percentage of those statements would you be able to find on the Web?
Follow up question:
Think about that percent of knowledge that cannot be found on the Web. What kind of knowledge is it? What does it pertain to?
I have my own set of answers to these questions which I will be sharing in an upcoming mini-essay I’m writing, but I was hoping that some of you might post a comment with your own answers to these questions.
Thanks for your help!
“How to Make Wealth” by Paul Graham
Out of all the RSS feeds that I subscribe to, Paul Graham’s Essays is my favorite. When I see a new item on his feed, I usually pause what I’m doing and read it. His latest essay this month is titled “How to Make Wealth” and it discusses the difference between money and wealth and how understanding this difference is fundamental to understanding entrepreneurship. Some of my favorite quotes:
Someone graduating from college thinks, and is told, that he needs to get a job, as if the important thing were becoming a member of an institution. A more direct way to put it would be: you need to start doing something people want. You don’t need to join a company to do that. All a company is is a group of people working together to do something people want. It’s doing something people want that matters, not joining the group.
To get rich you need to get yourself in a situation with two things, measurement and leverage. You need to be in a position where your performance can be measured, or there is no way to get paid more by doing more. And you have to have leverage, in the sense that the decisions you make have a big effect.
The problem with working slowly is not just that technical innovation happens slowly. It’s that it tends not to happen at all. It’s only when you’re deliberately looking for hard problems, as a way to use speed to the greatest advantage, that you take on this kind of project. Developing new technology is a pain in the ass.
This is a good plan for life in general. If you have two choices, choose the harder. If you’re trying to decide whether to go out running or sit home and watch TV, go running. Probably the reason this trick works so well is that when you have two choices and one is harder, the only reason you’re even considering the other is laziness. You know in the back of your mind what’s the right thing to do, and this trick merely forces you to acknowledge it.
Information overload
It’s approaching 3AM right now and I’m not asleep. In fact, over the past year, my bedtime has gotten later and later and later. Why, you ask? Partly it’s because I’ve been busy working on my startup Dontbuyjunk and I’m often working late into the night until I’m satisfied with the progress that I’ve made for the day. But I’m increasingly finding that what is really preventing me from getting to bed is information overload courtesy of the Internet. Let me explain.
I’ve been spending hours per day on the Internet for several years now. The big difference though is that recently the time I spend is shifting away from entertainment (mindless chatting on message boards, gaming, etc.) to information exchange activities such as reading/writing in the blogosphere. Every night, after I’m done working, I do one last catch up with my RSS reader and almost without fail, I end up spending a couple hours bouncing from one blog to the next and then to aggregators like del.icio.us and memeorandum.
Today, publishing (via the Web) is essentially free. And when I say “free” I mean that it both has no cost and is without rules or barriers. Furthermore, the second you publish your content, it is instantly accessible to a billion people. Because of all this, the rate at which information is created and disseminated is astonishing. So this is a good thing, right?
Well…sure. Enabling people to express and share both knowledge and opinions is great for society in countless ways. The problem that develops is that with so much publishing going on, how can I keep track of that tiny subset of information that is relevant, unique (remember that the majority of content published every day is either syndication or basically duplicate) and valuable in my world? It’s getting harder by the day. Further exacerbating my problem is that I want not just to read the facts behind a topic/news bit, but also to read the opinions and participate in the many insightful discussions that branch from it.
So what’s the solution to my problem? Lunesta? Maybe. The next-generation of aggregators? Bingo.
One big trend that we are starting to see develop, and that I believe will be a major area of focus in the years to come, is information filtering and aggregation. Search engines like Google and centralized information sources like ESPN and Wikipedia allow me to pull in specific pieces of information when I am actively seeking them. However, their limitation stems from the fact that most of the information I absorb on a daily basis is new and could not have been searched for. In other words, if I didn’t know the information existed, how could I have searched for it? Instead, I must rely on my set of trusted sources to push this new information to me. Information aggregators, either human-derived (digg, reddit, del.icio.us) or algorithmic (memeorandum, blogniscient, Google News), are a step in the right direction. But aggregators have a long way to go before they are truly accurate and encompassing tools for information.
Anyways, it’s now 4:30AM and I’m basically just blabbing. Aggregators are an area that I’m becoming increasingly interested in myself and I have some of my own ideas brewing in my head about what the perfect aggregator would be and how it would work. I’ll be thinking and blogging about it in the coming weeks.
For some more discussions on aggregators, check out a blog post on memeorandum I was reading earlier that I found insightful:
http://mashable.com/2005/11/08/hacking-memeorandum-more-proof-that-algorithms-dont-work/
Be sure to read the comments thread.

