Archive for the ‘search’ tag
Google Trends predicted the Iowa Caucus?
Back in July 2006, I wrote a post titled Predicting the Future With Google Trends in which I described how Google Trends, which measures the relative search volume of keywords on Google, could have been used to reveal the relative popularity of real-world phenomena, such as who will win American Idol.
Out of curiosity, I pulled up 30-day trailing data for both the Republican and Democratic presidential candidates and compared it to the actual Iowa Caucus results yesterday.




Is it just me or was Google Trends a remarkably accurate predictor of yesterday’s result? I tried restricting the trend data to Iowa only; however, it seems there isn’t quite enough data to draw any meaningful conclusion.
Is Knowledge Overrated?
Last week I was channel surfing while eating dinner in front of the TV. After failing to find anything interesting, I gave up and settled on Jeopardy. I used to watch Jeopardy sometimes as a kid and I recall being amazed at how smart the better contestants were. Not only did they have vast amounts of knowledge, but more importantly could recall it in just one or two seconds. While watching the show last week, I found myself equally amazed as before. One particular contestant fired off answers to 15th century European history questions (or more correctly “questions for answers”) so quickly I thought he must have just finished authoring a book on the subject.
By the time the Final Jeopardy round came, I was done with dinner and in front of my laptop. The question (err answer, whatever..) was given and (of course) I had not the faintest clue. For the heck of it, I typed some relevant keywords into Google and bingo! I solved Final Jeopardy. It turned out that two of the three contestants also solved it correctly. Sure, they couldn’t use Google to find the answer, but at the end of the day, in the real world, I would be just as effective as these two contestants. This realization got me thinking about the value of knowledge.
What’s the purpose of knowledge? My answer would be that knowledge allows one to perform a task more efficiently. Any task. The task could be conquering the latest shoot-em-up video game, cooking dinner, or solving a complex mathematics proof. Each of these tasks can be performed faster if the individual performing it has relevant knowledge in the respective domain. The problem, though, is that for knowledge to be useful, it’s not sufficient to have gained it once. You must also be able to recall it both accurately and in a timely manner. Without speed and accuracy in recollection, having knowledge is useless.
How useless? The test is this: Your brain or an electronic/digital means, which is faster? Which will give me the info I need faster, more accurately, and more consistently? Nine times out of ten, my answer to this question is the latter. Honestly, I wonder why I even bother remembering anything. Phone numbers? Got my cell. Favorite restaurants? Got Yelp. CS stuff? Google works just fine. Forgot who my friends are? Got my AIM buddy list, MySpace and Facebook. How to get home from work? Got a navigation system. How to spell my name? Outlook auto-corrects it. Ok, I think you get my point…
The thing you need to remember is that all the electronic information sources I just mentioned came very recently. You think finding this information is easy now? Trust me, it’ll get easier. I’m just waiting for Google to announce a search plugin for your brain. Sounds ridiculous but is it really that crazy to imagine such a device might be available in my lifetime?
Now, you might argue that humans are capable creatures because of our intelligence, not simply our knowledge. Intelligence implies not just semantic knowledge but the ability to combine building-blocks of knowledge into composite forms of knowledge and, ultimately, to innovate. Innovation, after all, is a hallmark of human civilization. Innovation implies a certain higher level of thought which only a human can perform. You could say that freeing our minds of the burden of knowledge management will allow our mind to focus on innovation and other forms of higher-level thought.
But this implies that computers cannot perform high-level thought. Computers can be given “intelligence”. It is very common today to program computers to make sophisticated decisions based on input data. Due to complexity or other limitations, many such decisions were once thought impossible for a computer to make. A classic example is chess. A couple hundred years ago, the thought of a chess-playing machine was just a big joke. Ten years ago, IBM’s Deep Blue computer beat the greatest chess player of our time.
But can computers innovate? If you look up the word “innovation”, the word “new” is mentioned repeatedly: new ideas, new dimensions, something new, etc. Conventional wisdom has it that computers cannot think “outside of the box”. While computers can perform sophisticated logic and are able to “learn” patterns, they can’t really form new thought. A recent example comes from an article I was reading about Monitor110. They have developed some proprietary technology that allows their software to scour niche information sources on the Web (blogs, message boards etc.) and pick out potentially market-moving news before it hits the mainstream. So, their software can pick out the bits of signal from the noise, but it cannot determine if and how to act on information to bring financial reward and, moreover, outperform the rest of the market (the common term is “generate alpha” in the alternative-investment world). The formation of a unique investing strategy can only be performed by the human investor staring at the computer screen. The investor may utilize computer-based modeling tools to aid in development of the strategy, but the high-level strategy is still up to him to devise.
Will computers one day be able to innovate? Maybe. If and when scientists are better able to model the human brain, it may turn out that deep-down, it is, in fact, a deterministic system. If that is the case, it may be possible to model the human brain electronically.
Until that day, though, I do think, on the basis of the test I put forth earlier, that much of the knowledge in people’s brains is truly useless. Instead of just giving students knowledge, it is more important to teach them how to efficiently find knowledge when the situation demands it.
“Give a man a piece of knowledge and you feed him for a day. Teach a man to locate knowledge and you feed him for a lifetime.” Yeah I know..I’m a dork.
Ok, it’s way past my bedtime again. Maybe I’ll continue this thought in a later post…
The #1 lesson the Web has taught me
I’ve gained incredible amounts of knowledge from the Web. Being a self-described “knowledge whore”, I’ve spent countless hours on sites like Wikipedia and howstuffworks.com as well as other sources of knowledge like blogs, newsgroups, and forums. Almost without fail, though, every time I find myself digging deeper into a topic, I quickly realize that the topic is WAY more complex than I had imagined it to be. Try it sometime. Pick a topic and Google it. For even the most obscure topic, the sheer vastness of relevant information on the Web is mind-boggling.
I realized today that even though the Web has given me volumes of knowledge and wisdom, above all, the Web has taught me this:
You don’t know what you don’t know.
The Web bombards us with this lesson because it’s so damned efficient at information retrieval. In minutes we can open gateways to knowledge sources that might have taken hours or days before. More importantly, though, the highly-linked nature of the Web supports a breadth-first search pattern of knowledge gathering. You might be reading about sub-topic A and in the middle of a paragraph follow a link to sub-topic B and so on. I’m sure you’ve done this plenty of times. While your initial intent may have been to perform a linear search to ascertain information on a specific topic, before you know it, you’ve spent an hour reading about 10 different sub-topics. In one hour, you’ve gotten a broad, but relatively shallow understanding of several sub-topics.
If you had been performing research through offline methods, you would have found an information source (a book, news article, thesis, etc.) on a single sub-topic and digested it thoroughly before continuing on to the next source. This pattern of information gathering is more similar to depth-first search. Using this method, in the same time as above, you may gain relatively complete knowledge of 2 sub-topics, but not even realize the existence of the 8 other sub-topics that you would have encountered if you had followed a breadth-first search pattern. In other words, you’ll know more about less. With the Web, you’ll know less about more. The curse of the latter is that you will have learned of the existence of many more topics, which only drives home how little knowledge you have.
Anyway, it’s late and it’s likely that I’m just rambling, so I’ll cut this post off now. In conclusion, even though the Web has given me tons of knowledge, the most valuable knowledge it’s given me is the realization of how little knowledge I actually have. My guess is that by the time I’m an elderly man, instead of feeling old and wise, I’m going to feel old and dumb. Very humbling…
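To make the breadth-first versus depth-first contrast concrete, here is a rough Python sketch. Everything in it is made up for illustration: the link graph, the page names, and the budget of four page reads standing in for "one hour". It only shows that the same budget reaches different pages depending on the order in which links are followed.

```python
from collections import deque

# Hypothetical link graph: a topic, three sub-topics, and their details.
LINKS = {
    "topic": ["A", "B", "C"],
    "A": ["A1", "A2"],
    "B": ["B1", "B2"],
    "C": ["C1", "C2"],
    "A1": [], "A2": [], "B1": [], "B2": [], "C1": [], "C2": [],
}

def explore(start, budget, breadth_first):
    """Read at most `budget` pages and return them in reading order."""
    frontier = deque([start])
    seen = {start}
    visited = []
    while frontier and len(visited) < budget:
        # Breadth-first takes the oldest pending link, depth-first the newest.
        page = frontier.popleft() if breadth_first else frontier.pop()
        visited.append(page)
        links = LINKS[page]
        # For depth-first, push in reverse so the first link is followed first.
        for nxt in (links if breadth_first else reversed(links)):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return visited

web_style = explore("topic", budget=4, breadth_first=True)
book_style = explore("topic", budget=4, breadth_first=False)
print(web_style)   # touches every sub-topic, none in depth
print(book_style)  # finishes sub-topic A, never learns B or C exist
```

With this toy graph, the breadth-first reader ends up aware of all three sub-topics, while the depth-first reader knows sub-topic A thoroughly and nothing else.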
SMS rate-increases spell trouble for SMS-based mobile services
Back in December, I found a not-so-nice surprise in my Sprint cellular bill. Sprint had quietly raised the cost of an SMS text message from 10 cents to 15 cents per message. I complained to customer service that this was a breach of my original contract. However, after a couple of e-mail exchanges, it was clear that I would be stuck with the rate increase. I later found out that Cingular had also raised their SMS rate to 15 cents. It is also expected that both Verizon and T-Mobile will follow with increases of their own in order to keep their ARPU (Average Revenue Per User) competitive.
So how does this rate increase affect the typical cellular customer? Well, the typical American customer is not affected because he/she does not text much, if at all. Of course, this rate increase will only serve to discourage these users from embracing SMS. For the millions of customers who do use SMS to communicate with friends and colleagues, my hunch is that the 5 cent increase will not result in a significant change in their usage. Let’s face it, most cellular customers will hardly take notice of the extra couple of dollars on their monthly statement. The carriers know this and that’s why they don’t seem to be afraid of customer backlash. The unsympathetic reply I got from Sprint’s customer service supports this.
The real losers of SMS rate increases are companies who provide SMS-based services. One such company is 4INFO. 4INFO provides consumers with easy access to information like stock quotes, sports scores, flight status, and the weather. They utilize a simple, “natural language” query interface (e.g. “weather 94304” or “49ers nfl”). Very useful. One of the primary challenges 4INFO faces is the lack of SMS adoption here in the US. In foreign markets, SMS is cheap (or even free) compared to voice airtime so it is very popular, even among older cellular customers. 4INFO is quick to emphasize in their marketing that the service is free. While it is true that 4INFO itself does not charge for their service, the cellular carrier is charging for SMS access. With SMS costing 15 cents per message, a simple roundtrip to 4INFO costs 30 cents. Out to dinner with the girlfriend but want to check the football score every half hour? 4INFO works perfectly for this. The problem is that by the time the game is over, you’ll have paid a couple bucks in SMS fees. That’s pretty expensive. There’s a good chance I would utilize this service if I were in a pinch, but I couldn’t afford this luxury on a daily basis.
What’s the alternative to SMS? Internet access on your mobile device. Carriers have successfully been upselling 2.5G and 3G data plans for the past couple of years. Now we’re seeing devices which support Wi-Fi, and WiMAX isn’t that far away. Combined with the trend in mobile devices of offering ever-richer display screens and sophisticated Internet software applications, we’re slowly going to see a convergence between the way we access information on our PCs and the way we access the same information on our mobile devices. As Internet access becomes ubiquitous on mobile devices, services like SMS will quickly become extinct. Need to send a quick message to another person? Use IM or e-mail. Need to get alerts? RSS. Etc, etc.
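The arithmetic behind "a couple bucks" is easy to check. The sketch below uses the 15 cent rate and the two-message roundtrip from above; the three-hour game length (six checks at one per half hour) is my own assumption, not a measured figure.

```python
RATE_CENTS = 15           # per-message rate after the increase
MESSAGES_PER_LOOKUP = 2   # one query sent, one reply received

def lookup_cost_cents(lookups, rate_cents=RATE_CENTS):
    """Total SMS cost in cents for a number of query/reply roundtrips."""
    return lookups * MESSAGES_PER_LOOKUP * rate_cents

# Assumption: a roughly three-hour game checked every half hour -> 6 lookups.
game_cost = lookup_cost_cents(6)
game_cost_old_rate = lookup_cost_cents(6, rate_cents=10)

print(f"new rate: ${game_cost / 100:.2f}")
print(f"old rate: ${game_cost_old_rate / 100:.2f}")
```

Under those assumptions, the same evening of score checking goes from $1.20 at the old rate to $1.80 at the new one.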
For now, even in light of Sprint’s rate increase, I doubt my SMS behavior will be altered. I will, however, take solace in the expectation that SMS will soon be a distant memory.
Note: I do think there is a future for companies like 4INFO. To be able to understand and satisfy short-hand queries like “UAL SFO JFK” (results in a timetable for United flights from SFO to JFK) is very valuable in the mobile context where keypads and displays are tiny. My expectation, however, is that they will find ultimate success in the future piggybacking off of Internet technologies rather than SMS. Hopefully these companies will manage to stay afloat until then.
Wikipedia makes computers smarter
Researchers Use Wikipedia To Make Computers Smarter
The idea here is to be able to cluster keywords based on their meanings. The example given in the article: say you’re trying to block those annoying vitamin supplement spam emails. You might set your client to flag emails containing the word “vitamin” as spam. However, let’s say an email comes in with the term “B12” in it. A human would easily recognize that there is a strong possibility the e-mail is referring to vitamin B12, but the spam filter, having no instructions for “B12” and no ability to correlate “B12” with “vitamin”, would allow the e-mail through.
This type of clustering is not new. It has been done many times in the past, including on the Web. However, these technologies have needed to process millions if not billions of web pages to be able to perform such keyword clustering across a wide range of topics. For example, let’s say you have a crawler which has processed 500 million web pages. There is a good chance that in those 500 million web pages, the terms “vitamin” and “B12” were found together (most likely adjacent to each other, “vitamin B12”). Examples of such pages would be vitamin supplement merchants or health information websites. Having observed co-occurrences of these two terms, the crawler would develop a correlation factor. Maybe 70% of the time the term “B12” was found, the term “vitamin” also occurred. (The other 30% of the time maybe B12 referred to an apartment number, the name of a rocket, who knows..) So, a spam filter which can perform this kind of analysis would be able to reasonably infer that an e-mail with the term “B12” is likely to be related to the term “vitamin” and thus should be flagged as spam.
Again, this type of analysis would only be possible after processing vast amounts of training data – billions of web pages probably. And since web content is uncontrolled, there will be a higher level of chaos in the recorded correlations (e.g. let’s say one of the web pages processed is the key to a crossword puzzle: two completely unrelated terms like “Shakespeare” and “Brett Favre” may be found together).
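Here is a toy Python sketch of the co-occurrence idea described above. It is not the method from the article; the five "documents", the 0.5 threshold and the function names are all invented for illustration. It just computes how often “vitamin” appears in documents that mention “B12” and uses that rate to extend a one-word spam rule.

```python
# Toy "crawl": each document reduced to the set of terms it contains.
# These documents are invented for illustration.
DOCS = [
    {"vitamin", "b12", "supplement"},
    {"vitamin", "b12", "health"},
    {"vitamin", "c", "orange"},
    {"apartment", "b12", "rent"},
    {"vitamin", "b12", "energy"},
]

def cooccurrence_rate(docs, term, other):
    """Fraction of documents containing `term` that also contain `other`."""
    with_term = [d for d in docs if term in d]
    if not with_term:
        return 0.0
    return sum(1 for d in with_term if other in d) / len(with_term)

def looks_like_vitamin_spam(email_terms, docs, threshold=0.5):
    """Flag an e-mail if any term is 'vitamin' or strongly co-occurs with it."""
    return any(
        t == "vitamin" or cooccurrence_rate(docs, t, "vitamin") >= threshold
        for t in email_terms
    )

rate = cooccurrence_rate(DOCS, "b12", "vitamin")
print(rate)  # 3 of the 4 documents mentioning b12 also mention vitamin
print(looks_like_vitamin_spam({"cheap", "b12"}, DOCS))
print(looks_like_vitamin_spam({"apartment", "rent"}, DOCS))
```

Even in this tiny example the noise problem shows up: one of the four “b12” documents is about an apartment, so the correlation is strong but not perfect.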
Using Wikipedia is essentially a massive shortcut. Wikipedia is controlled, it’s 100% high-quality knowledge and is very dense with keywords (there are probably better industry terms for these concepts but I don’t know them). Also, Wikipedia has a beautiful internal link network – articles are connected to one another in many ways. By using Wikipedia as a training set, the amount of computational effort is diminished by orders of magnitude. There is no wasted time and no overlap. Every Wikipedia page is (or trends toward) comprehensive knowledge for a unique topic.
Artificial intelligence, of any kind, relies on humans to train it. As I said, one form of training is the billions of web pages that humans have created. Other ones are human-intensive efforts like the MIT OpenMind CommonSense Project. In a sense, Wikipedia is the richest training set yet. Even though it was created for the purpose of helping humans, it will help computers (help us) as well. As mentioned in the article, the uses of this are many: search, spam detection, natural language processing, etc. Very, very exciting.
On a side note: Bayesian spam filters, or so-called “learning” spam filters, which are now very common, operate on a similar principle. Their training set is generally created by each user. As you receive e-mails and mark them as legitimate or spam, the Bayesian spam filter learns which terms lead to a high probability of spam. These probabilities are refined over time as the user corrects false positives and false negatives. While these spam filters are generally very effective, they have no ability to deal with e-mails which contain terms they have not seen before. These filters have no knowledge of what terms mean; they just store simple probabilities for terms they have seen before.
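For the curious, a stripped-down Bayesian filter might look something like the Python sketch below. It is a simplified naive Bayes with add-one smoothing, not the code of any real mail client, and the class name and training messages are made up. The point to notice is the `continue` branch: a term the filter has never seen contributes nothing, which is exactly the blind spot described above.

```python
import math
from collections import Counter

class TinyBayesFilter:
    """Simplified naive Bayes spam filter (illustrative only)."""

    def __init__(self):
        self.term_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Record one user-labelled message; label is 'spam' or 'ham'."""
        self.message_counts[label] += 1
        # Count each term once per message.
        self.term_counts[label].update(set(text.lower().split()))

    def spam_probability(self, text):
        n_spam = self.message_counts["spam"]
        n_ham = self.message_counts["ham"]
        # Prior log-odds of spam, with add-one smoothing.
        log_odds = math.log((n_spam + 1) / (n_ham + 1))
        for term in set(text.lower().split()):
            s = self.term_counts["spam"][term]
            h = self.term_counts["ham"][term]
            if s + h == 0:
                continue  # unseen term: the filter knows nothing about it
            p_term_given_spam = (s + 1) / (n_spam + 2)
            p_term_given_ham = (h + 1) / (n_ham + 2)
            log_odds += math.log(p_term_given_spam / p_term_given_ham)
        return 1 / (1 + math.exp(-log_odds))

spam_filter = TinyBayesFilter()
spam_filter.train("cheap vitamin pills", "spam")
spam_filter.train("vitamin sale today", "spam")
spam_filter.train("lunch meeting today", "ham")
spam_filter.train("project meeting notes", "ham")

print(spam_filter.spam_probability("vitamin pills"))   # well above 0.5
print(spam_filter.spam_probability("meeting notes"))   # well below 0.5
print(spam_filter.spam_probability("b12"))             # no evidence either way
```

Trained only on “vitamin”, this filter has no opinion about “b12” at all, whereas a filter with Wikipedia-style knowledge of how terms relate could connect the two.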
A question for you all
I’ve been doing some Kurzweil-inspired thinking lately and I have a question for you all:
What percentage of the knowledge in your brain can be found on the Web? In other words, let’s say you were able to express all the knowledge in your brain as statements of fact. What percentage of those statements would you be able to find on the Web?
Follow up question:
Think about that percent of knowledge that cannot be found on the Web. What kind of knowledge is it? What does it pertain to?
I have my own set of answers to these questions which I will be sharing in an upcoming mini-essay I’m writing, but I was hoping that some of you might post a comment with your own answers to these questions.
Thanks for your help!
How search engines build advertising space
In my prior post, I briefly noted how search-engines/aggregators profit off of content that publishers create for them. I shared my thoughts on this topic with a couple friends today and while nothing discussed was enlightening, I figured I might as well throw up a quick post. Since Google is the largest search engine, I’ll use them as an example.
Google is so profitable because they control lots of Web advertising space. They build control of advertising space 3 ways (listed in order of descending profitability):
Inherit – Google inherits ad space from every person who has ever published any sort of Web content and made it accessible. The easy way to think about it is that the more content there is on the Web, the more possible search result pages exist. Each search result page has advertising space. Cost: developing and operating the search engine, which is expensive. But on a per-unit (ad space) basis, the cost is a tiny fraction of a penny.
Create – Google creates ad space by building its own applications and sticking AdSense on them. The goal is to build applications/services which generate a disproportionately large number of page views and whose pages are likely to contain content that advertisers would be interested in. E.g. Search, GMail, Maps (for the purpose of Local info and advertising), Froogle. Cost: Same logic as above but the per unit cost is not quite as small because page view volume is not nearly as high compared to Search. Also AdSense ad space is not as lucrative as AdWords because it is less targeted.
Buy – Offering AdSense to publishers. Cost: % of ad revenue. Much less profitable compared to the prior two methods.
While these three methods have varying profit margins, they all share one thing in common: making ad space out of content that other people create. It’s the exact opposite of traditional media companies. It’s such a great business to be in and that’s why entrepreneurs get excited about the notion of structured data on the Web because it means they can build great aggregators and get rich. =)
Quick blabber about Google:
Google began life as a search company. Google built a fantastic rapport with its users by offering a tremendously useful, free service. Since then, they have evolved into an advertising company. When Google added advertising, they were able to convince users that the ads actually helped the user because the ads were contextual. All advertising is contextual to some extent, but people aren’t thanking TV networks for airing pizza commercials during football games. The commercials help you figure out what to eat while watching the game… right? Google has become the only advertising company in the world that people love. It’s both brilliant and fascinating.
Google Base: the process of unifying data on the Internet
Back in 2000, in an article titled “Not Your Father’s Internet”, Bill Gates wrote
In many respects, today’s Internet actually mirrors the old mainframe model, with the browser playing the role of “dumb terminal.” All the information you want is located in centralized databases, and served up a page at a time (from a single Web site at a time) to individual users. Web pages are simply an HTML “picture” of the data you need, not the underlying data itself.
What Gates is describing here is the fundamental difference between the Internet infrastructure which stores and exchanges raw information and the Web whose purpose is to convey this information to humans.
Currently, for any type of information, there are often multiple sites each with their own database containing information of that type. Let’s take a simple type of information like classifieds, specifically auto classifieds. There are several sites on the Web that have auto classifieds listings: AutoTrader.com, Craigslist, Cars.com, and many others. Now, if you need to search these classifieds to find a 2001 Honda Civic in your area, you will need to go to each site and perform a search. Horrible.
To be more efficient, you could try a classifieds meta-search site like Oodle, which will automate the process of searching several classifieds sites for you and return a single aggregated result. Sure, this is a time saver, but there are inherent limitations to meta-search engines. Meta-engines do not, of course, have access to AutoTrader’s database or Cars.com’s database; all they can do is crawl and scrape these sites, which is an imperfect process. No matter how much intelligence you can build into the scraper, it will never provide a superbly accurate, comprehensive, or up-to-date set of results. There are other limitations, like only being able to search the common denominator of information (if Cars.com differentiates between transmission type but AutoTrader.com does not, then Oodle can’t offer transmission-type search refinement).
This same auto classifieds example can be applied to many types of information: product data, job listings, news articles, etc. Is it a coincidence that these are the same information types found on Google Base? Of course not.
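The "common denominator" limitation boils down to a set intersection. In the Python sketch below, the filter names attributed to each site are hypothetical (I have not checked what each site actually exposes); only the transmission example comes from the paragraph above.

```python
# Hypothetical search filters per site; the field names are made up.
SITE_FILTERS = {
    "Cars.com": {"make", "model", "year", "price", "transmission"},
    "AutoTrader.com": {"make", "model", "year", "price"},
    "Craigslist": {"make", "model", "price"},
}

def common_filters(site_filters):
    """Filters a meta-search engine can offer across every site it covers."""
    if not site_filters:
        return set()
    return set.intersection(*site_filters.values())

shared = common_filters(SITE_FILTERS)
print(sorted(shared))  # only the filters that every site supports
```

Every site added to the aggregator can only shrink this set, which is why a meta-search engine tends to offer fewer refinements than any single source site.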
Ultimately what we humans want is the perfect set of information matching a given search. Any search engine, if limited to searching humanly readable documents (e.g. HTML, PDF, etc.) will never be able to provide perfect information. A better search engine will have access to raw, unadulterated, structured information.
Google Base is simply an attempt to unify the data found in the databases of the world. It’s not sexy, but raw information isn’t sexy. While you and I can add our own data to Google Base, the real power is in the bulk data upload. Imagine if the major classifieds sites continuously uploaded their data to Google Base. Google Base would then become the ultimate classified search. Now, of course, that’s not going to happen so easily because a site like Craigslist, whose value comes entirely from the information in its database (some would argue that Craigslist has other significant value-adds like its user community and simple interface), would effectively be putting itself in the fast lane towards extinction.
However, if eBay were to upload auction listings to Google Base, that would be great for eBay because it would allow Google to more effectively search eBay auction listings. Unlike in Craigslist’s case, it would not threaten eBay’s existence. That’s because for eBay, the auction data is just one part of the puzzle in the auction process. eBay still owns the surrounding processes, like bidding and payment, which are necessary for the auction data to be significant. I doubt Google really has any desire to get into the auction vertical. Google just wants to organize information, not build verticals around this information.
AIM Bots – Don’t delete them!
I’m sure most of you AIM users noticed that when you logged in today, you had a couple new items in your buddy list under the group “AIM Bots”. I noticed it as well and, like you, my first instinct was “wtf?”. I was just about to delete them from my buddy list when my curiosity got the best of me and I opened up a chat window with MovieFone and said ‘hello’. I was greeted with:
its Rishi: hello
MovieFone: Hey there. Just ask type a film name, actor or director any time and I’ll tell you what’s playing.
I proceeded to play around with it, typing in a sequence like ‘Jarhead’, ‘94304’ and got local movie showings for the movie Jarhead. The text interface is very easy to follow and response time is minimal. If you’re an AIM user (you don’t need the AIM client, third-party clients work fine) I highly recommend you check it out next time you’re looking for movie showings. You’ll save a lot of time versus having to load up the movie website of your choice and search for the same information.
They also launched a shopping bot called ShoppingBuddy. I was somewhat less impressed by it but still it is an interesting attempt at a text-based shopping search.
The most intriguing aspect is that anyone can create their own bot relatively easily using AIM chat APIs, which are available in many languages. Here’s an example of a simple Perl-based Amazon AIM Bot. This rudimentary example only accepts an ASIN as input, queries Amazon via their REST interface, and spits out product details. Given the fact that most of Amazon’s db is accessible via REST, this simple bot can easily be expanded upon.
For information like sports scores, stock quotes, traffic reports, etc. where it is straightforward to understand what information the user is asking for based on their input (‘94304’ into a weather bot, ‘Yankees’ into a sports score bot, ‘GOOG’ into a stock quote bot), I think such chat bots are an efficient and convenient medium, more so than the Web.
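To show why this kind of input is easy to interpret, here is a small Python sketch that routes a one-word message by its shape alone. It folds the three example bots into one hypothetical bot; the handler functions are stubs, the team list is a made-up sample, and nothing here touches the real AIM API.

```python
import re

# Stub handlers: a real bot would call a weather, sports or stock service here.
def weather(zip_code):
    return f"weather lookup for {zip_code}"

def sports(team):
    return f"score lookup for {team}"

def stock(ticker):
    return f"stock quote lookup for {ticker}"

TEAMS = {"yankees", "49ers"}  # made-up sample list

def route(message):
    """Pick a handler based only on what the input looks like."""
    text = message.strip()
    if re.fullmatch(r"\d{5}", text):        # five digits: treat as a ZIP code
        return weather(text)
    if text.lower() in TEAMS:               # known team name
        return sports(text)
    if re.fullmatch(r"[A-Z]{1,5}", text):   # short all-caps word: ticker symbol
        return stock(text)
    return "Sorry, I did not understand that."

print(route("94304"))
print(route("Yankees"))
print(route("GOOG"))
```

Three rules cover all three examples, which is the appeal: the user never has to say what kind of question is being asked.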
Side Note:
One thing I have yet to try is using these AIM bots via my cell phone (all the carriers have some way to access the AIM service even if it’s just a clumsy SMS-based interface). It seems like this is an interesting alternative to “mobile search” services like 4INFO, a startup which provides information access via SMS. While services like these supposedly employ AI techniques to understand what the user is asking for, these companies plan to drive revenue through advertising which will be delivered along with the reply message. As someone who would get ADBLOCK as my personalized license plate if it were available, I could imagine myself preferring to use advertising-free sources of information even if they were less sophisticated. Of course there are other issues facing mobile search, like the fact that most carriers still charge for SMS. This fact alone has kept me using WAP as my source of mobile information. (Yes I know most carriers also charge for data access but Verizon + third-party WAP gateway = free WAP for me!)

