It's Rishi

Thought streams on the future of tech and media

Archive for the ‘structured data’ tag

The Power of Structured Tweets

with one comment

After about a year of dismissing Twitter as a fad, I realized that it seemed to be gaining more and more momentum. I am receiving more “X is now following you..” emails than ever before and Twitter is finding its way into more of my conversations — both online and offline. While I appreciate the value of Twitter as a communication medium, I recently found a Twitter-based service named StockTwits that revolutionized how I think about Twitter.

StockTwits is a community of people who follow the equity markets and exchange thoughts, via Twitter, about both single names and overall market movements. On StockTwits.com, any user can browse the latest tweets from the community members, either as a whole or on a per-ticker page such as the one for AAPL.

Now, this concept in and of itself is interesting but not really thought-provoking. However, what I found sort of fascinating is the mechanics of StockTwits. StockTwits users include $[Ticker] in their tweets to let StockTwits know what ticker they are microblogging about. So, for example, “long $RIMM short $AAPL has been a heck of a trade in 09” indicates to StockTwits that the tweet is relevant to RIMM and AAPL. Because all tweets follow this convention, it is easy for StockTwits to organize the massive number of tweets into channels. In this case, the channel is a single equity name.
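To make the mechanics concrete, here is a minimal sketch of how a service could sort tweets into per-ticker channels. I have no idea how StockTwits actually does it; the regular expression (a "$" followed by one to five capital letters), the function names and the sample tweets other than the first are my own assumptions.

```python
import re
from collections import defaultdict

# Assumed convention: "$" followed by 1-5 uppercase letters, e.g. $AAPL.
# Dollar amounts such as "$12" do not match because digits are excluded.
CASHTAG = re.compile(r"\$([A-Z]{1,5})\b")


def extract_tickers(tweet):
    """Return the tickers tagged in a tweet, in order, without duplicates."""
    tickers = []
    for ticker in CASHTAG.findall(tweet):
        if ticker not in tickers:
            tickers.append(ticker)
    return tickers


def build_channels(tweets):
    """Group tweets into channels keyed by ticker symbol."""
    channels = defaultdict(list)
    for tweet in tweets:
        for ticker in extract_tickers(tweet):
            channels[ticker].append(tweet)
    return dict(channels)


# Hypothetical sample data; only the first tweet comes from the post.
tweets = [
    "long $RIMM short $AAPL has been a heck of a trade in 09",
    "$AAPL looking strong ahead of earnings",
    "lunch cost me $12 today",
]
channels = build_channels(tweets)
```

The first tweet lands in both the RIMM and AAPL channels, and the lunch tweet lands in none. The whole thing only works because the tag format is agreed on ahead of time.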

Let’s say you started a baseball twittering community. You might create conventions like $[LastName][Jersey#] or $[FirstName][LastName] or whatever. In fact, by applying a bit of intelligence when processing tweets, the system can probably be quite flexible and still correctly resolve player names. The bottom line is that as long as users are OK with including these inline tags in their tweets, systems can then extract meaning from them. It’s sort of like tagging a post on one’s blog, except that everyone agrees to use the same set of tags.

In today’s blogosphere, tags are arbitrary. That’s the way it’s always been and this behavior is unlikely to change. The result is that the blogosphere is difficult to aggregate. The only way to create a structure out of related blog posts is through links and trackbacks. While this kind of works (Techmeme is certainly a shining example), there are tons and tons of unlinked posts about the same topic every day in the blogosphere that, while related, cannot be aggregated.

In contrast, I think there’s a real chance for these twittering tag domains (for lack of a better name for this) to catch on. Tweets don’t really live anywhere per se. Blog posts do…they live on your blog (a web page). Thus, there’s a tendency for people to want to express their blog posts in their own individual way. That means categorizing and tagging the post in their own preferred way. However, for Twitter users to join a conversation on a specific topic, they will need to tag their tweets with a common folksonomy, like we see with StockTwits. Without this concept, a community like StockTwits would be utter chaos.

There’s definitely something interesting about structuring conversations in Twitter. Both for the purpose of making richer experiences for those involved in the same conversation and for the purpose of search/aggregation.

Written by Rishi

January 29th, 2009 at 2:32 am

Posted in Uncategorized

Wikipedia makes computers smarter

without comments

Researchers Use Wikipedia To Make Computers Smarter

The idea here is to cluster keywords based on their meanings. The example given in the article: say you’re trying to block those annoying vitamin supplement spam emails. You might set your client to flag emails containing the word “vitamin” as spam. However, let’s say an email comes in with the term “B12” in it. A human would easily recognize that there is a strong possibility the e-mail is referring to the vitamin B12, but the spam filter, having no instructions for B12 nor the ability to correlate “B12” to “vitamin”, would allow the e-mail through.

This type of clustering is not new. It has been done many times in the past, including on the Web. However, these technologies have needed to process millions if not billions of web pages to be able to perform such keyword clustering across a wide range of topics. For example, let’s say you have a crawler which has processed 500 million web pages. There is a good chance that in those 500 million web pages, the terms “vitamin” and “B12” were found together (most likely adjacent to each other, “vitamin B12”). Examples of such pages would be vitamin supplement merchants or health information websites. Having observed co-occurrences of these two terms, the crawler would develop a correlation factor. Maybe 70% of the time the term “B12” was found, the term “vitamin” also occurred (the other 30% of the time maybe B12 referred to an apartment number, the name of a rocket, who knows). So, a spam filter which can perform this kind of analysis would be able to reasonably infer that an e-mail with the term “B12” is likely to be related to the term “vitamin” and thus should be flagged as spam.
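As a rough sketch of what such a correlation factor could look like, the snippet below counts, among the pages that mention one term, the fraction that also mention another. This is just the simplest statistic I can think of, not the method from the article, and the four "pages" are made-up stand-ins for a real crawl.

```python
import re


def tokens(text):
    """Lowercase a page and split it into a set of alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def cooccurrence_rate(pages, term, other):
    """Fraction of the pages containing `term` that also contain `other`."""
    term, other = term.lower(), other.lower()
    containing = [t for t in map(tokens, pages) if term in t]
    if not containing:
        return 0.0  # the term was never observed, so nothing can be inferred
    both = sum(1 for t in containing if other in t)
    return both / len(containing)


# Hypothetical mini-crawl standing in for hundreds of millions of pages.
pages = [
    "Buy vitamin B12 supplements online",
    "Vitamin B12 deficiency symptoms",
    "Apartment B12 for rent downtown",
    "Vitamin C and your immune system",
]
rate = cooccurrence_rate(pages, "B12", "vitamin")
```

Here two of the three pages that mention “B12” also mention “vitamin”, so the rate is 2/3. With only four pages that number means nothing, which is exactly why this approach needs such an enormous crawl before the figures become trustworthy.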

Again, this type of analysis would only be possible after processing vast amounts of training data – billions of web pages probably. And since web content is uncontrolled, there will be a higher level of chaos in the recorded correlations (e.g. let’s say one of the web pages processed is the key to a crossword puzzle: two completely unrelated terms like “Shakespeare” and “Brett Favre” may be found together).

Using Wikipedia is essentially a massive shortcut. Wikipedia is controlled, it’s 100% high-quality knowledge and is very dense with keywords (there are probably better industry terms for these concepts but I don’t know them). Also, Wikipedia has a beautiful internal link network – articles are connected to one another in many ways. By using Wikipedia as a training set, the amount of computational effort is diminished by orders of magnitude. There is no wasted time and no overlap. Every Wikipedia page is (or trends toward) comprehensive knowledge for a unique topic.

Artificial intelligence, of any kind, relies on humans to train it. As I said, one form of training is the billions of web pages that humans have created. Other ones are human-intensive efforts like the MIT OpenMind CommonSense Project. In a sense, Wikipedia is the richest training set yet. Even though it was created for the purpose of helping humans, it will help computers (help us) as well. As mentioned in the article, the uses of this are many: search, spam detection, natural language processing, etc. Very, very exciting.

On a side note: Bayesian spam filters, or so-called “learning” spam filters, which are now very common, operate on a similar principle. Their training set is generally created by each user. As you receive e-mails and mark them as legitimate or spam, the Bayesian spam filter learns which terms lead to a high probability of spam. These probabilities are refined over time as the user corrects false positives and false negatives. While these spam filters are generally very effective, they have no ability to deal with e-mails containing terms they have not seen before. These filters have no knowledge of what terms mean; they just store simple probabilities for terms they have seen before.
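Here is a deliberately tiny sketch of that idea, assuming a naive Bayes model with add-one smoothing where each known word in a message nudges the spam odds up or down. Real filters are more sophisticated; the class name and the training messages are invented for illustration.

```python
import math
from collections import Counter


class TinyBayesFilter:
    """Simplified per-user learning filter (illustrative, not production)."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Record which words appeared in a message the user labeled."""
        self.message_counts[label] += 1
        self.word_counts[label].update(set(text.lower().split()))

    def known(self, word):
        return word in self.word_counts["spam"] or word in self.word_counts["ham"]

    def spam_probability(self, text):
        # Start from the smoothed prior odds of spam versus legitimate mail.
        log_odds = math.log(
            (self.message_counts["spam"] + 1) / (self.message_counts["ham"] + 1)
        )
        for word in set(text.lower().split()):
            if not self.known(word):
                continue  # a never-seen term contributes no evidence at all
            p_spam = (self.word_counts["spam"][word] + 1) / (self.message_counts["spam"] + 2)
            p_ham = (self.word_counts["ham"][word] + 1) / (self.message_counts["ham"] + 2)
            log_odds += math.log(p_spam / p_ham)
        return 1 / (1 + math.exp(-log_odds))


# Hypothetical training data marked by the user.
spam_filter = TinyBayesFilter()
spam_filter.train("cheap vitamin pills", "spam")
spam_filter.train("vitamin sale today", "spam")
spam_filter.train("lunch meeting today", "ham")
spam_filter.train("project status update", "ham")

seen = spam_filter.spam_probability("vitamin")
unseen = spam_filter.spam_probability("b12")
```

After this training the filter leans toward spam for “vitamin” but sits at exactly 50/50 for “b12”, because it has never seen that term and has no notion that the two are related.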

Written by Rishi

January 7th, 2007 at 5:20 pm

Posted in Uncategorized

Who’s self-publishing?

with 4 comments

Over the weekend I had dinner in the city with several friends. During our delicious meal, I was chatting with one of the friends about my recent blog post on Structured Blogging when something occurred to me. At the table were ten young, successful 20-something professionals, yet I was the only one there who had a blog. On the drive home, I realized that because I spend my days obsessing over the latest tech news/developments, my world is probably very skewed. In fact, a couple of weeks ago at a poker game with some friends, I had a side conversation with a friend about Y!’s acquisition of del.icio.us, and nobody else in the room had heard the news, much less heard of a website called del.icio.us. When I get emails from friends sharing their recent photos, those emails are coming from old names like ImageStation and SnapFish, not Flickr. Today, I had to spell out M-e-e-b-o several times to a friend who was looking for an IM solution to get around his company’s firewall. I could go on and on with examples…but you get my point.

Many of the techie bloggers in the blogosphere have a grossly skewed view of the world. We get so used to this community of early-adopters that focus is lost on the other 99% of the population. However, for a consumer application/service to achieve real success, it is crucial, of course, to capture the mainstream user which represents the overwhelming majority of the market. And, because “Web 2.0” (I put that in quotes for a reason..heh) apps tend to be community-focused, attracting a wide audience of users is seemingly more important than ever.

One example of this skewed view is the recent talk about decentralized content and the power of self-publishing. The concept of Structured Blogging is built on this principle. As I mentioned in my post on the topic, if I’m going to write a movie review, I want to post it on my blog so I own it and it remains a part of my online identity. Similarly, posts like this predict the end of centralized sites (the author calls them “Walled Gardens”) like Craigslist and eBay because users will inevitably prefer to self-publish their classifieds ads on their own blogs.

The problem with these discussions is the reality that the number of Internet users who blog regularly is tiny. It’s hard to say exactly how many blogs there really are. I don’t trust most of the statistics on the number of blogs because lots of people have blogs (sometimes several) but few actually post to them. According to this survey, only 7% of American adults read blogs regularly. If this is true, then the share of Americans who actively publish via blogs can’t be more than a couple of percent. Yet all this talk of self-publishing requires one fundamental thing: a place for the self to freely publish on the Web, which for most people means having a blog. (Note: I say “freely” publish to exclude sites like MySpace, which do limit the format of content that can be published by the user.) And just a very small fraction of Americans blog.

From what I’m always reading about, the number of bloggers is rapidly rising so maybe, down the road, models involving decentralized content may become more and more of a reality. But, it does seem that we are not nearly as close as many tech bloggers make it seem.

Written by Rishi

December 21st, 2005 at 4:52 am

Structured Blogging. If only the answer was that simple…

with one comment

First of all, what is Structured Blogging? Right now, blog posts are physically just free-form text entries in plain English paragraphs. But logically speaking, a blog post might be a movie review, an editorial on a recent news bit, a description of an upcoming event, etc. While plain old English prose is the optimal mode of comprehension for us humans, machines have a tough time figuring out what the heck you’re talking about unless the content of the entry is tagged or categorized in some way. Structured Blogging is all about incorporating microformats into blog posts in order to structure them (a.k.a. tag them, not in the folksonomy sense but in the tagged-data XML sense). Basically, let’s say I posted:

“I saw Syriana last night and it was thrilling and thought-provoking. Go see it this weekend.”

From these two sentences, you likely had no problem understanding that:
1) Syriana is a movie currently in theaters.
2) I saw Syriana and my review of it is: “thrilling and thought-provoking”
3) I am recommending that people go see it.

For a machine to correctly recognize these exact two sentences as a review for a movie named Syriana is difficult. Furthermore, for the machine to find meaning in what I wrote is another problem in itself. Instead, if I published my post using the hReview microformat, a machine could easily recognize that my post is a review for an item (in this case a movie named Syriana) and know exactly what my review of the item is: “thrilling and thought-provoking”. Structured Blogging has partnered (it’s not clear how deep these partnerships really are) with all the major blogging tool companies to presumably integrate these formats into the popular blogging software so that the blogger need not know the exact syntax and tags of each format. Tagging your movie review post with the hReview format shouldn’t take more than a few clicks.
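To give a feel for why this is so much easier for a machine, here is a small sketch that pulls the item name and the review text out of a post marked up with hReview class names (“hreview”, “item”, “fn”, “description”). The markup is my own guess at what a blogging tool might generate for the Syriana post, and the parser assumes well-formed HTML without void tags like <br>.

```python
from html.parser import HTMLParser

# Hypothetical hReview markup for the two-sentence Syriana post.
POST = """
<div class="hreview">
  <span class="item"><span class="fn">Syriana</span></span>
  <span class="description">Thrilling and thought-provoking. Go see it this weekend.</span>
</div>
"""


class HReviewParser(HTMLParser):
    """Collect the text of elements whose class is an hReview field we want."""

    FIELDS = {"fn", "description"}

    def __init__(self):
        super().__init__()
        self.review = {}
        self._stack = []  # one entry per open tag: a field name or None

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        field = next((c for c in classes if c in self.FIELDS), None)
        self._stack.append(field)

    def handle_endtag(self, tag):
        if self._stack:
            self._stack.pop()

    def handle_data(self, data):
        field = self._stack[-1] if self._stack else None
        if field and data.strip():
            self.review[field] = self.review.get(field, "") + data.strip()


parser = HReviewParser()
parser.feed(POST)
review = parser.review
```

No natural language understanding is needed: the parser just reads off that the item is “Syriana” and what the description says. Whether the description is positive or negative is still left to a human (or to an explicit rating field).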

Will bloggers use this? Let’s take a minute to understand the motivation of the blogger.

Currently bloggers publish their blogs as a medium for building and expressing their self identity on the Web. When you write something on your blog, it stays with you in one centralized place and becomes part of your e-identity. If I write a product review on Amazon, sure it will get read (in fact it would probably get way more readership than it would on my blog) but that’s not the point. I’m sort of giving away my content. The world doesn’t know who I am on Amazon. Right now, people’s online identities are so fragmented. Pieces of their online expression are happening on many different sites. They might publish some product reviews on Amazon, list some items for sale on Craigslist or eBay, write movie reviews on IMDB, regularly comment on news items on various blogs, chat on various message boards… the list goes on. Sure, all these forms of expression come from me, but because they are completely decentralized they do not form any sort of identity for me. Someone reading my Amazon review of a DVD I bought has no idea about the movie reviews that I’ve written on IMDB. Without a doubt, the ability to keep the content I create on the Web in one spot, published in the way I want is compelling. But blogging already offers this. Why do I need to adopt structured blogging?

The reason is so others can better find the content I produce. If someone is searching for reviews of Syriana and I have properly tagged my review as such, then there’s a higher chance that the user will find my review. The reason is that the aggregators of the future, while sucking up my blog content, will be able to recognize and precisely record my post as a Syriana movie review. Without this tagging, the only way my content will be located is by search relevancy for the term ‘Syriana’. That’s pretty much hopeless. Besides, someone searching for ‘Syriana review’ won’t even be likely to be given my blog post because I didn’t even put the word ‘review’ anywhere in it. Okay, so if I use Structured Blogging, people will be able to better find my content. Sweet! Well, it’s not really that perfect.

These aggregators of the future are going to want to aggregate the content they suck up. You can imagine a movie review aggregator that sucks up all the reviews in the blogosphere and provides an uber MetaCritic. So users looking for reviews for Syriana will conveniently see “average 4 star rating based on 35 bloggers”. And then of course this aggregator will have advertising and sell movie tickets and essentially be making money off of my and others’ reviews. Is this aggregator compensating me? Nope. They’re just leeching my content and making a buck. The only thing the aggregator can possibly offer me is increased traffic if, in this case, the user wanted to actually read individual reviews of the movie. Is this a fair tradeoff? If I am posting something like a classified ad, where it absolutely benefits me to increase its visibility and there is real monetary value in it for me, then the answer is yes. For other situations, the answer becomes tricky. Note: This discussion is very similar to the relationship between web publishers and web search engines.
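The aggregation step itself is almost trivial once reviews arrive as structured records, which is part of why it is so tempting to build. A minimal sketch, assuming each harvested review carries an item name, a star rating and the blogger’s URL (all field names, ratings and URLs below are made up):

```python
from collections import defaultdict
from statistics import mean

# Hypothetical structured reviews harvested from three blogs.
reviews = [
    {"item": "Syriana", "rating": 4, "source": "http://blog-a.example/syriana"},
    {"item": "Syriana", "rating": 5, "source": "http://blog-b.example/movies"},
    {"item": "Syriana", "rating": 3, "source": "http://blog-c.example/review"},
]


def summarize(reviews):
    """Build one summary line per item from individual blogger ratings."""
    ratings_by_item = defaultdict(list)
    for review in reviews:
        ratings_by_item[review["item"]].append(review["rating"])
    return {
        item: f"average {mean(ratings):.1f} star rating based on {len(ratings)} bloggers"
        for item, ratings in ratings_by_item.items()
    }


summary = summarize(reviews)
```

Notice that the summary line never needs the “source” field. The only thing flowing back to the bloggers is whatever traffic the aggregator chooses to send through those links.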

Finally, this topic of structuring content was in the news recently thanks to our friends at Google. A few weeks back GoogleBase launched. Read my post about it. The concept with GoogleBase is very similar: Structure data so that it can be better aggregated. Right now, the only way to input into GoogleBase is directly via a web form (they have different forms for different data types) or via a feed. Either way it’s the content creator actively submitting it to Google. But, if structured blogging takes off, doesn’t it make a lot of sense for GoogleBase to suck up structured content from the blogosphere? Sure. If there’s structured content anywhere out on the Web, it makes tons of sense for Google to go fetch it. The problem is that right now there is little, if any.

Written by Rishi

December 15th, 2005 at 5:20 am

Google Base: the process of unifying data on the Internet

with one comment

Back in 2000, in an article titled “Not Your Father’s Internet”, Bill Gates wrote

In many respects, today’s Internet actually mirrors the old mainframe model, with the browser playing the role of “dumb terminal.” All the information you want is located in centralized databases, and served up a page at a time (from a single Web site at a time) to individual users. Web pages are simply an HTML “picture” of the data you need, not the underlying data itself.

What Gates is describing here is the fundamental difference between the Internet infrastructure which stores and exchanges raw information and the Web whose purpose is to convey this information to humans.

Currently, for any type of information, there are often multiple sites each with their own database containing information of that type. Let’s take a simple type of information like classifieds, specifically auto classifieds. There are several sites on the Web that have auto classifieds listings: AutoTrader.com, Craigslist, Cars.com, and many others. Now, if you need to search these classifieds to find a 2001 Honda Civic in your area, you will need to go to each site and perform a search. Horrible.

To be more efficient, you could try a classifieds meta-search site like Oodle, which will automate the process of searching several classifieds sites for you and return a single aggregated result. Sure, this is a time saver, but there are inherent limitations to meta-search engines. Meta-engines do not, of course, have access to AutoTrader’s database or Cars.com’s database; all they can do is crawl and scrape these sites, which is an imperfect process. No matter how much intelligence you build into the scraper, it will never provide a superbly accurate, comprehensive, or up-to-date set of results. There are other limitations, like only being able to search the common denominator of information (if Cars.com differentiates between transmission type but AutoTrader.com does not, then Oodle can’t offer transmission-type search refinement).
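The common-denominator problem is easy to see in a few lines of code. In this sketch each source is a list of scraped listings, and the meta-search site can only offer refinement on fields that every listing from every source provides. The listings and field names are hypothetical, mirroring the transmission example above, and each source is assumed to have at least one listing.

```python
def searchable_fields(sources):
    """Fields present in every listing of every source (the common denominator)."""
    per_source = []
    for listings in sources.values():
        fields = set(listings[0])
        for listing in listings[1:]:
            fields &= set(listing)  # drop fields that any listing lacks
        per_source.append(fields)
    return set.intersection(*per_source)


# Hypothetical scraped data: one site records transmission type, the other does not.
sources = {
    "Cars.com": [
        {"make": "Honda", "model": "Civic", "year": 2001, "transmission": "manual"},
    ],
    "AutoTrader.com": [
        {"make": "Honda", "model": "Civic", "year": 2001},
    ],
}
fields = searchable_fields(sources)
```

The result is make, model and year. The transmission field drops out even though one of the sites has it, so the more sources a meta-engine adds, the fewer refinements it can reliably offer.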

This same auto classifieds example can be applied to many types of information: product data, job listings, news articles, etc. Is it a coincidence that these are the same information types found on Google Base? Of course not.

Ultimately what we humans want is the perfect set of information matching a given search. Any search engine, if limited to searching humanly readable documents (e.g. HTML, PDF, etc.) will never be able to provide perfect information. A better search engine will have access to raw, unadulterated, structured information.

Google Base is simply an attempt to unify the data found in the databases of the world. It’s not sexy, but raw information isn’t sexy. While you and I can add our own data to Google Base, the real power is in the bulk data upload. Imagine if the major classifieds sites continuously uploaded their data to Google Base. Google Base would then become the ultimate classified search. Now, of course, that’s not going to happen so easily because a site like Craigslist, whose value comes entirely from the information in their database (some would argue that Craigslist has other significant value-adds like its user community and simplistic interface), will effectively be putting itself in the fast-lane towards extinction.

However, if eBay were to upload auction listings to Google Base, that would be great for eBay because it would allow Google to more effectively search eBay auction listings. Unlike in Craigslist’s case, it would not threaten eBay’s existence. That’s because for eBay, the auction data is just one part of the puzzle in the auction process. eBay still owns the surrounding processes, like bidding and payment, which are necessary for the auction data to be significant. I doubt Google really has any desire to get into the auction vertical. Google just wants to organize information, not build verticals around this information.

Written by Rishi

November 18th, 2005 at 1:33 am