June 10, 2008...12:42 am

Wikimachina: Wikipedia for machines.

In 1965 H. A. Simon said that “machines will be capable, within twenty years, of doing any work a man can do.”  Unfortunately, over forty years later that prediction is still far from being realized (if we ever get there).  Artificial intelligence has a long, long history.  Some people have tried to reverse-engineer the human brain, while others have used brute-force computing power and predictive algorithms in an attempt to ride Moore’s law to enlightenment.  It hasn’t been the complete failure many would have you believe, but it has fallen far short of H. A. Simon’s prediction.

More recently, the W3C has pursued the Semantic Web — I love the way Tim Berners-Lee expressed the vision:

“I have a dream for the Web [in which computers] become capable of analyzing all the data on the Web – the content, links, and transactions between people and computers. A ‘Semantic Web’, which should make this possible, has yet to emerge, but when it does, the day-to-day mechanisms of trade, bureaucracy and our daily lives will be handled by machines talking to machines. The ‘intelligent agents’ people have touted for ages will finally materialize.”

– Tim Berners-Lee, 1999

Don’t make computers smart, make humans more precise.

The W3C’s efforts with the Semantic Web are extremely ambitious.  My summary of the Semantic Web is that humans need to help machines get smart.  There is a huge amount of data on the web which machines cannot understand, but humans can.  If we can classify that data in a language that machines can understand, machines will serve us much, much better.  

The Resource Description Framework (RDF) sets out to do just that.  Here is an example from the Wikipedia page on RDF, as I cannot think of an easier way to make the point.

Example: The postal abbreviation for New York

Certain concepts in RDF are taken from logic and linguistics, where subject-predicate and subject-predicate-object structures have meanings similar to, yet distinct from, the uses of those terms in RDF. This example demonstrates:

In the English language statement ‘New York has the postal abbreviation NY’, ‘New York’ would be the subject, ‘has the postal abbreviation’ the predicate and ‘NY’ the object.

Encoded as an RDF triple, the subject and predicate would have to be resources named by URIs. The object could be a resource or literal element. For example, in the N-Triples form of RDF, the statement might look like:

<urn:states:New%20York> <http://purl.org/dc/terms/alternative> "NY" .

In this example, “urn:states:New%20York” is the URI for a resource that denotes the U.S. state New York, “http://purl.org/dc/terms/alternative” is the URI for a predicate (whose human-readable definition can be found in the Dublin Core metadata terms documentation), and “NY” is a literal string. Note that the URIs chosen here are not standard, and don’t need to be, as long as their meaning is known to whatever is reading them.

N-Triples is just one of several standard serialization formats for RDF. The triple above can also be equivalently represented in the standard RDF/XML format as:

<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:terms="http://purl.org/dc/terms/">
  <rdf:Description rdf:about="urn:states:New%20York">
    <terms:alternative>NY</terms:alternative>
  </rdf:Description>
</rdf:RDF>
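
To make the equivalence of the two serializations concrete, here is a minimal Python sketch that reads a copy of the RDF/XML document (with the subject URI written as in the N-Triples line) and prints the matching N-Triples statement. It uses only the standard library and handles just this tiny subset of RDF/XML; the function name rdfxml_to_ntriples is my own, and a real application would use a proper RDF parser.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# A copy of the RDF/XML example, using the subject URI from the N-Triples line.
RDF_XML = """\
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:terms="http://purl.org/dc/terms/">
  <rdf:Description rdf:about="urn:states:New%20York">
    <terms:alternative>NY</terms:alternative>
  </rdf:Description>
</rdf:RDF>
"""


def rdfxml_to_ntriples(xml_text):
    """Convert a very small subset of RDF/XML to N-Triples lines.

    Assumption: only rdf:Description elements with plain literal
    properties are handled (no nested resources, datatypes, or
    language tags).
    """
    root = ET.fromstring(xml_text)
    lines = []
    for description in root.findall(f"{{{RDF_NS}}}Description"):
        subject = description.get(f"{{{RDF_NS}}}about")
        for prop in description:
            # ElementTree spells a qualified tag as "{namespace}local";
            # joining the two parts gives the predicate URI.
            namespace, local = prop.tag[1:].split("}")
            # Escape backslashes and double quotes inside the literal.
            literal = (prop.text or "").replace("\\", "\\\\").replace('"', '\\"')
            lines.append(f'<{subject}> <{namespace}{local}> "{literal}" .')
    return lines


if __name__ == "__main__":
    for line in rdfxml_to_ntriples(RDF_XML):
        print(line)
```

The point of the exercise is that the subject, predicate, and object survive the change of syntax untouched: the XML element name plus its namespace is the predicate URI, and the element text is the literal.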

If humans encode objects with machine-readable tags (the RDF example above) and humans create ontologies (the Open Directory Project, the social graph), machines can then use inference to better serve humans (see Guha’s example below).  There are two major categories of projects here, so today I will focus on the first:  encoding entities on the web with machine-readable tags.  I will [try to] tackle the ontology bit in a post later this week.

The challenge with the W3C’s current approach is that it will take an inordinate amount of time to get the key players to adopt the standards required to make the Semantic Web a reality.  This includes not only agreeing to standards, but training thousands, if not millions, of people to leverage these standards.

How does the consumer benefit from all of this?

R.V. Guha developed the Meta Content Framework (MCF) at Apple Computer, and later at Netscape, in the mid-1990s.  MCF was a predecessor to RDF.  Over a decade ago, Guha used this example to demonstrate the power of machine-understandable tags.

“Simple lexical word occurrence based searching is by far the most prevalent way of searching for information today. One of the shortcomings of this approach is its inability to distinguish between different word senses. 
 
Example : The user is using one of the WWW search engines (such as Lycos) to search for pages about lions - the animal - not Lion King or Red Lion Hotels or Lions Club. Since all that the search engine is looking for are occurrences of the four characters “lion”, there is no way in which it can distinguish between these different uses of the word “lion”. 
 
How is one to recognize the word sense of a particular occurrence of “lion” without solving the natural language understanding problem? One way of identifying occurrences of “lion” that have a significantly greater likelihood of referring to the animal is to use a subject categorization of the WWW pages. Yahoo, for example, contains a category corresponding to “Animals & Pets”. Pages that occur under this category that use word “lion” are more likely to be using it to refer to the animal. Unfortunately, Yahoo does not index the words that occur in the content of pages. But our program can issue a query to a search engine such as Lycos, translate (“lift”) the answers into a common meta-content language, filter out those pages that don’t occur under the “Animals & Pets” part of the Yahoo hierarchy and give us a small set of pages all of which are most likely about lion the animal.”
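
Guha’s recipe (search by keyword, lift the hits into a common record format, then filter by directory category) can be sketched in a few lines of Python. Everything here is hypothetical: the URLs, the category paths, and the function names lift and filter_by_category are made up for illustration, and no real search engine or directory is queried.

```python
# Hypothetical keyword hits, standing in for what a search engine
# would return for the four characters "lion".
KEYWORD_HITS = [
    "http://example.org/lion-king-review",
    "http://example.org/african-lions",
    "http://example.org/red-lion-hotels",
    "http://example.org/lion-habitat",
]

# Hypothetical directory, standing in for a human-built subject hierarchy.
DIRECTORY = {
    "http://example.org/lion-king-review": "Entertainment/Movies",
    "http://example.org/african-lions": "Science/Animals & Pets/Mammals",
    "http://example.org/red-lion-hotels": "Business/Travel",
    "http://example.org/lion-habitat": "Science/Animals & Pets/Wildlife",
}


def lift(urls, directory):
    """Translate raw hits into common records that carry a category."""
    return [{"url": url, "category": directory.get(url, "")} for url in urls]


def filter_by_category(records, wanted):
    """Keep URLs whose category path has `wanted` as one of its segments."""
    return [r["url"] for r in records if wanted in r["category"].split("/")]


if __name__ == "__main__":
    records = lift(KEYWORD_HITS, DIRECTORY)
    for url in filter_by_category(records, "Animals & Pets"):
        print(url)
```

The keyword search stays as dumb as ever; all of the disambiguation comes from the human-supplied category attached to each page.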

Wikipedia, mankind’s greatest achievement?

A few years ago John Beatty argued [to me over coffee] that Wikipedia is one of mankind’s greatest achievements.  Here is some data that supports John’s argument: “currently, the English Wikipedia alone has over 2,404,773 articles of any length, and the combined Wikipedias for all other languages greatly exceeded the English Wikipedia in size, giving a combined total of more than 1.74 billion words in 9.25 million articles in approximately 250 languages. The English Wikipedia alone has over 1 billion words, over 25 times as many as the next largest English-language encyclopedia, Encyclopædia Britannica…”

And here is the competition. I’m with John.

Idea — Wikimachina:  Wikipedia for machines.

Wikipedia has assembled the world’s greatest collection of encyclopedic knowledge with [according to an article on Wikipedia, of course] 4,000 editors who together account for 64,567,607 edits, an average of 16,141 edits per editor. This accounts for 32.8% of the 196,705,582 total edits made to the English Wikipedia.  The idea behind Wikimachina is to have a relatively small group of humans (like Wikipedia’s core editors) help tag all content on the web with metadata so that developers can build much smarter applications for humans.

1.  Seed the system with as much tagged data as possible.

Leverage APIs, scrape, and do deals to get data from any service that already has good tagging data — StumbleUpon, Delicious, Digg, Reddit, Flickr and so on… 

2.  Develop a peer-to-peer browser toolbar for the community.

Develop a toolbar that allows the Wikimachina community to tag any entity on the web with machine-readable tags.  All edits would be instantly sent to the distributed P2P index.  So when a Wikimachina community member visits a web page that has been tagged by a fellow community member, the latest change would be visible to them.

3.  Build community, develop community tools.

If we could get 4,000 editors to produce 16,141 tags per editor (roughly 64.6 million tags in total), the web would be a very different place.  Editors would need to learn some basic rules about our tagging system, but it wouldn’t be any harder than HTML.  

A system which separates the creation of these incremental metadata tags from the creation of content has some very positive characteristics.  Systems that let content owners write their own metadata are prone to spam, which is exactly what happened with HTML meta tags in search (because the incentives for content owners to mislead search engines with metadata tags are so high).  So giving a loyal group of semantic “taggers” the tools to use their judgement would likely avoid many of the perils of co-locating the creation of content with the creation of semantic tags.

4.  Offer access to the index through web services.

Just as with Wikipedia, anyone would be able to view the index, which would also be available through a set of web services.  Any third-party developer could leverage these tags to develop or improve their service (search, for example).  In addition to machine-understandable tags, our service would allow bulk download of data, which machines require and which today’s tagging systems, developed with humans in mind, prevent.
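
Steps 2 and 4 can be sketched together as a tiny in-memory index. This is only an illustration under my own assumptions: the TagEdit fields, the rule that the newest edit wins per (url, predicate) pair, and the JSON Lines dump format are all hypothetical choices, not a worked-out design. A real P2P index would also need to handle trust, vandalism, and clock skew.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TagEdit:
    """One machine-readable tag applied to a web page (hypothetical schema)."""
    url: str
    predicate: str
    value: str
    editor: str
    timestamp: int  # seconds since the epoch, as reported by the toolbar


def merge_edits(*batches):
    """Merge edit batches from several peers.

    Assumption: the newest edit wins for each (url, predicate) pair;
    on a timestamp tie, the edit seen first is kept.
    """
    latest = {}
    for batch in batches:
        for edit in batch:
            key = (edit.url, edit.predicate)
            current = latest.get(key)
            if current is None or edit.timestamp > current.timestamp:
                latest[key] = edit
    return latest


def bulk_dump(index):
    """Serialize the whole index as JSON Lines, one tag per line."""
    return "\n".join(
        json.dumps(asdict(index[key]), sort_keys=True) for key in sorted(index)
    )


if __name__ == "__main__":
    page = "http://example.org/lions"
    peer_a = [TagEdit(page, "topic", "animal", "alice", 100)]
    peer_b = [
        TagEdit(page, "topic", "lion (animal)", "bob", 200),
        TagEdit(page, "language", "en", "bob", 150),
    ]
    print(bulk_dump(merge_edits(peer_a, peer_b)))
```

A third-party developer would never touch the toolbar; they would either call a web service backed by something like merge_edits or download the output of something like bulk_dump and load it into their own system.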

4 Comments

  • You should really check out Freebase.com, they cover a lot of the points from this idea in a very user-friendly way. They have seeded their database from Wikipedia and other sources, they have an open API, and they are building a strong community and a powerful tool-set around their data. They even do full database dumps so you can re-use the data however you like.

    P.S. I don’t work for them, I just think they’re doing a lot of things right.

  • Interesting idea, but what is the incentive for users to tag data? Wikipedia offers users the ability to share knowledge with other users, which is something some people like to do (for various reasons). Providing semantic data for machines seems somehow much less gratifying…

  • Eran,

    You nailed it.

    I’m sure this would work technically. And once you have the metadata, I’m sure developers would use it, resulting in consumer adoption. The big question is how do you get thousands (or tens of thousands) of people to do the hard work? Wikipedia has managed it, but it’s non-profit (I think this idea would share knowledge with others, too).

    So three possibilities:

    1. Make it non-profit. Building an open-source repository of metadata would attract the same type of people who edit Wikipedia, develop Linux, and make MySQL great.

    2. Focus this idea on corporations. Doing this within an enterprise should provide users with the incentive — your efforts would make your corporation more productive. This would lead to higher equity value, higher productivity, and consequently higher job stability.

    3. Figure out a way to give contributors credit, or share whatever revenue you generate as a consumer-focused corporation with taggers (like About.com, Epinions, Y! Answers).

  • I’ve always felt that no matter the computing power AI has available to it, it will never match humans until it has the same richness of inputs and outputs. We have the five senses with which to explore, learn, and create memories of our environment. AI does not have these rich inputs, nor does it have the complex set of muscles to control. Wikimachina could be a great start to creating a rich set of inputs for AI. Imagine real-time news from humans, plus data from meteorological and geological sensors, etc… This data might even allow computers to find some rather subtle trends we never noticed before, perhaps exposing things such as corruption.

    So basically you’re suggesting the beginnings of Skynet. I like it.
