Introduction

The Web has changed the way people communicate and collaborate. We can publish information on the Web, making it accessible to anyone connected to it, or we can use the Web as a source of information from which to derive new knowledge. The amount of information accessible on the Web has increased enormously in a short period of time. This growth is a desirable development, but it has also made the problems of the Web more evident. Everyone who has used the Web to search for information knows that it is not as easy or as fast as one would like it to be. Finding information with today's search engines is no longer just a matter of entering keywords and receiving a list of answers: it has also become a matter of wading through long lists of mostly useless answers to find the good ones.

Since there is so much information on the Web, even a simple search can return a huge list of answers. Of course, a broad search in a large pool of information will always yield an extensive list of sources, and this in itself is not a problem - it is an obvious result. The problem on the Web is that narrowing a search with today's search engines leaves out good answers. To narrow a search, one normally reduces the set of possible answers by adding keywords that could only be present in the documents one seeks. But on the Web, where searches are essentially done with keywords alone in a global information space, this does not work. Keyword searches suffer from synonyms (car vs. vehicle), different languages (bil vs. car), and polysemy (search engine vs. car engine). There is also a problem in searching for, e.g., poems, because the keyword "poem" seldom appears in a poem. The effect is that keyword searches are evidently not effective among information sources that, as on the Web, differ widely in language, cultural context, and domain.

The question then arises why almost every search engine uses keyword indexes. The reason is that a keyword index is today the only way to create a search service automatically, by letting robots traverse the Web, collecting information (keywords) and throwing it into a database. This is cheaper than manually categorizing pages, because no humans are involved, and faster, because a robot can index a page faster than a human can categorize it. But why a keyword index? To answer that question one has to consider the formats used to represent information on the Web. Some of the most common text formats are HTML, PDF, PostScript, and plain ASCII text (putting multimedia aside for now). These formats all encode the information as serialized natural language and are intended for human consumption via a presentation tool. The only way to make that information searchable is therefore to collect the individual words and use them in a keyword index, because a computer sees a word not as a word but as a string of bytes. Indexing a Web page is merely a matter of throwing the byte strings, delimited by the encoding of the space character, into a database. As this should indicate, we cannot today (with limited resources) use computers to create search services other than keyword services, and this is becoming a big problem. The easiest solution to this problem is not to teach computers to behave as humans and "understand" natural languages, but to change the way information is expressed [Berners-Lee, RM].
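To make the keyword indexing described above concrete, the following is a minimal sketch in Python of how a robot could throw space-delimited tokens into an index. The page text, the URL, and the in-memory dictionary standing in for the search engine's database are all invented for illustration; a real indexer would of course be far more elaborate.

    from collections import defaultdict

    index = defaultdict(set)  # keyword -> set of page URLs

    def index_page(url, raw_text):
        # The computer sees only space-delimited byte strings, not meanings:
        # "Gore" the dog and "Gore" the politician end up under the same key.
        for token in raw_text.lower().split():
            index[token].add(url)

    def keyword_search(query):
        # A page is returned only if it contains the literal keyword;
        # synonyms ("car" vs. "vehicle") and translations are missed entirely.
        results = [index.get(word, set()) for word in query.lower().split()]
        return set.intersection(*results) if results else set()

    index_page("http://example.org/my-page", "the topic of this page is Gore")
    print(keyword_search("Gore"))        # finds the page
    print(keyword_search("politician"))  # misses it, whoever this Gore may be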

There is another problem with the information on the Web that is a bit harder to see: it lacks precision. If I use the name Gore in a text, e.g. "the topic of this page is Gore", how does a reader know that the page is about the author's dog? The previous statement lacks precision because it is impossible to know who Gore is from that statement alone. Before the Web, the precision of statements in, e.g., books was not that problematic. A book covers a certain topic in a domain and is written in a particular context. Thus it uses words that have a perhaps implicit but (hopefully) always exact meaning that can be deduced from the book's topic, domain, and context. The Web can be seen as being filled with information from every book that exists. This means that precision is lost through the mixing of topics, domains, and contexts, and this has a number of negative effects. One is that it becomes harder to find statements (or information in general) about the thing you are searching for (e.g. a Web page on a specific topic); another is that it becomes harder to collaborate, because the communication can be ambiguous. If I want to find information about the politician Gore and use a basic search engine, the statement in the previous example ("the topic of this page is Gore" - about the dog) makes the page about the dog Gore appear in the list of answers. (If I add the keyword "politician" I will miss pages that use the keyword "statesman" instead.)

This problem of precision also relates to another drawback of the information on the Web: it is very time-consuming to use information from more than one Web page to derive knowledge. This is easiest to explain with an example. If I enter a keyword searching for information about the town "Lichavon", I will receive the information that contains that keyword. But if there is a page somewhere on the Web (one that the search service trusts) stating that "Lichavon" is also called "Objissjf", that information should indicate to the search service that I am also searching for information containing the keyword "Objissjf" (offered as an option, of course). Today this type of deduction has to be made manually by the human who uses the search engine.
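The kind of deduction described above could, in principle, look like the following minimal Python sketch, assuming the search service already keeps a table of trusted "is also called" statements collected from the Web. The town names come from the hypothetical example, and the alias table itself is made up for illustration.

    # Trusted "is also called" statements harvested from the Web (hypothetical).
    aliases = {"lichavon": {"objissjf"}}

    def expand_query(keyword):
        # Offer the alias terms as optional extra keywords, instead of leaving
        # that deduction to the human using the search engine.
        terms = {keyword.lower()}
        terms |= aliases.get(keyword.lower(), set())
        return terms

    print(expand_query("Lichavon"))   # {'lichavon', 'objissjf'}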

The problem of precision stems from the fact that information is not adapted to the global information space that the Web is before it is published. We need a way of disambiguating information for it to become truly useful on the Web. It is easy to move information to the Web, but it is almost impossible to preserve its full value if the information has not been prepared for publication on a global medium. Because the context, domain, and meaning are implicit, they get lost in the universal information space. We need to express information in a way that explicitly describes its domain, its context, and, ideally, its meaning.

Computers outperform humans in one sort of activity: their indefatigable ability to search and to perform well-defined manipulations on well-defined data. How is it then possible that the service we get from computers is so poor, when the computers' indefatigable ability to search and make small inferences is exactly what we need on the Web? The simple answer is that the information on the Web is not well-defined data that computers can process. They can read it, but they don't get it. To a computer this text is simply a string of bytes, with no semantics. A word processor might be able to deduce that this is a sentence, but it has no clue what the sentence is about. It is a bit frustrating that computers cannot help us with such seemingly easy tasks as finding information on the Web or making small deductions from it. It would be extremely useful if computers could use information on the Web as if it were hand-coded into their applications. There are of course numerous advantages to having computers that use the Web as their information source; only imagination sets the limits.

Instead of being only an information space for humans, the Web could also become a database of well-defined data that machines could use to help us. This could mean simple applications such as shopping agents, but also heuristic applications that span the Web, deducing and guessing about information on the stock market and actually buying and selling stocks on our behalf - acting as automatic, personal, around-the-clock brokers. Perhaps this would be the end of the stock market - both terrifying and appealing - but the point is that almost anything becomes possible. With the information pool that exists on the Web today, computers can only process it blindly, without making any real use of it. Again, the easiest solution to this problem is not to teach computers to behave as humans, but to change the way information is expressed.

All the previously mentioned problems stem from the fact that the information is not adapted to the global information space that the Web is, and that it is targeted at human consumption. The information is intended for direct human use, after some rendering software has been applied, and it is simply text whose semantics is stated through natural language. If we manage to make part of the information easier for machines to process, we could make better use of it when searching for and deducing information from the Web. This does not require natural language processing; it is done by attaching some well-defined data to the information that computers can use as a hint of how the information may be used. This would help us humans to better deal with the huge amount of information that is present on the Web. If we continue to produce more and more information on the Web without changing its format, it will become harder and harder to use. We need to turn the Web from an abstract information space for humans into one where computers can help manage and use the information.
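As a minimal sketch of what "attaching well-defined data" could mean, consider the Gore example again. The property names and identifiers below are invented for illustration; the point is only that the statement becomes explicit data a machine can compare, rather than prose it has to interpret.

    # Machine-readable statement attached to the page (vocabulary is hypothetical).
    page_metadata = {
        "page":  "http://example.org/my-page",
        "topic": {
            "name": "Gore",
            "type": "Dog",    # explicit: this Gore is the author's dog,
                              # not the politician
        },
    }

    def page_is_about(metadata, wanted_type):
        # A search service can filter on the declared type instead of
        # guessing from keywords alone.
        return metadata["topic"]["type"] == wanted_type

    print(page_is_about(page_metadata, "Politician"))   # False: the dog page is excluded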

Ironically, the Web, which makes it easier to transmit information, has made it more difficult to share information [Sowa, 2000].

References

[Berners-Lee, RM] Berners-Lee, T., 1998: Semantic Web Road Map. URL: http://www.w3.org/DesignIssues/Semantic.html
[Sowa, 2000] Sowa, J., 2000: Ontology, Metadata, and Semiotics. In B. Ganter & G. W. Mineau (eds.), Conceptual Structures: Logical, Linguistic, and Computational Issues. URL: http://www.bestweb.net/~sowa/peirce/ontometa.htm
