http://www.informix.com/informix/dbweb/grail/anatomy.htm (PC Press Internet CD, 03/1996)

Informix Databases on the WWW

Anatomy of the Web: A Quick Overview of Architecture and Terminology



Anatomy of the World Wide Web

The World Wide Web's initial goal was to provide a single, uniform means of accessing hypermedia documents from anywhere on the Internet using client/server protocols. The development of graphical user interface (GUI)-based client browsers, such as Netscape and NCSA's Mosaic, provided a seamless interface that hid most of the complexity of the net. Since then, a number of enhancements to browsers and servers have opened the doors to the rapid growth of the Web.

The Web primarily consists of three standards: Uniform Resource Locators (URL), HyperText Transfer Protocol (HTTP), and HyperText Markup Language (HTML). These standards are used by all Web browsers and servers to provide a simple mechanism for locating, retrieving, and displaying information. Web browsers also understand other common Internet protocols, including FTP, Gopher, and telnet; HTTP, however, is the Web's native protocol for communication between client and server.

Uniform Resource Locators

A URL is a simple addressing scheme that uniquely identifies a document or file regardless of the protocol. URLs can also identify newsgroups, Gopher menus, and email addresses. There are three elements of a URL: the protocol to be used, the server and port to connect to, and the file path.

The typical format is protocol://server-name[:port]/path. The protocol must be lower case; the server name is case insensitive. If no port is designated, port 80 is assumed. A URL may also include other information, such as arguments to be passed to an application or shell script. Special, or meta, characters are represented by a percent sign (%) followed by the character's hexadecimal equivalent. A Web browser breaks down each URL link request into its constituent components and uses the protocol section to determine how to proceed.
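
For example, a hypothetical URL such as the one below breaks down into its three elements as follows (%20 is the hexadecimal escape for a space):

    http://www.example.com:80/informix/db%20web/anatomy.htm

        protocol:         http
        server and port:  www.example.com, port 80 (the default, so it could be omitted)
        path:             /informix/db web/anatomy.htm  (after decoding %20)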

HyperText Transfer Protocol

HTTP is the primary protocol for distributing information on the Web. It is a highly flexible protocol that defines simple transactions between the client Web browser and the HTTP server. The main goal of HTTP was to keep each exchange simple enough to enable fast response times. To achieve this, HTTP was defined as a "stateless" protocol, one that does not retain any information about a connection after a request is complete. State may be maintained outside the server, however, through CGI programs or databases.

With the addition of in-line images, HTTP server performance suffers because of the increased number of separate, individual connections established to return the images. Newer Web browsers maintain an open connection until the whole transaction is complete. Netscape revision 1.1 added multithreaded connections and cached in-line images to improve performance.

HyperText Markup Language

HTML is derived from Standard Generalized Markup Language (SGML) and is the last of the Web's three core standards. Unlike typical programming languages, markup languages define areas of textual information by tagging them with a specific format. Tags are defined functionally rather than visually. Each Web browser interprets the tags according to its configuration settings, supported fonts, and windowing environment.

HTML also provides the ability to create hypertext links between documents, and between parts of documents, using URLs. Links express relationships between documents: a user can follow them from one topic to another regardless of where the documents reside on the Web. These link threads are the foundation of the World Wide Web's organizational structure.
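
For illustration, a minimal fragment of HTML markup might look like the sketch below. The heading and anchor tags are standard; the text and the link target are purely illustrative.

    <H1>Anatomy of the Web</H1>
    <P>The Web rests on three standards:
    <A HREF="http://www.example.com/standards.html">URLs, HTTP, and HTML</A>.</P>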

There are several levels of support for HTML. Current releases include versions 0.9 through 3.1. HTML is backward compatible. If the Web browser does not understand the newer extensions, it ignores them. Netscape 1.1 supports the highest level of HTML extensions.

Client-Server Communications

Each transaction consists of four parts (a minimal sketch in C follows the list):

  1. The client establishes a connection with the server
  2. The client issues an HTTP request to the server
  3. The server sends a response (e.g., a page or graphics) and a status code
  4. Either the client or server then disconnects
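
The following sketch walks through those four steps as a bare-bones HTTP/1.0 client in C. The socket calls are standard BSD networking calls; the default host and path are placeholders, and error handling and bounds checking are abbreviated.

    /* Minimal sketch of the four-step HTTP transaction as an HTTP/1.0 client. */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <netdb.h>
    #include <sys/socket.h>
    #include <netinet/in.h>

    int main(int argc, char *argv[])
    {
        const char *host = (argc > 1) ? argv[1] : "www.example.com";  /* placeholder */
        const char *path = (argc > 2) ? argv[2] : "/";
        struct hostent *hp = gethostbyname(host);
        struct sockaddr_in addr;
        char request[512], buf[1024];
        int s, n;

        if (hp == NULL)
            return 1;

        /* 1. Establish a connection to the server on port 80. */
        s = socket(AF_INET, SOCK_STREAM, 0);
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_port   = htons(80);
        memcpy(&addr.sin_addr, hp->h_addr_list[0], hp->h_length);
        if (connect(s, (struct sockaddr *)&addr, sizeof(addr)) < 0)
            return 1;

        /* 2. Issue the HTTP request (a blank line ends the headers). */
        sprintf(request, "GET %s HTTP/1.0\r\n\r\n", path);
        write(s, request, strlen(request));

        /* 3. Read the status line, headers, and document from the server. */
        while ((n = read(s, buf, sizeof(buf))) > 0)
            fwrite(buf, 1, n, stdout);

        /* 4. Disconnect. */
        close(s);
        return 0;
    }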

Each request consists essentially of a URL and a Request-Method. Each Request-Method communicates a different class of messages to the server, which allows servers to be small, simple programs. Based upon the URL and Request-Method, the server may return a document or execute a CGI program. If it executes a program, the server spawns a process, sets a number of environment variables, passes the command-line arguments and standard input, and calls that script. Information is returned on the standard output.
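
As a concrete illustration, the sketch below is a minimal CGI program in C: it emits the required Content-type header and then echoes back the environment the server set for it. The header format and the standard environ pointer are conventional; everything else is illustrative.

    /* Minimal CGI sketch: whatever is written to standard output (after a
       Content-type header and a blank line) is returned to the browser. */
    #include <stdio.h>

    extern char **environ;          /* the variables set by the HTTP server */

    int main(void)
    {
        char **p;

        printf("Content-type: text/plain\r\n\r\n");   /* header, then blank line */
        for (p = environ; *p != NULL; p++)
            printf("%s\n", *p);                       /* e.g. REQUEST_METHOD=GET */
        return 0;
    }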

The Request-Method is sent as an environment variable called REQUEST_METHOD. There are five REQUEST_METHOD arguments used by the server to determine how to process a request. In practice, only three are currently in use.

HEAD

The HEAD method returns information about a particular document rather than the document itself. It is used primarily by browsers that maintain a cache, to retrieve the Last-Modified date. If that date is newer than the document in the cache, the newer document is retrieved.

GET

The GET method is used to return a specific document or run a simple program, and it is the most commonly used method. It can accept a simple command-line argument. The HTTP server automatically populates the QUERY_STRING environment variable and also passes the arguments on the command line. The simplest form method, ISINDEX, uses GET. The argument is restricted to about 200 characters before overflowing the buffer.
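
A GET handler along those lines might look like the sketch below. QUERY_STRING is the standard CGI variable; the page text is purely illustrative, and decoding of the query is omitted for brevity.

    /* Sketch of a GET/ISINDEX handler: echo the query string back as HTML. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        char *query = getenv("QUERY_STRING");   /* set by the server for GET */

        printf("Content-type: text/html\r\n\r\n");
        if (query == NULL || *query == '\0')
            printf("<ISINDEX>\n<P>Enter a search term.</P>\n");
        else
            printf("<P>You searched for: %s</P>\n", query);  /* + and %XX left encoded */
        return 0;
    }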

POST

The POST method is used to transfer data from the client to the server. The server passes the arguments to the gateway program as a single, continuous string on the standard input stream. An ampersand (&) and the end-of-line marker serve as argument dividers; spaces are replaced with a plus sign (+), and the field name and argument are separated by an equal sign (=). Finally, special characters are encoded as a percent sign (%) followed by their hexadecimal value.

The standard input may consist of one to many arguments; a large form may contain as many as 500. Each argument consists of a field name, an equal sign, and a field value. Meta characters arrive in hexadecimal form and must be converted back to ASCII. A custom set of routines is required to read the standard input and parse it into usable arguments. Each argument may then be allocated room on the memory heap as a node in a linked list.
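
A parsing routine along the lines described might look like the following sketch. The splitting on &, the + and %XX conventions, and the linked list of heap-allocated arguments come from the text above; the struct and function names are illustrative.

    /* Sketch: parse a form body such as "name=Informix&city=Menlo%20Park"
       into a linked list of name/value pairs on the heap. */
    #include <ctype.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    struct pair {
        char *name;
        char *value;
        struct pair *next;
    };

    static int hexval(int c)
    {
        return isdigit(c) ? c - '0' : toupper(c) - 'A' + 10;
    }

    /* Convert '+' to space and %XX escapes back to the characters they encode. */
    static void decode(char *s)
    {
        char *out = s;

        while (*s) {
            if (*s == '+') {                       /* '+' encodes a space */
                *out++ = ' ';
                s++;
            } else if (*s == '%' && isxdigit((unsigned char)s[1])
                                 && isxdigit((unsigned char)s[2])) {
                *out++ = (char)(hexval(s[1]) * 16 + hexval(s[2]));
                s += 3;                            /* skip the %XX escape */
            } else {
                *out++ = *s++;
            }
        }
        *out = '\0';
    }

    /* Split the body on '&' into arguments and on '=' into name and value. */
    struct pair *parse_form(char *body)
    {
        struct pair *head = NULL;
        char *field = strtok(body, "&");

        while (field != NULL) {
            struct pair *p = malloc(sizeof *p);
            char *eq = strchr(field, '=');

            if (eq != NULL)
                *eq = '\0';
            p->name  = field;
            p->value = (eq != NULL) ? eq + 1 : field + strlen(field);
            decode(p->name);
            decode(p->value);
            p->next = head;                        /* push onto the linked list */
            head = p;
            field = strtok(NULL, "&");
        }
        return head;
    }

    int main(void)
    {
        char body[] = "name=Informix&city=Menlo%20Park";
        struct pair *p;

        for (p = parse_form(body); p != NULL; p = p->next)
            printf("%s = %s\n", p->name, p->value);
        return 0;
    }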

Common Gateway Interface

Gateway programs, or scripts, are external executable programs that can be run by the HTTP server. They are external to the server in order to provide maximum flexibility; they can be written and executed by a number of means.

Gateways conforming to the specification can be written in any language that produces an executable file. These languages include the C shell, the Bourne shell, the Korn shell, Perl, C/C++, Tcl, and most 4GL or object languages. Of course, this includes Informix languages such as INFORMIX-4GL, INFORMIX-ESQL/C, and INFORMIX-NewEra.

Methodology

Every program should first determine the proper method for processing. This can be done by checking the environment variable REQUEST_METHOD, which returns GET, HEAD, or POST depending upon how the program was accessed.
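
A dispatch of that kind, sketched in C, might look like this; the messages printed in each branch are placeholders for real handling code.

    /* Sketch: branch on the request method before doing anything else. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        char *method = getenv("REQUEST_METHOD");

        if (method == NULL) {
            fprintf(stderr, "not invoked by an HTTP server\n");
            return 1;
        }
        printf("Content-type: text/plain\r\n\r\n");
        if (strcmp(method, "GET") == 0 || strcmp(method, "HEAD") == 0)
            printf("handling a %s request\n", method);   /* query in QUERY_STRING */
        else if (strcmp(method, "POST") == 0)
            printf("handling a POST request\n");         /* body on standard input */
        else
            printf("unsupported method: %s\n", method);
        return 0;
    }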

The Command Line

For GET programs, earlier HTTP servers used the first command-line argument (argv[1]) to present the path information and the second argument (argv[2]) to return the query string. This information can now be found in the environment variables PATH_INFO and QUERY_STRING, respectively. For non-form GET requests, the query string will always be decoded and placed on the command line.

For POST programs, the first argument (argv[1]) was used to contain the content length. This information has been moved to the environment variable CONTENT_LENGTH. The query string is not automatically populated. It can be populated by using the cgiutils program provided with the HTTP server, but security may be compromised by invoking shell scripting languages. It is generally safer and more efficient to parse the information from the standard input.
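
Reading the body directly might look like the sketch below. CONTENT_LENGTH is the standard CGI variable; the rest is illustrative, and the buffer would normally be handed to a parser such as the one sketched earlier.

    /* Sketch: read exactly CONTENT_LENGTH bytes of the POST body from
       standard input rather than relying on cgiutils or a shell. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        char *len_str = getenv("CONTENT_LENGTH");
        long  len     = (len_str != NULL) ? atol(len_str) : 0;
        char *body;

        if (len <= 0)
            return 1;
        body = malloc(len + 1);
        if (body == NULL || fread(body, 1, len, stdin) != (size_t)len)
            return 1;
        body[len] = '\0';          /* now ready for name/value parsing */

        printf("Content-type: text/plain\r\n\r\n");
        printf("received %ld bytes\n", len);
        return 0;
    }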


Other Web Resources

From here you can visit other sites that describe various aspects of building Web pages, setting up Web servers, and related tools and technologies.

World Wide Web FAQ

Comprehensive Guide to Publishing on The Web

Running A WWW Service

NCSA httpd (Overview)

Creating HTML fill-out forms

CGI Scripts

Perl reference material

Perl FAQ

Perl Archive

Perl Manual

