Introduction to CGI Programming

Nick Johnson

This document seeks to provide the reader with a brief but informative introduction to CGI programming. Perl will be used as the language of choice in most examples, so some knowledge of Perl will be helpful, but it is not absolutely required. Also, to fully understand most of what I'll be saying, you need to know enough about Unix to navigate directories and change file permissions. I'm also assuming you know how to code.

What CGI is

CGI is an acronym for "Common Gateway Interface". It simply defines a way for a web server to run a program on behalf of a web client and pass data between the two. A CGI program is nothing particularly magical; it is simply a program that produces valid output for a given MIME type, possibly based on some input. For example, most CGI programs produce output of the type text/html, which is the HTML that we've all grown to love.

The first thing to be aware of is where you may execute a CGI in terms of its directory. Many ISPs and corporate web servers will allow you to execute CGIs in any web directory. Some will restrict you to a particular cgi-bin directory. Once you've determined this, you should also keep in mind that the web server must be able to read and execute your code, so you will most often need to set the file mode to 755.
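At the shell, setting the mode is normally just chmod 755 on the file. As a rough illustration of what that mode means, here is a small Python sketch that applies the same permission bits. Python is used only for illustration, and the helper name make_cgi_executable is made up for this example.

```python
import os
import stat

def make_cgi_executable(path):
    """Set mode 755: the owner may read, write and execute;
    everyone else (including the web server) may read and execute."""
    os.chmod(path, 0o755)
    # Report the permission bits now in effect so the caller can check them.
    return stat.S_IMODE(os.stat(path).st_mode)
```

This assumes a Unix-like system; the returned value should equal 0o755 if the change succeeded.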

Hello, World

Let's jump right in with the world's most nauseatingly repetitious programming example, Hello, World. Just for fun, I've included Hello World in several languages. CGI doesn't care what language your program is written in so long as the output contains valid headers (read on for more explanation of this).

Shell Script

Perl

C

As you might imagine, much of a CGI program is printing data back out to the user.

Notice that at the beginning of each program, we send the line Content-type: text/html followed by two newlines. This is required by the HTTP protocol: the first newline terminates the Content-type header, and the second (an empty line) terminates the header section as a whole. In this simple case, which is also the common case, Content-type is the only header we provide. Even if you send other headers, you must always send the Content-type header.
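To make the header rule concrete, here is a minimal Hello World sketch in Python (chosen only for illustration; the function name hello_response is made up). It shows the one header, the empty line that ends the headers, and then the body.

```python
#!/usr/bin/env python3
import sys

def hello_response():
    """Build a complete CGI response: one header, a blank line, then the body."""
    header = "Content-type: text/html\n"   # first newline ends this header
    blank_line = "\n"                       # second newline ends all headers
    body = "<html><body><h1>Hello, World</h1></body></html>\n"
    return header + blank_line + body

if __name__ == "__main__":
    # The web server relays whatever the program writes to standard output.
    sys.stdout.write(hello_response())
```

If the blank line is left out, the server has no way to tell where the headers stop and the document begins.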

The Content-type isn't always text/html. Other common types include text/plain, application/pdf, and so on.

Rant

Let me take a moment to rant briefly about printing out sound HTML. It is really easy when coding to produce lousy HTML by skipping tags, leaving off end tags, and that kind of thing. It's also easy, as a coder in a hurry, to leave out important details of HTML; something that comes to mind immediately is the ALT attribute of the IMG tag. It is a good idea to look at your output in several browsers, including one that doesn't handle graphics (like lynx). Remember that your audience may include visually impaired people who won't like hearing "image image image image link link link link" when they view your page.

Using Forms and Accepting Input

Now that we've touched on output briefly, let's take a look at reading user input for the CGI. There are two means of sending input to the CGI: GET and POST.

Data in either GET or POST operations is URL-encoded. Spaces are translated into plus signs (+). Most other non-alphanumeric characters are translated into a percent sign (%) followed by the two-digit hexadecimal ASCII value of that character. For example, the string "space tab newline" (where each word is followed by the character it names) would translate into "space+tab%09newline%0D%0A". Usually you don't have to think about this, but it is something to remember if, for example, you need to pass a percent sign in a GET URL. In that case, you just substitute the hex-encoded percent sign (%25).
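The encoding rules above can be tried out with Python's standard urllib.parse module (Python is used here only for illustration). This sketch encodes the same example string and then decodes a percent sign.

```python
from urllib.parse import quote_plus, unquote_plus

# "space" followed by a space, "tab" followed by a tab,
# "newline" followed by a carriage return and line feed.
raw = "space tab\tnewline\r\n"

encoded = quote_plus(raw)
print(encoded)                  # space+tab%09newline%0D%0A

# Decoding reverses the process; %25 turns back into a literal percent sign.
print(unquote_plus("100%25"))   # 100%
```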

The data is actually passed into the CGI using environment variables and/or STDIN. The method used to send the data is placed in the REQUEST_METHOD environment variable, which will contain either GET or POST when a form is submitted.

Get

The GET method passes query information along in the URL of the CGI. For example, http://somehost/neato.cgi?var1=val1&var2=val2.... The values are separated from the CGI name with a single question mark, the variable/value pairs are separated from each other by ampersands (&), and the variables are separated from their values by equals signs.

When using the GET method, the REQUEST_METHOD variable will be set to GET and the data you are interested in will be in the QUERY_STRING environment variable.
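As a sketch of how a program might pick the GET data apart, here is a short Python example (Python chosen for illustration; parse_get is a made-up name). It checks REQUEST_METHOD and then splits QUERY_STRING using the standard library.

```python
import os
from urllib.parse import parse_qsl

def parse_get(environ=os.environ):
    """Return the GET variables as a dict, or {} if this is not a GET request."""
    if environ.get("REQUEST_METHOD") != "GET":
        return {}
    query = environ.get("QUERY_STRING", "")
    # parse_qsl splits on '&' and '=', and undoes the '+' and %XX encoding.
    return dict(parse_qsl(query, keep_blank_values=True))
```

For the URL shown earlier, parse_get would be expected to return a dict with the keys var1 and var2. Passing a plain dict instead of os.environ makes the function easy to try without a web server.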

Post

POST data is sent to the CGI program on STDIN, formatted exactly the same way as with GET. All POST data is sent on a single line, and the CONTENT_LENGTH environment variable tells you how many bytes to expect. Read STDIN only once, and only that many bytes, or you will find your program hanging while it waits for more data.
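One way to avoid the hanging read is to ask for exactly as many bytes as the server says are coming; CGI supplies that count in the CONTENT_LENGTH environment variable. The following Python sketch (illustrative only; parse_post is a made-up name) reads from any file-like object, so it can be tried with a fake STDIN.

```python
import io
from urllib.parse import parse_qsl

def parse_post(environ, stream):
    """Read exactly CONTENT_LENGTH bytes from stream and decode them into a dict."""
    length = int(environ.get("CONTENT_LENGTH") or 0)
    body = stream.read(length).decode("ascii", errors="replace")
    # POST data uses the same var=value&var=value encoding as GET.
    return dict(parse_qsl(body, keep_blank_values=True))

# Stand-in for what the web server would provide on STDIN.
data = b"var1=val1&var2=hello+world"
fake_stdin = io.BytesIO(data)
print(parse_post({"CONTENT_LENGTH": str(len(data))}, fake_stdin))
```

In a real CGI program the call would look like parse_post(os.environ, sys.stdin.buffer).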

Chances are you won't have to worry about the exact implementation of receiving input into your program. There are libraries in existence already that will handle parsing CGI information for you.

Forms

I'll just touch on forms very quickly here since coding HTML is outside the scope of this paper. The important things to know are to specify the type of request operation you want in the <FORM> tag, that the NAME of each form element becomes a CGI variable, and that the HIDDEN input type is your friend.

An example

Here is a quick example HTML page containing a form, and a CGI that does nothing more than print out all the environment variables so you can see what's actually going on.

form.html

dumpvars.cgi

Try posting various pieces of information in the form and also try executing the CGI program with your own query string on the URL.
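For readers who want to experiment on their own server, a variable-dumping CGI can be sketched in a few lines. The version below is a Python illustration written for this purpose, not the original dumpvars.cgi; the function name dump_environment is made up.

```python
#!/usr/bin/env python3
import os

def dump_environment(environ):
    """Return a text/plain CGI response listing every environment variable."""
    lines = ["Content-type: text/plain", ""]   # header, then the blank line
    for name in sorted(environ):
        lines.append(f"{name}={environ[name]}")
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    print(dump_environment(os.environ), end="")
```

Requesting it with a query string on the URL should make REQUEST_METHOD and QUERY_STRING show up in the listing.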

Parsing CGI Input with Perl

Many people prefer to use CGI.pm's query objects to handle CGI variables. I don't, and find query objects rather annoying. For some things, objects are overkill, IMO. Fortunately for us, CGI.pm provides us with the old standard ReadParse. For instances where Perl 5 is not available, ReadParse is included in cgi-lib.pl.

ReadParse takes the variable/value pairs delivered to the program via CGI and makes these into key/value pairs in an associative array (hash). You can provide your own hash, or use the default, which is %in.

ReadParse Example

ReadParse will take care of converting all of your strings back to their normal selves, separating variables, etc. What you get in the hash is exactly what the user submitted in the form or URL.

One thing to be very cautious of if you are programming in a language without bounds checking (e.g., C) is to make sure that you include checks of your own when parsing CGI data. Buffer overruns are not your friends.

Maintaining State Information

One common task of a CGI program is to guide a user through multiple stages of some process; there are many situations where a single form with a single result is not appropriate. This could be because successive inputs depend on previous inputs. When this happens, you need to generate all of the HTML on the fly, and maintain state information as you go along.

It is important to keep in mind that your CGI program starts up completely stupid, reads in variables, and dies again each time you access it. Unless you take care of it yourself, all variable information is lost between accesses.

To pass state information along, use the HIDDEN input type. You can see an example above.

Here is an example CGI program that implements part of Scene 23 from Monty Python and the Holy Grail. Well, sort of. What it should do is demonstrate how to pass along state information and how successive screens can depend on previous inputs. Keep in mind that we also have to keep track of which screen we are on (or which action to perform).

For the sake of brevity, the CGI program is here in text form. You can also execute the CGI here.

If you look at the example, you will see that we keep passing along the "name" CGI variable in order to know which element of the script to present next. In addition, setting the "page" variable tells us which part of the script we're viewing once we know the name. The absence of a "page" variable tells us that we haven't viewed anything yet, so we print the first page.
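The general pattern can be sketched independently of any particular script. The Python example below is a made-up illustration, not the program described above: render_page is a hypothetical function that takes the variables from the previous screen and returns a form whose hidden fields carry "name" and "page" forward.

```python
from html import escape

def render_page(form):
    """Return the HTML for the next screen, given the previous screen's variables."""
    name = form.get("name", "")
    page = form.get("page")
    if page is None:
        # No "page" variable yet: nothing has been viewed, so show the first page.
        return ('<form method="POST">'
                'What is your name? <input type="text" name="name">'
                '<input type="hidden" name="page" value="1">'
                '<input type="submit"></form>')
    # Sketch only: a real program should check that page is a number.
    next_page = int(page) + 1
    # Carry the earlier answer and the screen counter forward as hidden fields.
    return ('<form method="POST">'
            f'<p>Screen {escape(page)} for {escape(name)}</p>'
            f'<input type="hidden" name="name" value="{escape(name)}">'
            f'<input type="hidden" name="page" value="{next_page}">'
            '<input type="submit" value="Next"></form>')
```

Escaping the value before echoing it back matters, since hidden fields come from the user just like any other input and can be changed at will.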

Other Environment Variables

In addition to the REQUEST_METHOD and QUERY_STRING variables, there are a few additional environment variables that you will find useful when writing CGI programs.

REMOTE_ADDR and REMOTE_HOST are the IP address and FQDN of the connecting machine, respectively. These are handy if you don't like a particular block of IPs or hostnames and want to give them a nasty message. For example, if ($ENV{REMOTE_HOST} =~ /\.AOL\.COM$/i) { print "Lamer!"; }
HTTP_USER_AGENT lists the software used to connect and is the source of all kinds of fun. You will find that most report "Mozilla" and then say "compatible" in parentheses for backward compatibility with something or other. I think it's kind of funny that MSIE has to say it is Mozilla compatible. One possible use of this variable is to send text-only output to text browsers.
PATH_INFO is really cool. You can put a CGI in the middle of a path, and PATH_INFO will contain the rest of the path. For example, http://somehost/somecgi/this/is/a/cool/path.html will have PATH_INFO set to /this/is/a/cool/path.html. This is nice when you want your URLs to look nice and clean and is also nice for logging purposes.
REMOTE_USER contains the username of the client user if your page is protected (i.e., the user must log in with a username and password).
HTTP_COOKIE contains any cookies the browser sent back to you. C is for Cookie, and that's good enough for me!
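To show how these variables might be used together, here is a small Python sketch (illustrative only; describe_request and the keys of the dict it returns are made up). It assumes that a text browser such as lynx identifies itself with a user agent string beginning with "Lynx".

```python
import re

def describe_request(environ):
    """Pick out a few CGI variables and make simple decisions based on them."""
    host = environ.get("REMOTE_HOST", "")
    agent = environ.get("HTTP_USER_AGENT", "")
    return {
        # Same test as the Perl one-liner: hostname ends in .aol.com, any case.
        "blocked": re.search(r"\.aol\.com$", host, re.IGNORECASE) is not None,
        # Assumption: text browsers of interest announce themselves as Lynx.
        "text_only": agent.lower().startswith("lynx"),
        # Whatever followed the CGI's name in the URL path.
        "path_info": environ.get("PATH_INFO", ""),
        # None unless the page is password protected.
        "user": environ.get("REMOTE_USER"),
    }
```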

Advanced Stuff: cookies and whatnot

Cookies

Often it would be nice to store information in a more permanent fashion than just passing it along in a variable, or to keep information on the user's machine after they've left your site. For example, you may want to keep track of a user's timezone from one visit to the next. One way to accomplish this would be to create user accounts on the fly and have the user log in each time he or she visited your site. Users, however, are stupid and prone to forgetting usernames and passwords. A much cleaner, more transparent approach, and therefore one that is easier on everyone involved, is to set a cookie in the user's browser.

Before you have a panic attack about what you've heard about cookies, or what your Aunt Sally told you about cookies and the Good Times virus, let me tell you exactly what a cookie is and what it is not. A cookie is nothing more than a chunk of data that you leave on the user's machine, with an optional expiration date. (You wouldn't want your cookies to spoil, after all.) You can put whatever information you like in a cookie, and the web browser will always send the cookie back to you when it accesses your site until the cookie expires or it is unset.

It is possible to set a cookie that will be sent back to a different web server than the one that set the cookie, but I've never seen this done, and I also don't know how to do it.

Cookies can't track what other sites you've been to, what pages you've downloaded, where the kiddie porn is on your hard drive, or anything like that. They are simply a way to store information in a more permanent fashion, and nothing more.

That being said, here's how to set a cookie: before you send the Content-type header, send a Set-Cookie header. Each Set-Cookie header sets one key/value pair, followed by optional attributes such as expires; to set several cookies, send several Set-Cookie headers. The header typically looks like this:

Set-Cookie: key=value; expires=Wed, 17-Dec-2014 19:32:00 GMT

The expires key is optional; if you don't specify it, the cookie normally lasts only until the user closes the browser. The expiration date is sent in the format Day, DD-Mon-YYYY HH:MM:SS GMT.
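Getting the date format right by hand is fiddly, so here is a Python sketch that builds it (Python used for illustration; cookie_expires and set_cookie_header are made-up names). It spells out the day and month names itself rather than relying on the system locale, and it assumes the datetime passed in is already in GMT.

```python
from datetime import datetime

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

def cookie_expires(when):
    """Format a datetime (assumed to be GMT) as Day, DD-Mon-YYYY HH:MM:SS GMT."""
    return "{}, {:02d}-{}-{:04d} {:02d}:{:02d}:{:02d} GMT".format(
        DAYS[when.weekday()], when.day, MONTHS[when.month - 1], when.year,
        when.hour, when.minute, when.second)

def set_cookie_header(key, value, expires=None):
    """Build one Set-Cookie header line, with or without an expiration date."""
    header = f"Set-Cookie: {key}={value}"
    if expires is not None:
        header += "; expires=" + cookie_expires(expires)
    return header

print(set_cookie_header("TZ", "PDT", datetime(2014, 12, 17, 19, 32, 0)))
```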

Your cookie will be delivered back to you in the HTTP_COOKIE environment variable. Here is an example Perl subroutine for processing a cookie value into a hash.
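The same job can be sketched in Python, used here purely for illustration; parse_cookies is a made-up name. The sketch assumes the browser sends the cookies back as key=value pairs separated by semicolons, and it strips surrounding double quotes from values.

```python
def parse_cookies(cookie_string):
    """Turn the contents of HTTP_COOKIE into a dict of key/value pairs."""
    cookies = {}
    for pair in cookie_string.split(";"):
        if "=" not in pair:
            continue                      # skip empty or malformed pieces
        key, value = pair.split("=", 1)   # split on the first '=' only
        cookies[key.strip()] = value.strip().strip('"')
    return cookies
```

A typical call would be parse_cookies(os.environ.get("HTTP_COOKIE", "")), which yields an empty dict when the browser sent no cookies at all.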

Most browsers don't seem to care much about the format of the other data in the cookie, but the generally accepted form is shown above. You may want to put your values in quotes as well.

It is a good idea not to rely on cookies being there. Many paranoid users turn cookies off altogether, and some browsers don't support cookies at all. You should have a contingency in mind for people who can't deal with cookies so that your CGI program is accessible to all.

Whatnot

This section will expand as I find more interesting things to point out. For now, I'll mention briefly the Window-Target header. This spiffy header allows your CGI to specify which frame the output is to appear in, or which window if you have multiple browser windows around with different names. (And even if you don't, Netscape (and maybe IE?) will spawn off another browser window if you reference a window name that doesn't already exist.) The format is simple:

Window-target: window

For example, if you wanted to make sure that your program never came up in someone else's frame, you could use the special window name _top:

Window-target: _top

Remember that when you're using multiple headers, put the Content-type header last, with two newlines following it, and only one newline after each of the other headers. For example:

Set-Cookie: TZ=PDT; expires=Wed, 17-Jun-1998 00:00:00 GMT
Window-target: _top
Content-type: text/html

This example would set a cookie containing a TZ variable that expires on June 17, 1998, force the page to display on top (not in any frame), and specify that what follows is HTML.
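The header ordering described above is easy to get wrong when headers are printed from several places in a program. One possible approach, sketched here in Python for illustration with a made-up function name build_response, is to collect the extra headers in a list and emit them all at once with Content-type last.

```python
def build_response(body, extra_headers=()):
    """Join the headers with single newlines, put Content-type last,
    then add the blank line and the body."""
    lines = list(extra_headers)
    lines.append("Content-type: text/html")
    # "\n\n" ends the Content-type line and supplies the blank line after it.
    return "\n".join(lines) + "\n\n" + body

print(build_response("<p>Hello</p>", [
    "Set-Cookie: TZ=PDT; expires=Wed, 17-Jun-1998 00:00:00 GMT",
    "Window-target: _top",
]))
```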