Web Authoring FAQ: Getting Started

This document answers questions asked frequently by web authors. While its focus is on HTML-related questions, this FAQ also answers some questions related to CSS, HTTP, JavaScript, server configuration, etc.

This document is maintained by Darin McGrew <darin@htmlhelp.com> of the Web Design Group, and is posted regularly to the newsgroup comp.infosystems.www.authoring.html. It was last updated on October 20, 2001.

Section 1: Getting Started

What is everyone using to write HTML?
How can I show HTML examples without them being interpreted as part of my document?
How do I get special characters in my HTML?
Should I put quotes around attribute values?
How can I include comments in HTML?
How can I avoid using the whole URL?
Should I end my URLs with a slash?
How can I check for errors?
What is a DOCTYPE? Which one do I use?

The following questions have moved to another section of the FAQ.

3.1. What is everyone using to write HTML?

It seems that everyone has a different preference for which tool works best for them. You may find lists of HTML authoring tools at:

http://dir.yahoo.com/Computers_and_Internet/Software/Internet/World_Wide_Web/HTML_Editors/
http://www.winfiles.com/ (search "HTML Editors")
http://www.tucows.com/ (Win 95, Win 3.x, Macintosh, OS/2)
http://shareware.cnet.com/ (search "HTML editor")

Keep in mind that, typically, the less HTML a tool requires you to know, the worse the HTML it produces. In other words, you can usually do better by hand if you take the time to learn a little HTML.

3.2. How can I show HTML examples without them being interpreted as part of my document?

Within the HTML example, first replace the "&" character with "&amp;" everywhere it occurs. Then replace the "<" character with "&lt;" and the ">" character with "&gt;" in the same way.
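As a rough illustration of why the ampersand has to be handled first, here is a minimal Python sketch of the replacements described above. The function name escape_markup and the sample link are hypothetical; Python's standard html.escape function performs the same replacements and is the more practical choice.

```python
import html


def escape_markup(source: str) -> str:
    """Escape &, < and > so that HTML source is displayed literally."""
    # "&" must be replaced first. If it were replaced last, the "&" that
    # begins each freshly inserted "&lt;" and "&gt;" would be escaped again.
    escaped = source.replace("&", "&amp;")
    escaped = escaped.replace("<", "&lt;")
    escaped = escaped.replace(">", "&gt;")
    return escaped


example = '<A HREF="page.cgi?a=1&b=2">link</A>'
print(escape_markup(example))
# The standard library gives the same result when quote escaping is off.
print(html.escape(example, quote=False))
```

Both calls print the same escaped string, which can be pasted into a document (for example inside a PRE element) to show the markup as text.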

The next Q&A addresses the more general issue of representing arbitrary characters in HTML documents.

3.3. How do I get special characters in my HTML?

The answer to the previous question addressed the special case of the less-than ('<'), greater-than ('>'), and ampersand ('&') characters. In general, the safest way to write HTML is in US-ASCII (ANSI X3.4, a 7-bit code), expressing characters from the upper half of the 8-bit code by using HTML entities. See the answer to "Which should I use, &entityname; or &#number; ?"
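One way to produce such 7-bit output automatically is sketched below in Python. It relies on the standard "xmlcharrefreplace" error handler, which replaces every character outside US-ASCII with a decimal &#number; reference. The function name to_ascii_html and the sample text are hypothetical, and the sketch does not escape "&", "<", or ">"; that step is separate.

```python
def to_ascii_html(text: str) -> bytes:
    """Encode text as US-ASCII, using &#number; for everything else."""
    return text.encode("ascii", errors="xmlcharrefreplace")


# Sample text written with escapes so this file itself stays 7-bit clean:
# e-acute (233), n-tilde (241), o-umlaut (246), sharp s (223).
sample = "caf\u00e9, se\u00f1or, Gr\u00f6\u00dfe"
print(to_ascii_html(sample))
```

All four characters in the sample lie in the upper half of ISO-8859-1, so the numeric references produced here fall in the range 160 through 255.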

Working with 8-bit characters can also be successful in many practical situations: Unix and MS-Windows (using Latin-1), and also Macs (with some reservations).

The available characters are those in ISO-8859-1, listed at <URL:http://www.htmlhelp.com/reference/charset/>. ISO-8859-1 is intended for English, French, German, Spanish, Portuguese, and various other western European languages. (It is inadequate for many languages of central and eastern Europe and elsewhere, let alone for languages not written in the Roman alphabet.) On the Web, these are the only characters reliably supported. In particular, characters 128 through 159 as used in MS-Windows are not part of the ISO-8859-1 code set and will not be displayed as Windows users expect. These characters include the em dash, en dash, curly quotes, bullet, and trademark symbol; neither the actual character (the single byte) nor its &#nnn; decimal equivalent is correct in HTML. Also, ISO-8859-1 does not include the Euro currency character. (See the last paragraph of this answer for more about such characters.)
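The problem with characters 128 through 159 can be seen in a short Python sketch. It assumes a single byte, 0x97, which the MS-Windows code page 1252 uses for the em dash; the variable names are illustrative only.

```python
import unicodedata

raw = b"\x97"  # one byte, as an MS-Windows editor might save an em dash

as_windows = raw.decode("cp1252")   # how the Windows user sees the byte
as_latin1 = raw.decode("latin-1")   # how ISO-8859-1 defines the byte

# Under Windows-1252 the byte is the em dash, Unicode code point 8212.
print(ord(as_windows))

# Under ISO-8859-1 the same byte is code point 151, a control character
# (Unicode category "Cc"), not a printable dash.
print(ord(as_latin1), unicodedata.category(as_latin1))
```

This is why &#151; is no more correct than the raw byte: the number 151 names the control character, not the dash. The reference built from the Unicode code point would be &#8212;, which is only defined from HTML 4 onward and may not be supported by older browsers.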

On platforms whose own character code isn't ISO-8859-1, such as MS-DOS and Mac OS, there may be problems: you have to use text transfer methods that convert between the platform's own code and ISO-8859-1 (e.g., Fetch for the Mac), or convert separately (e.g., GNU recode). Using 7-bit ASCII with entities avoids those problems, but this FAQ is too small to cover other possibilities in detail. Mac users - see the notes at <URL:http://www.htmlhelp.com/reference/charset/>.

If you run a web server (httpd) on a platform whose own character code isn't ISO-8859-1, such as a Mac or an IBM mainframe, then it's the job of the server to convert text documents into ISO-8859-1 code when sending them to the network.

If you want to use characters not in ISO-8859-1, you must use HTML 4 or XHTML rather than HTML 3.2, choose an appropriate alternative character set (and for certain character sets, choose the encoding system too), and use one method or other of specifying this. See the HTML 4.01 Recommendation at <URL:http://www.w3.org/TR/html4/> and the Babel site at <URL:http://babel.alis.com:8080/> for more details. Another useful resource for internationalization issues is at <URL:http://ppewww.ph.gla.ac.uk/%7Eflavell/charset/>.

3.4. Should I put quotes around attribute values?

It is never wrong to quote attribute values, and many people recommend quoting all attribute values even when the quotation marks are technically optional. XHTML 1.0 requires all attribute values to be quoted. Like previous HTML specifications, HTML 4 allows attribute values to remain unquoted in many circumstances (e.g., when the value contains only letters and digits). See <URL:http://www.w3.org/TR/html4/intro/sgmltut.html#attributes> for the exact rules.
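The HTML 4 rule can be expressed as a small check. The sketch below assumes the character list given in the HTML 4 specification (letters, digits, hyphens, periods, underscores, and colons); the function name may_be_unquoted is hypothetical, and when in doubt it is simpler to quote everything.

```python
import re

# Characters that HTML 4 permits in an unquoted attribute value.
_UNQUOTED_OK = re.compile(r"[A-Za-z0-9._:-]+")


def may_be_unquoted(value: str) -> bool:
    """Return True if HTML 4 allows this attribute value without quotes."""
    return _UNQUOTED_OK.fullmatch(value) is not None


for value in ["10", "center", "image.gif", "50%", "/faq/html/", "my image.gif"]:
    verdict = "quotes optional" if may_be_unquoted(value) else "quotes required"
    print(repr(value), verdict)
```

Note that common values such as percentages and URLs containing slashes fall in the "quotes required" group.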

Be careful when your attribute value includes double quotes, for instance when you want ALT text like "the "King of Comedy" takes a bow" for an image. Humans can parse that to know where the quoted material ends, but browsers can't. You have to code the attribute value specially so that the first interior quote doesn't terminate the value prematurely. There are two main techniques:

Escape any quotes inside the value with &#34; so you don't terminate the value prematurely: ALT="the &#34;King of Comedy&#34; takes a bow". (&quot; is not part of the formal HTML 3.2 spec, though most current browsers support it.)
Use single quotes to enclose the attribute value: ALT='the "King of Comedy" takes a bow'.

Both these methods are correct according to the spec and are supported by current browsers, but both were poorly supported in some earlier browsers. The only truly safe advice is to rewrite the text so that the attribute value need not contain quotes, or to change the interior double quotes to single quotes, like this: ALT="the 'King of Comedy' takes a bow".
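When attribute values are generated by a program, the escaping can be automated. The following Python sketch is one possible approach, not the only one: it uses the standard html.escape function, which writes interior double quotes as &quot; (so the HTML 3.2 caveat above applies), and then parses the result back to confirm the value survives. The names img_tag and AltCollector and the file name king.gif are hypothetical.

```python
from html import escape
from html.parser import HTMLParser


def img_tag(src: str, alt: str) -> str:
    """Build an IMG tag whose attribute values are safely quoted."""
    return '<IMG SRC="{}" ALT="{}">'.format(
        escape(src, quote=True), escape(alt, quote=True)
    )


class AltCollector(HTMLParser):
    """Collect the decoded ALT values seen while parsing."""

    def __init__(self):
        super().__init__()
        self.alts = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases attribute names and decodes references.
        for name, value in attrs:
            if name == "alt":
                self.alts.append(value)


tag = img_tag("king.gif", 'the "King of Comedy" takes a bow')
print(tag)

collector = AltCollector()
collector.feed(tag)
print(collector.alts[0])
```

The second line printed is the original text with its interior double quotes restored, which shows that the escaped value is not cut short at the first interior quote.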

3.5. How can I include comments in HTML?

A comment declaration starts with "<!", followed by zero or more comments, followed by ">". A comment starts and ends with "--", and does not contain any occurrence of "--" between the beginning and ending pairs. This means that the following are all legal HTML comments:

<!-- Hello -->
<!-- Hello -- -- Hello-->
<!---->
<!------ Hello -->
<!>

But some browsers do not support the full syntax, so we recommend you follow this simple rule to compose valid and accepted comments:

An HTML comment begins with "<!--", ends with "-->", and does not contain "--" or ">" anywhere in the comment. Do not put comments inside tags (i.e., between "<" and ">") in HTML markup.
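The simple rule is easy to check mechanically. Below is a minimal Python sketch; the function name is_safe_comment is hypothetical. As an extra precaution that is an assumption of this sketch rather than part of the rule as stated, it also rejects a comment body that ends with "-", since that hyphen would combine with the closing delimiter to form an earlier "--".

```python
def is_safe_comment(text: str) -> bool:
    """Check a complete comment against the simple, widely accepted rule."""
    opener, closer = "<!--", "-->"
    # The length test rejects strings like "<!-->" where the opening and
    # closing delimiters would overlap.
    if len(text) < len(opener) + len(closer):
        return False
    if not (text.startswith(opener) and text.endswith(closer)):
        return False
    body = text[len(opener):-len(closer)]
    if "--" in body or ">" in body:
        return False
    # Sketch-specific precaution: "...--->" hides a "--" before the end.
    return not body.endswith("-")


for candidate in ["<!-- Hello -->", "<!-- Hello -- -- Hello-->", "<!>", "<!-- a > b -->"]:
    print(candidate, is_safe_comment(candidate))
```

Only the first candidate passes. The second and third are legal according to the full syntax but fall outside the simple rule, which is the point of the recommendation.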

See <URL:http://www.htmlhelp.com/reference/wilbur/misc/comment.html> for a more complete discussion.

3.6. How can I avoid using the whole URL?

The URL structure defines a hierarchy similar to a filesystem's hierarchy of subdirectories or folders. The segments of a URL are separated by slash characters ("/"). When navigating the URL hierarchy, the final segment of the URL (i.e., everything after the final slash) is similar to a file in a filesystem. The other segments of the URL are similar to the subdirectories and folders in a filesystem.

A relative URL omits some of the information needed to locate the referenced document. The omitted information is assumed to be the same as for the base document that contains the relative URL. This reduces the length of the URLs needed to refer to related documents, and allows document trees to be accessed via multiple access schemes (e.g., "file", "http", and "ftp") or to be moved without changing any of the embedded URLs in those documents.

Before the browser can use a relative URL, it must resolve the relative URL to produce an absolute URL. If the relative URL begins with a double slash (e.g., //www.htmlhelp.com/faq/html/), then it will inherit only the base URL's scheme. If the relative URL begins with a single slash (e.g., /faq/html/), then it will inherit the base URL's scheme and network location.

If the relative URL does not begin with a slash (e.g., all.html , ./all.html or ../html/), then it has a relative path and is resolved as follows.

The browser strips everything after the last slash in the base document's URL and appends the relative URL to the result.
Each "." segment is deleted (e.g., ./all.html is the same as all.html, and ./ refers to the current "directory" level in the URL hierarchy).
Each ".." segment moves up one level in the URL hierarchy; the ".." segment is removed, along with the segment that precedes it (e.g., foo/../all.html is the same as all.html, and ../ refers to the parent "directory" level in the URL hierarchy).
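The three steps above can be sketched in Python as follows. This is a simplified illustration that works on the path portion only and ignores query strings and fragments; the function name resolve_relative_path is hypothetical.

```python
def resolve_relative_path(base_path: str, relative: str) -> str:
    """Resolve a relative path against the path of the base document."""
    # Step 1: strip everything after the last slash, then append.
    merged = base_path[: base_path.rfind("/") + 1] + relative

    segments = merged.split("/")
    output = []
    for index, segment in enumerate(segments):
        is_last = index == len(segments) - 1
        if segment == ".":
            # Step 2: "." segments are deleted.
            if is_last:
                output.append("")  # keep the trailing slash
        elif segment == "..":
            # Step 3: ".." removes the preceding segment, but never the
            # empty segment that stands for the leading slash.
            if len(output) > 1:
                output.pop()
            if is_last:
                output.append("")  # keep the trailing slash
        else:
            output.append(segment)
    return "/".join(output)


base = "/faq/html/basics.html"
for relative in ["all.html", "./all.html", "foo/../all.html", "./", "../"]:
    print(relative, "->", resolve_relative_path(base, relative))
```

The first three relative paths all resolve to /faq/html/all.html, "./" resolves to /faq/html/, and "../" resolves to /faq/.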

Some examples may help make this clear. If the base document is <URL:http://www.htmlhelp.com/faq/html/basics.html>, then

all.html and ./all.html: refer to <URL:http://www.htmlhelp.com/faq/html/all.html>
./: refers to <URL:http://www.htmlhelp.com/faq/html/>
../: refers to <URL:http://www.htmlhelp.com/faq/>
../cgifaq.html: refers to <URL:http://www.htmlhelp.com/faq/cgifaq.html>
../../reference/: refers to <URL:http://www.htmlhelp.com/reference/>
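These resolutions can be reproduced with Python's standard urllib.parse.urljoin function, which applies the same kind of resolution a browser performs. The sketch below is only a way to experiment; the base URL is the example document written with a single slash after the hostname.

```python
from urllib.parse import urljoin

BASE = "http://www.htmlhelp.com/faq/html/basics.html"

relative_urls = [
    "all.html",
    "./all.html",
    "./",
    "../",
    "../cgifaq.html",
    "../../reference/",
    "/faq/html/",                   # inherits scheme and network location
    "//www.htmlhelp.com/faq/html/", # inherits only the scheme
]

for relative in relative_urls:
    print(relative, "->", urljoin(BASE, relative))
```

The last two entries correspond to the single-slash and double-slash cases described earlier; both resolve to http://www.htmlhelp.com/faq/html/.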

Please note that the browser resolves relative URLs, not the server. The server sees only the resulting absolute URL. Also, relative URLs navigate the URL hierarchy. The relationship (if any) between the URL hierarchy and the server's filesystem hierarchy is irrelevant.

For a full discussion of the proper form of URLs, see <URL:http://www.w3.org/Addressing/>.

3.7. Should I end my URLs with a slash?

When resolving relative URLs (see the answer to the previous question), the browser's first step is to strip everything after the last slash in the URL of the current document. If the current document's URL ends with a slash, then the final segment (the "file") of the URL is null. If you remove the final slash, then the final segment of the URL is no longer null; it is whatever follows the final remaining slash in the URL. Removing the slash changes the URL; the modified URL refers to a different document and relative URLs will resolve differently.

For example, the final segment of the URL http://www.htmlhelp.com/faq/html/ is empty; there is nothing after the final slash. In this document, the relative URL all.html resolves to http://www.htmlhelp.com/faq/html/all.html (an existing document). If the final slash is omitted, then the final segment of the modified URL http://www.htmlhelp.com/faq/html is "html". In this (nonexistent) document, the relative URL all.html would resolve to http://www.htmlhelp.com/faq/all.html (another nonexistent document).
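The difference is easy to demonstrate with Python's standard urllib.parse.urljoin function. This is only a sketch for experimenting with the example above; the variable names are illustrative.

```python
from urllib.parse import urljoin

with_slash = "http://www.htmlhelp.com/faq/html/"
without_slash = "http://www.htmlhelp.com/faq/html"

# With the final slash, the last segment is empty, so nothing is stripped
# except that empty segment before all.html is appended.
print(urljoin(with_slash, "all.html"))

# Without it, "html" is the last segment and is stripped first.
print(urljoin(without_slash, "all.html"))
```

The two calls produce different URLs, one under /faq/html/ and one under /faq/, even though the base URLs differ by a single character.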

When they receive a request that is missing its final slash, web servers cannot ignore the missing slash and just send the document anyway. Doing so would break any relative URLs in the document. Normally, servers are configured to send a redirection message when they receive such a request. In response to the redirection message, the browser requests the correct URL, and then the server sends the requested document. (By the way, the browser does not and cannot correct the URL on its own; only the server can determine whether the URL is missing its final slash.)
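The server-side decision can be pictured with a toy model. The Python sketch below is purely hypothetical: the respond function, the two sets describing the site, and the choice of status code 301 for the redirection are all assumptions made for illustration, not a description of any particular server.

```python
from typing import Optional, Set, Tuple


def respond(path: str, directories: Set[str], files: Set[str]) -> Tuple[int, Optional[str]]:
    """Return (status code, redirect target) for a requested path.

    Directory-like URLs are stored with their final slash.
    """
    if path in files or path in directories:
        return 200, None           # send the document
    if path + "/" in directories:
        return 301, path + "/"     # tell the browser the correct URL
    return 404, None               # nothing matches, with or without slash


directories = {"/faq/", "/faq/html/"}
files = {"/faq/html/all.html"}

print(respond("/faq/html", directories, files))
print(respond("/faq/html/", directories, files))
```

The first request costs an extra round trip: the browser receives the redirect, then has to request /faq/html/ before it gets the document.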

This error-correction process means that URLs without their final slash will still work. However, this process wastes time and network resources. If you include the final slash when it is appropriate, then browsers won't need to send a second request to the server.

The exception is when you refer to a URL with just a hostname (e.g., http://www.htmlhelp.com). In this case, the browser will assume that you want the main index ("/") from the server, and you do not have to include the final slash. However, many regard it as good style to include it anyway.

For a full discussion of the proper form of URLs, see <URL:http://www.w3.org/Addressing/>.

3.8. How can I check for errors?

Various software is available to find errors in your web documents automatically. HTML validators are programs that check HTML documents against a formal definition of HTML syntax and then output a list of errors. Validation is important to give the best chance of correctness on unknown browsers (both existing browsers that you haven't seen and future browsers that haven't been written yet).

HTML linters (checkers) are also useful. These programs check documents for specific portability problems, including some caused by invalid markup and others caused by common browser bugs. Linters may pass some invalid documents, and they may fail some valid ones.

All validators are functionally equivalent; while they may have different reporting styles, they will find the same errors given identical input. Different linters are programmed to look for different problems, so their reports will vary significantly from each other. Also, some programs that are called validators (e.g. the "CSE HTML Validator") are really linters/checkers. They are still useful, but they should not be confused with real HTML validators.

When checking a site for errors for the first time, it is often useful to identify common problems that occur repeatedly in your markup. Fix these problems everywhere they occur (with an automated process if possible), and then go back to identify and fix the remaining problems.

While checking for errors in the HTML, it is also a good idea to check for hypertext links which are no longer valid. There are several link checkers available for various platforms which will follow all links on a site and return a list of the ones which are non-functioning.
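The first half of a link checker's job, finding the links and resolving them to absolute URLs, can be sketched with Python's standard library. This is a minimal illustration only: it collects HREF and SRC attributes and makes no network requests, so actually testing each URL is left out. The names LinkCollector and collect_links, the sample markup, and the path /icons/wdg.gif are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Gather absolute URLs from HREF and SRC attributes."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Attribute names arrive lowercased; values may be None.
        for name, value in attrs:
            if value and name in ("href", "src"):
                self.links.append(urljoin(self.base_url, value))


def collect_links(markup: str, base_url: str) -> list:
    collector = LinkCollector(base_url)
    collector.feed(markup)
    collector.close()
    return collector.links


page = (
    '<A HREF="all.html">All</A> '
    '<IMG SRC="/icons/wdg.gif" ALT="WDG"> '
    '<A HREF="http://validator.w3.org/">W3C</A>'
)
for url in collect_links(page, "http://www.htmlhelp.com/faq/html/basics.html"):
    print(url)
```

A real link checker would then request each collected URL and report those that fail; the tools listed below do this far more thoroughly.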

You can find a list of validators, linters, and link checkers at <URL:http://www.htmlhelp.com/links/validators.htm>. Especially recommended is the use of an SGML-based validator such as the WDG HTML Validator <URL:http://www.htmlhelp.com/tools/validator/> or W3C HTML Validation Service <URL:http://validator.w3.org/>.

3.9. What is a DOCTYPE? Which one do I use?

According to HTML standards, each HTML document begins with a DOCTYPE declaration that specifies which version of HTML the document uses. The DOCTYPE declaration is useful primarily to SGML-based tools like HTML validators, which must know which version of HTML to use in checking the document's syntax. Browsers generally ignore DOCTYPE declarations.

See <URL:http://www.htmlhelp.com/tools/validator/doctype.html> for information on choosing an appropriate DOCTYPE declaration.

Note that the public identifier section of the DOCTYPE declaration is case sensitive. Some versions of Netscape Composer are known to insert the lower-case "-//w3c//dtd html 4.0 transitional//en", rather than the correct mixed-case "-//W3C//DTD HTML 4.0 Transitional//EN".
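A case mistake of this kind can be caught with a simple comparison. The Python sketch below is an illustration under stated assumptions: the list of public identifiers is a small sample of W3C identifiers rather than a complete catalog, the regular expression handles only the common double-quoted form, and the function name check_public_id is hypothetical.

```python
import re

# A sample of public identifiers; not a complete list.
KNOWN_PUBLIC_IDS = (
    "-//W3C//DTD HTML 4.01//EN",
    "-//W3C//DTD HTML 4.01 Transitional//EN",
    "-//W3C//DTD HTML 4.01 Frameset//EN",
    "-//W3C//DTD HTML 4.0 Transitional//EN",
    "-//W3C//DTD HTML 3.2 Final//EN",
)

_DOCTYPE = re.compile(r'<!DOCTYPE\s+HTML\s+PUBLIC\s+"([^"]*)"', re.IGNORECASE)


def check_public_id(document: str) -> str:
    """Report whether the DOCTYPE's public identifier is spelled exactly."""
    match = _DOCTYPE.search(document)
    if match is None:
        return "no DOCTYPE with a public identifier found"
    public_id = match.group(1)
    if public_id in KNOWN_PUBLIC_IDS:
        return "ok"
    # The comparison above is case sensitive, so look for a near miss.
    for known in KNOWN_PUBLIC_IDS:
        if known.lower() == public_id.lower():
            return "wrong case; expected " + known
    return "unrecognized public identifier"


print(check_public_id('<!DOCTYPE HTML PUBLIC "-//w3c//dtd html 4.0 transitional//en">'))
print(check_public_id('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">'))
```

The first call reports the lower-case identifier as a case error and names the expected spelling; the second reports "ok".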

Copyright © 1996-2001 by the Web Design Group. This material may be distributed only subject to the terms and conditions set forth in the Open Publication License, v1.0 or later (the latest version is presently available at http://www.opencontent.org/openpub/).