1: What is XML?

XML is a method for putting structured data in a text file. Structured data means things like spreadsheets, financial transactions, configuration parameters, address books, technical drawings, or anything else with a regular, describable layout. Technically, XML is not even a language by itself - instead, it is a set of rules for designing text formats for data, so that they are easy for programs to read and produce, without problems of ambiguity, platform dependency, or lack of extensibility. When a program says that it uses XML, this means that the format it stores data in, or passes data across the network in, was designed using the rules of XML.

XML files are plain text, just like HTML. (The text is Unicode, normally encoded as UTF-8, of which ASCII is a subset.) Thus, unlike many binary data formats, if there is a problem with a file, you can look at it without the aid of a special program. This allows for easier debugging of client programs while developing them, and makes recovery of data from corrupted data files easier as well. One minor drawback of having XML based entirely on text is that an XML file will usually be larger than an equivalent binary file. However, this is not as large a drawback as it seems, since hard drive space is becoming less and less expensive, and there are compression utilities like zip and gzip available to compress the files for smaller storage.
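As a rough illustration of the size trade-off, the sketch below compresses a small XML document with Python's standard gzip module. The address book contents and the number of entries are invented for the example, and real savings will depend on the data.

```python
import gzip

# A made-up, highly repetitive XML document (hypothetical data, for illustration only).
rows = "".join(
    f"<entry><name>Person {n}</name><phone>555-{n:04d}</phone></entry>"
    for n in range(200)
)
xml_text = f'<?xml version="1.0"?><addressbook>{rows}</addressbook>'

raw = xml_text.encode("utf-8")
packed = gzip.compress(raw)

# Repeated tag names compress well, so the packed form is smaller than the text.
print(len(raw), "bytes as plain text")
print(len(packed), "bytes after gzip")

# Decompressing gives back exactly the same text.
restored = gzip.decompress(packed).decode("utf-8")
```

The compressed bytes are no longer human-readable, so the file has to be decompressed again before it can be inspected by eye.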

It is important to remember that XML is not HTML, and HTML is not XML. Despite many similarities, you should not fool yourself into thinking that the two are the same, or that one is a subset of the other. While both use tags to delimit text, and attributes inside of those tags to configure them, there are important differences at the core of each. The rules for XML are much stricter than those for HTML, and misplaced closing tags or quotation marks can render the entire XML file unusable, as XML was not designed to be error-tolerant like HTML was. In fact, the XML specification explicitly forbids applications from second-guessing the content of the XML file - if something does not fit, the application is to quit reading the file and issue an error. Also, tags in XML do not have explicit meanings pre-defined for them like they do in HTML. In one XML file, the <b> tag may mean the same thing it does in HTML (bold text), and in another, it may be a tag used to specify the Back of a post card, or the second item in an ordered list which uses letters for identifiers. The XML-based version of HTML is called XHTML; it is a version of HTML that was modified to fulfill the requirements of being an XML-based language.
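The strictness described above can be seen with Python's standard xml.etree.ElementTree parser. This is only a sketch: the card documents and the try_parse helper are made up for the example, and here <b> stands for the back of a post card rather than bold text.

```python
import xml.etree.ElementTree as ET

well_formed = "<card><b>Greetings from the lake</b></card>"
broken = "<card><b>Greetings from the lake</card>"  # <b> is never closed

def try_parse(text):
    """Return the root tag if text is well-formed XML, otherwise None."""
    try:
        return ET.fromstring(text).tag
    except ET.ParseError as err:
        # The parser stops at the first error instead of guessing a repair.
        print("rejected:", err)
        return None

print(try_parse(well_formed))
print(try_parse(broken))
```

A web browser would quietly repair the second document if it were HTML; an XML parser refuses it outright.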

XML stands for Extensible Markup Language. Its development began in 1996, as a descendant of SGML (Standard Generalized Markup Language), which was developed in the early 1980s and became an ISO standard in 1986.

2: What can XML be used for?

XML is used for many things. Among them are the configuration files for Firefox, the word processor and spreadsheet files of the OpenDocument format used by OpenOffice, and data and configuration files in a wide variety of other applications. (In Windows, the Firefox files are typically stored in a sub-directory of C:\WINDOWS\Application Data\Mozilla\Profiles\default\, with the extension .rdf.)

One major example of an application that uses XML for configuration is the video game Civilization IV, which users can modify and expand, adding features or transforming it into a completely different game, just by creating or altering its XML-based configuration files. Some network-based video games also use an XML-based format for transferring information across the network between the client computers and the server.

There are also several other applications of XML available, such as the W3C's MathML (an XML-based language for describing mathematical notation in a document) and SVG, an XML-based language for creating scalable vector images that can be embedded in web pages.

3: How is it done?

The first step in creating an XML-based language is to decide what kind of data you wish to represent, and then to determine a simple way of expressing it using tags, much like you would see in an XHTML file. As an example, here is a simple recipe for a cook book, which we can convert into an XML file: (WARNING: This recipe is probably completely inedible, and I don't recommend actually trying it!)

Simple Soup

1 1/2 tb  Corn oil
1 md      Onion, finely chopped (1/2 cup)
1 c       Packed shredded carrot (2 medium)
1 cl      Garlic, minced (1 heaping Tablespoon)
1 ts      Dried basil, crumbled
4 ts      Oregano, crumbled
1/2 ts    Fresh ground black pepper, To taste
4 c       Water
2 tb      Vinegar

Saute onion, carrot, and garlic, in the oil in a large skillet over medium heat, stirring them often, for about 5 minutes. Remove the vegetables from the heat and let them cool to room temperature. In a large pot, combine the vegetables with the spices, water, and vinegar, and mix well. Cook over medium heat for 30 minutes and serve.

Converted into a very simple XML type file, this recipe may look something like this:

<?xml version="1.0"?>
<recipe>
  <title> Simple Soup </title>
  <ingredients>
    <i>
      <item>Corn oil</item>
      <measure units="tb">1.5</measure>
    </i>
    <i>
      <item>Onion,finely chopped</item>
      <measure units="Cup">0.5</measure>
    </i>
    <i>
      <item>Packed shredded carrot</item>
      <measure units="Cup">1</measure>
    </i>
    <i>
      <item>Garlic,minced</item>
      <measure units="clove or heaping tablespoon">1</measure>
    </i>
    <i>
      <item>Dried basil,crumbled</item>
      <measure units="ts">1</measure>
    </i>
    <i>
      <item>Oregano,crumbled</item>
      <measure units="ts">4</measure>
    </i>
    <i>
      <item>Fresh ground black pepper, To taste</item>
      <measure units="ts">0.5</measure>
    </i>
    <i>
      <item>Water</item>
      <measure units="Cup">4</measure>
    </i>
    <i>
      <item>Vinegar</item>
      <measure units="tb">2</measure>
    </i>
  </ingredients>
  <instructions>
    <step> Saute onion, carrot, and garlic, in the oil in a large skillet over medium heat, stirring them often, for about 5 minutes. </step>
    <step> Remove the vegetables from the heat and let them cool to room temperature. </step>
    <step> In a large pot, combine the vegetables with the spices, water, and vinegar, and mix well. </step>
    <step> Cook over medium heat for 30 minutes and serve. </step>
  </instructions>
</recipe>
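To show what a program gains from this structure, here is a minimal sketch that reads a shortened copy of the recipe with Python's standard xml.etree.ElementTree module. The variable names are made up for the example, and the recipe is trimmed to two ingredients and one step to keep it short.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the recipe file (two ingredients, one step).
recipe_xml = """<?xml version="1.0"?>
<recipe>
  <title> Simple Soup </title>
  <ingredients>
    <i> <item>Corn oil</item> <measure units="tb">1.5</measure> </i>
    <i> <item>Water</item> <measure units="Cup">4</measure> </i>
  </ingredients>
  <instructions>
    <step> Cook over medium heat for 30 minutes and serve. </step>
  </instructions>
</recipe>"""

root = ET.fromstring(recipe_xml)
title = root.findtext("title").strip()

# Each <i> holds one <item> and one <measure>; the units live in an attribute.
ingredients = []
for entry in root.find("ingredients").findall("i"):
    measure = entry.find("measure")
    ingredients.append(
        (entry.findtext("item"), float(measure.text), measure.get("units"))
    )

steps = [step.text.strip() for step in root.iter("step")]

print(title)
for name, amount, units in ingredients:
    print(f"  {amount} {units} {name}")
```

Because the amount and the units are stored separately, the program never has to guess where "1 1/2 tb" ends and "Corn oil" begins.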

Now that we've created a sample XML file, we should also define the grammar of that file, so that we can create more recipe files for other food items. There are several methods of specifying the structure of an XML file; the two covered here are DTDs and XML Schema.

3.1: DTDs

DTD stands for Document Type Definition, and was the first method used for specifying XML file structures. The language used for a DTD, while similar to XML in that it uses tags, is not actually XML itself; its syntax is inherited from SGML (Standard Generalized Markup Language), the older markup language from which XML was derived. The most commonly used DTD files are the ones for the XHTML language, which are referenced at the start of all valid XHTML files. The DTD for XHTML 1.0 Transitional can be found at http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd.

There are only four tags in common use in a DTD file, and they are as follows:

<!-- ... -->      Comment tags
<!ELEMENT ... >   Element tags
<!ATTLIST ... >   Attribute Listing tags
<!ENTITY ... >    "Entity" tags

The comment tag in a DTD file works exactly the same as it would in an HTML or XHTML file - it is there to allow the creator of the file to include notes to give meaning to the contents of the file, and if needed designate names for sections of that file to make it easier for humans to read.

The <!ELEMENT tag defines the tags in the XML file, and specifies what is allowed to go inside of them. Thus, in the example of the recipe file, the <!ELEMENT tag for the recipe tag would look like this:

<!ELEMENT recipe (title, ingredients, instructions) >

First, the name of the tag (recipe) is specified, and then a list of the tags allowed to occur inside of that tag is given in parentheses. Tags separated by commas must appear in the order listed, and a tag name followed by + may be repeated one or more times. In cases where raw text is allowed inside of the tag instead of other tags, the special keyword #PCDATA is used. Similarly, if a tag is required to have no contents, then the special keyword EMPTY is used in place of the parenthesized list. (Because the keyword is written without parentheses, it does not clash with a tag that happens to be named "EMPTY". Also, due to the case-sensitivity of XML, tags named "empty", "Empty", and "EMPTY" are all different.)

Next, the <!ATTLIST tag specifies what the attributes on an XML tag are allowed to be. Attributes are the extra parameters on an XML tag, such as the src="file.gif" on the XHTML img tag, or the units="Cup" attribute on the measure tag we used in the recipe file above. The format is similar to the <!ELEMENT tag: you name the tag you are giving attributes for, and then give a list of the attributes for that tag. For our recipe file, the tag would look like this:

<!ATTLIST measure units CDATA #IMPLIED >

It is important to remember to specify an <!ELEMENT tag for every tag that you give an <!ATTLIST for, conventionally placing the <!ELEMENT first. The <!ATTLIST does not create the tag; it only allows it to have attributes added on.

The <!ATTLIST entries also have parameters which specify what the attribute is allowed to contain, and whether the attribute is required or has a default value. In the example above, CDATA is a special keyword which signifies that the attribute may contain any "Character Data". Alternatively, you could specify a list of possible values in parentheses, such as (0|1|2), which would signify that the acceptable values were "0", "1", or "2". The final part of the attribute listing consists of either #IMPLIED to say that the attribute is optional in the file, #REQUIRED to say that the attribute must be defined for the file to be valid, or else a quoted default value, which will be assumed by the program reading the file if the attribute is not defined. An example of an ATTLIST giving each of these would look like this:

<!ELEMENT myTag (#PCDATA) >
<!ATTLIST myTag
    thisAttrib CDATA #IMPLIED
    theColour  (black|silver|gray|white|maroon|red|purple|fuchsia|green|lime|olive|yellow|navy|blue|teal|aqua) #REQUIRED
    thatAttrib (0|1) "0" >
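The default-value behaviour can be sketched with Python's standard xml.etree.ElementTree module. Two things here are assumptions of the example rather than part of the tutorial: the declarations are placed inline in a DOCTYPE so the snippet is self-contained, and the sample myTag document is made up. Note also that this parser reads the declarations but does not validate the document against them.

```python
import xml.etree.ElementTree as ET

# The DTD declarations sit inside the DOCTYPE (an "internal subset").
doc = """<?xml version="1.0"?>
<!DOCTYPE myTag [
  <!ELEMENT myTag (#PCDATA)>
  <!ATTLIST myTag
      thisAttrib CDATA   #IMPLIED
      thatAttrib (0|1)   "0">
]>
<myTag>hello</myTag>"""

root = ET.fromstring(doc)

# thatAttrib was not written on the tag, so the declared default is supplied.
print("thatAttrib:", root.get("thatAttrib"))

# thisAttrib is #IMPLIED (optional, no default), so it is simply absent.
print("thisAttrib:", root.get("thisAttrib"))
```

A validating parser would additionally reject a value such as thatAttrib="2", since only 0 and 1 are listed.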

The last of the tags for DTDs is <!ENTITY, which is used to define frequently re-used sections of text as macros, to save space later. (The form shown here, with a % sign, is called a parameter entity.) An example of this tag being used would look like this:

<!ENTITY % colornames "black|silver|gray|white|maroon|red|purple|fuchsia|green|lime|olive|yellow|navy|blue|teal|aqua" >

There are a few important things to note here. First is the name of the entity being created, colornames. Any time it is actually used inside of the DTD file, it is written as %colornames; (with no space, and with a trailing semicolon). In the <!ENTITY tag itself, however, there is a space between the % and the name of the entity. This allows you to search through a DTD file to find either the definition of the entity (search with the space) or all of the places where the entity is actually used (search with no space). Entities can be used throughout a DTD file to save repetitive typing. For example, in the DTD for XHTML 1.0 Transitional, there is an entity which defines the names and contents of the basic set of attributes which are allowed in every XHTML tag. Then, instead of having to include lines specifying that each tag can have an id, style, class, and so on, the authors simply use the entity %attrs; to represent them in the 73 places where they would otherwise have had to re-type those items. (This represents a massive reduction in the size of the file, since a total of 17 lines is represented each of those 73 times by only 7 characters. That's over 1000 un-needed lines removed from the file.)
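The macro idea can be imitated in a few lines of Python. This is a toy text substitution, not how a real XML parser is implemented: the two regular expressions, the shortened colour list, and the myTag declaration are all assumptions made for the sketch.

```python
import re

# A tiny DTD fragment: one entity definition and one place where it is used.
dtd_text = '''
<!ENTITY % colornames "black|white|red">
<!ATTLIST myTag theColour (%colornames;) #REQUIRED>
'''

# Definitions have a space after the %; references have none and end with ";".
definition = re.compile(r'<!ENTITY\s+%\s+([\w.-]+)\s+"([^"]*)"\s*>')
reference = re.compile(r'%([\w.-]+);')

# Collect every name -> replacement text pair, then substitute each reference.
entities = dict(definition.findall(dtd_text))
expanded = reference.sub(lambda match: entities[match.group(1)], dtd_text)

print(expanded)
```

After the substitution, the ATTLIST line reads as if the colour list had been typed out in full, which is the effect the entity has on whoever reads the DTD.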

An example of a DTD for the recipe language we've created would look like this:

<!ELEMENT recipe (title, ingredients, instructions) >
<!ELEMENT title (#PCDATA) >
<!ELEMENT ingredients (i+) >
<!ELEMENT instructions (step+) >
<!ELEMENT i (item, measure) >
<!ELEMENT item (#PCDATA) >
<!ELEMENT measure (#PCDATA) >
<!ATTLIST measure units CDATA #IMPLIED >
<!ELEMENT step (#PCDATA) >
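To make the meaning of these rules concrete, the sketch below re-states them by hand as a Python table and checks a document against it. It is not a DTD validator (Python's standard library does not include one), it ignores attributes and text, and the RULES table and conforms function are names invented for the example.

```python
import xml.etree.ElementTree as ET

# Hand-written mirror of the DTD: a list is an exact sequence of child tags,
# a string ending in "+" means "one or more of this tag".
RULES = {
    "recipe": ["title", "ingredients", "instructions"],
    "ingredients": "i+",
    "instructions": "step+",
    "i": ["item", "measure"],
    "title": [], "item": [], "measure": [], "step": [],  # text only, no child tags
}

def conforms(element):
    """Check one element and, recursively, its children against RULES."""
    rule = RULES.get(element.tag)
    if rule is None:
        return False  # the tag was never declared
    children = [child.tag for child in element]
    if isinstance(rule, str):  # the "name+" form
        name = rule.rstrip("+")
        ok = len(children) >= 1 and all(tag == name for tag in children)
    else:
        ok = children == rule
    return ok and all(conforms(child) for child in element)

good = ET.fromstring(
    "<recipe><title>T</title>"
    "<ingredients><i><item>Water</item><measure units='Cup'>4</measure></i></ingredients>"
    "<instructions><step>Boil.</step></instructions></recipe>"
)
# Same tags, but instructions come before ingredients.
bad = ET.fromstring(
    "<recipe><title>T</title>"
    "<instructions><step>Boil.</step></instructions>"
    "<ingredients><i><item>Water</item><measure units='Cup'>4</measure></i></ingredients>"
    "</recipe>"
)

print(conforms(good), conforms(bad))
```

The second document is well-formed XML, but it breaks the order required by the recipe rule, so it would not be a valid recipe file.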

3.2: XML Schema

XML Schema is a newer method of specifying XML file structure, and a schema is itself written in an XML-based language. It is considerably more flexible than the DTD method, provides more support for advanced XML features such as namespaces, and allows much more detailed constraints on an XML file. The cost of this extra flexibility and expressiveness is that an XML Schema is more complicated than a DTD. For example, the definition for the recipe file above in an XML Schema Definition would look like this:

<?xml version="1.0" encoding="ISO-8859-1"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <xs:element name="recipe" type="Recipe"/>

  <xs:complexType name="Recipe">
    <xs:sequence>
      <xs:element name="title" type="xs:string"/>
      <xs:element name="ingredients" type="Ingredients"/>
      <xs:element name="instructions" type="Instructions"/>
    </xs:sequence>
  </xs:complexType>

  <xs:complexType name="Ingredients">
    <xs:sequence>
      <xs:element name="i" type="I" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>

  <xs:complexType name="I">
    <xs:sequence>
      <xs:element name="item" type="xs:string"/>
      <xs:element name="measure" type="Measure"/>
    </xs:sequence>
  </xs:complexType>

  <!-- A decimal amount that may also carry a units attribute -->
  <xs:complexType name="Measure">
    <xs:simpleContent>
      <xs:extension base="xs:decimal">
        <xs:attribute name="units" type="xs:string"/>
      </xs:extension>
    </xs:simpleContent>
  </xs:complexType>

  <xs:complexType name="Instructions">
    <xs:sequence>
      <xs:element name="step" type="xs:string" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>

</xs:schema>
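Because a schema is itself an XML file, ordinary XML tools can read it. The sketch below loads a small fragment of a recipe schema with Python's standard xml.etree.ElementTree module and lists the elements it declares. The fragment is cut down for the example, and the code only reads the schema; it does not validate a recipe against it, which needs a separate library.

```python
import xml.etree.ElementTree as ET

# The namespace that marks a tag as belonging to XML Schema.
XS = "http://www.w3.org/2001/XMLSchema"

schema_fragment = f"""<xs:schema xmlns:xs="{XS}">
  <xs:element name="recipe" type="Recipe"/>
  <xs:complexType name="Recipe">
    <xs:sequence>
      <xs:element name="title" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>"""

root = ET.fromstring(schema_fragment)

# ElementTree spells a namespaced tag as {namespace-uri}localname,
# so xs:element becomes {http://www.w3.org/2001/XMLSchema}element.
declared = [
    (element.get("name"), element.get("type"))
    for element in root.iter(f"{{{XS}}}element")
]

for name, type_name in declared:
    print(name, "->", type_name)
```

The xs: prefix is only a local abbreviation; what identifies the tags is the namespace address it is bound to.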

An excellent tutorial on how to create an XML Schema can be found at the W3Schools page on XML Schema.