[Reader-list] URLs exposing server architecture

Wed Mar 23 18:06:39 IST 2005

Follow-up to my last post on The URL as User Interface:  
http://mail.sarai.net/pipermail/reader-list/2005-February/005074.html

Part 3: URLs exposing server architecture.

An HTML file typically carries an extension of ".html". However,  
consider these examples:

1. http://www.bbc.co.uk/worldservice/index.shtml
2. http://www.royal.gov.uk/output/Page1.asp
3. http://www.xanga.com/register.aspx
4. http://gimp-print.sourceforge.net/MacOSX.php3
5. http://www.fanniemae.com/index.jhtml
6. http://www.poets.org/index.cfm
7. http://squishdot.org/987802018/index_html
8. http://www.telegram.com/apps/pbcs.dll/frontpage

Each of these carries a different extension or path component that  
reveals the technology platform in use. In order: Apache Server Side  
Includes, Microsoft ASP, ASP.NET, PHP 3, Java, ColdFusion, Zope and  
Windows Dynamic Link Libraries. The trouble with putting such a blatant  
platform signature in the URL is that, should you switch to a different  
platform, all your URLs change. Some platforms, like Zope, are  
insensitive to file extensions: you can use whatever you want and it  
will still work. (In a case of taking this insensitivity too far, Zope  
is littered with index_html URLs.) Others, like Apache-based platforms,  
can be configured to use different extensions, but this typically  
requires a system-wide configuration change that your ISP may not be  
willing to make for you.
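
To make the point concrete, here is a minimal sketch of how little work  
it takes to read a platform off a URL. The lookup table is built only  
from the eight examples above; the function name guess_platform is my  
own invention, and real fingerprinting tools look at far more than the  
path.

```python
import posixpath
from urllib.parse import urlparse

# Extension-to-platform table, taken from the eight example URLs above.
SIGNATURES = {
    ".shtml": "Apache Server Side Includes",
    ".asp": "Microsoft ASP",
    ".aspx": "ASP.NET",
    ".php3": "PHP 3",
    ".jhtml": "Java",
    ".cfm": "ColdFusion",
    ".dll": "Windows Dynamic Link Library",
}


def guess_platform(url):
    """Return the platform a URL path gives away, or None if it gives none."""
    path = urlparse(url).path
    # Check every path segment, since pbcs.dll sits in the middle of a path.
    for segment in path.split("/"):
        if segment == "index_html":
            return "Zope"
        extension = posixpath.splitext(segment)[1].lower()
        if extension in SIGNATURES:
            return SIGNATURES[extension]
    return None


print(guess_platform("http://www.telegram.com/apps/pbcs.dll/frontpage"))
print(guess_platform("http://example.com/about"))
```

A URL such as http://example.com/about (a made-up address) gives the  
function nothing to work with, which is the property to aim for.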

It is best to avoid identifying the platform in your URLs. These  
examples are even worse:

1. http://www.amazon.com/exec/obidos/subst/home/home.html/104-0744072-3248744
2. http://store.apple.com/1-800-MY-APPLE/WebObjects/AppleStore.woa
3. http://plone.org/search?SearchableText=plone&b_start:int=30
4. http://www.telegraph.co.uk/news/main.jhtml?xml=/news/2005/01/30/wgerm30.xml

Notice that all URLs at amazon.com begin with "/exec/obidos", which  
makes that part of the URL semantically meaningless cruft. Further,  
home.html is followed by a slash and another path component. This  
breaks the file and folder hierarchy that the Web is built around.  
Browsers expect that folders contain other folders and files, and that  
files contain no sub-items. This is required for links with relative  
references to resolve properly. (When folder/a.html links to b.html, is  
it referring to folder/b.html or folder/a.html/b.html?) When a path  
component can behave like a file at times and a folder at other times,  
it risks confusing the browser. (Zope's object database also has this  
problem. Zope solves it by inserting a base href tag in all HTML pages.  
This works, but it is not an elegant solution.)
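
The ambiguity can be seen with Python's standard urljoin, which follows  
the same relative reference rules a browser does. This is only a  
sketch: example.com and the folder/a.html names are placeholders, not  
real pages.

```python
from urllib.parse import urljoin

link = "b.html"

# a.html treated as a file: the link resolves against its parent folder.
as_file = urljoin("http://example.com/folder/a.html", link)

# a.html treated as a folder (trailing slash): the link resolves inside it.
as_folder = urljoin("http://example.com/folder/a.html/", link)

# The Amazon example: home.html acts as a folder, and the last path
# component is the one that gets replaced.
amazon = urljoin(
    "http://www.amazon.com/exec/obidos/subst/home/home.html/104-0744072-3248744",
    link,
)

print(as_file)    # http://example.com/folder/b.html
print(as_folder)  # http://example.com/folder/a.html/b.html
print(amazon)     # http://www.amazon.com/exec/obidos/subst/home/home.html/b.html
```

The same link text lands in two different places depending on whether  
the browser believes a.html is a file or a folder.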

The second is the home page of Apple's online store, listing all their  
products. It seems like a simple matter to copy a link to any of the  
products listed there, but you'll find the link does not work when used  
anywhere else. Jim Roepcke deconstructs the Apple Store URL [1] and  
finds that this is because it includes session-related data that is  
not valid for anyone but the user it was generated for.

[1] http://jim.roepcke.com/1721

The third exhibits a characteristic of the Zope platform (which Plone  
is built on). The "b_start:int" in the URL signifies that b_start is an  
integer parameter. Zope includes several others like ":list" and  
":tokens". These are matters of internal architecture and should not  
appear in the URL.
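
One way to keep such details out of the URL is to declare parameter  
types on the server. The sketch below is not how Zope or Plone work; it  
assumes a hypothetical PARAM_TYPES table and a read_params helper of my  
own naming, and it does no error handling for malformed values.

```python
from urllib.parse import parse_qsl

# Hypothetical server-side declaration: the type lives here, not in the URL.
PARAM_TYPES = {"b_start": int}


def read_params(query):
    """Parse a query string, converting values using PARAM_TYPES."""
    params = {}
    for name, value in parse_qsl(query):
        # Parameters without a declared type stay as plain strings.
        convert = PARAM_TYPES.get(name, str)
        params[name] = convert(value)
    return params


print(read_params("SearchableText=plone&b_start=30"))
```

With this in place the public URL can read search?SearchableText=plone&b_start=30  
and the server still receives b_start as an integer.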

The last, from the UK's Telegraph, is rather interesting. main.jhtml is  
taking a parameter that appears to refer to a file on disk. What if you  
change the path and make it read another file, one that was not  
supposed to be shown to the public? This may seem a humorous hack, but  
it could be worse. Philip Greenspun describes a case of Harvard  
Business School rejecting 119 applicants [2] who edited a URL to check  
their application status.

[2] http://blogs.law.harvard.edu/philg/2005/03/08#a7726
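
If a path must be accepted as a parameter, the server should at least  
confirm that it stays inside the intended folder. What follows is a  
minimal sketch under stated assumptions: the "/news" root is taken from  
the Telegraph URL above, the stays_inside name is hypothetical, and I  
have no knowledge of how the Telegraph's server actually checks its  
input. The check is purely lexical and does not account for symbolic  
links or URL-encoded input.

```python
import posixpath


def stays_inside(requested, root="/news"):
    """Return True if the normalised path remains under root."""
    # Strip leading slashes first: normpath preserves a leading "//".
    resolved = posixpath.normpath("/" + requested.lstrip("/"))
    # Compare against root plus a slash so "/newsletter" does not match "/news".
    return resolved.startswith(root + "/")


print(stays_inside("/news/2005/01/30/wgerm30.xml"))  # True
print(stays_inside("/news/../etc/passwd"))           # False
```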

Further Reading
---------------

Matthew P. Thomas documents cruft in URLs generated by various  
weblogging systems:
http://mpt.phrasewise.com/2003/07/26#a534

Mark Pilgrim documents the process to make Movable Type generate  
cruft-free URLs (warning! technical jargon):
http://diveintomark.org/archives/2003/08/15/slugs

Nathan Ashby-Kuhlman presents more real world examples:
http://www.ashbykuhlman.net/blog/2003/07/27/2227
http://www.ashbykuhlman.net/blog/2003/08/02/2224

Conclusion
----------

We have looked at various ways to construct a URL and the roles each  
part serves. If there is still any doubt about how URLs are relevant to  
community, the answer is simple: to discuss the content of any web  
page, you need a URL that can be shared. Without a URL, you are left  
attempting to reproduce the content (which may be non-trivial for  
graphical or Flash content), and you have no reference that others can  
visit. A simple URL is friendlier, and therefore a better URL.

My next few posts will explore the human side of the UI-Community  
linkup.

-- 
Kiran Jonnalagadda
http://www.pobox.com/~jace