A Strategy for Domain Name Canonicalization

Posted by sam Thu, 18 Mar 2010 17:47:00 GMT

I recently had to solve a problem for a client at quite short notice. It is a twist on the normal domain canonicalization steps that we might use.

The client had three major requirements:

  • The client had 20 publicly accessible domains, of which 3 were “canonical” and represented their presence in three different territories. The rest were legacy, commercial variants, or “catch-alls”. They also had 4 internally resolvable domains for accessing various management aspects of the application (all ACL protected too, of course!).
  • The client required that if a user visits the site with a plain domain name, it should be canonicalized to the www. version with a 301 redirect.
  • The site must be SSL protected.

SSL certificates only existed for the three main domains.

Usually, within this particular part of infrastructure we would move the SSL burden to the load balancer, and have that issue a 302 redirect to the https protected site automatically, preserving any of the path portion of the URL that had been supplied too.

As SSL certificates existed only for the three canonical domains this wasn’t an option: we can’t issue a redirect to https until we have canonicalized the domain (the certificates were not wildcards, even for the three that existed). Canonicalization would take place only after the connection had hit the load balancer. It just would not work.

I took the decision to use the Apache instance fronting-up the application server to handle all of these tasks, as it was the natural place for them to sit. The load balancers were relegated to fail-over, session persistence and NATing only.

A new Apache was recompiled that included:

Now, to tackle domain name canonicalization.

When using name-based virtual hosts, Apache serves a request from the default vhost (the first one defined for that address and port) whenever the requested domain does not match the ServerName or ServerAlias of any of the other named vhosts.

I took advantage of this and had the default vhost configured with an internally resolvable hostname, thereby creating a catch-all for any request whose host is not one of the canonical www. domains.
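A minimal sketch of that layout, assuming Apache 2.2-style name-based virtual hosts on port 80. The hostname mgmt.internal.test is a made-up placeholder for the internally resolvable name, not the client's actual host.

```apache
# Needed for name-based vhosts on Apache 2.2; deprecated and without
# effect on Apache 2.4.
NameVirtualHost *:80

# Defined first, so it is the default: any request whose Host header
# matches no other ServerName/ServerAlias on *:80 is served from here.
<VirtualHost *:80>
    # Hypothetical internal-only name; public clients are not expected
    # to ask for it.
    ServerName mgmt.internal.test

    # The canonicalization rewrite rules go here.
</VirtualHost>

# Defined later, so it only serves requests addressed to this name.
<VirtualHost *:80>
    ServerName www.canonical.test
</VirtualHost>
```

Running httpd -S prints the parsed virtual host table, including which vhost is the default for each address and port, which is a quick way to check the ordering.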

The following configuration directives were used:

RewriteEngine On

# Alternative hostname, with or without a www. prefix
# Repeat for all non-canonical domains
RewriteCond %{HTTP_HOST}    ^.*notcanonical\.test$ [NC,OR]
RewriteCond %{HTTP_HOST}    ^.*alsonotcanonical\.test$ [NC,OR]

# Canonical hostname without www. prefix
# Repeat for all canonical domains
# The final RewriteCond before each RewriteRule carries no OR flag:
# there is no following condition for it to chain to
RewriteCond %{HTTP_HOST}    ^canonical\.test$ [NC]
RewriteRule ^/(.*)          https://www.canonical.test/$1 [L,R=301]

RewriteCond %{HTTP_HOST}    ^anothercanonical\.test$ [NC]
RewriteRule ^/(.*)          https://www.anothercanonical.test/$1 [L,R=301]

There are some important details in here.

  • By redirecting straight to the desired canonical domain with the https:// prefix, the request cycle is shortened to at most one 301 redirect. It would be possible to chain several simpler redirects instead, perhaps for ease of maintenance and DRYness, but every extra hop lengthens the response cycle.
  • The ‘NC’ flag ensures that case is ignored, and we add ‘OR’ to override the implicit AND between consecutive RewriteConds.
  • On the RewriteRule the ‘L’ flag is included to stop any further rules being processed for the request, and ‘R=301’ overrides the default 302 (temporary) redirect with a 301 (permanent).
  • The brackets within the RewriteRule regex capture the match into a numbered backreference ($1). We use this in the 301 target URL to preserve the path portion of the URL. The hostname and port are not seen by the RewriteRule pattern, which is matched against the URL path only; that is why the host is tested with RewriteCond.
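For comparison, the chained alternative mentioned in the first point might look like the sketch below. It is not the configuration that was deployed; it only illustrates the trade-off, reusing the same placeholder hostnames.

```apache
RewriteEngine On

# Hop 1: fold every non-canonical host onto the bare canonical name.
RewriteCond %{HTTP_HOST}    ^.*notcanonical\.test$ [NC,OR]
RewriteCond %{HTTP_HOST}    ^.*alsonotcanonical\.test$ [NC]
RewriteRule ^/(.*)          http://canonical.test/$1 [L,R=301]

# Hop 2: the bare canonical name gets its www. prefix and https.
RewriteCond %{HTTP_HOST}    ^canonical\.test$ [NC]
RewriteRule ^/(.*)          https://www.canonical.test/$1 [L,R=301]
```

With these rules a request for http://notcanonical.test/page is answered with two 301s in turn (first to http://canonical.test/page, then to https://www.canonical.test/page), where the combined rules answer with one.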

The final part of the puzzle is configuring two named virtual hosts for each canonical domain. The first listens on port 80 for plain http and is there to handle the situation whereby the user visits the canonical domain, but does so via plain http. Within the vhost container I used the RedirectMatch directive of mod_alias to issue a 301 redirect to the https:// version of the URL.

RedirectMatch 301 (.*)$ https://www.canonical.test$1
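In context, the plain-http vhost for one canonical domain might look like this sketch. The surrounding vhost layout is an assumption; only the RedirectMatch line comes from the configuration described above.

```apache
<VirtualHost *:80>
    ServerName www.canonical.test

    # mod_alias: send every path to the same path on the https site.
    # $1 already begins with "/", so no slash follows the hostname.
    RedirectMatch 301 (.*)$ https://www.canonical.test$1
</VirtualHost>
```

The simpler Redirect directive from mod_alias, for example Redirect permanent / https://www.canonical.test/, should behave the same way here, since it matches by path prefix and appends the remainder of the path to the target.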

The second named host listens on port 443 and is configured to handle SSL in the usual manner. Within the vhost container we proxy the back-end application with something like:

ProxyRequests Off
<Proxy *>
    Order deny,allow
    Allow from all
</Proxy>
ProxyPass / http://appserver.internal/
ProxyPassReverse / http://appserver.internal/
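
Put together, the https vhost might be sketched as follows. The certificate paths are placeholders, and it assumes mod_ssl, mod_proxy and mod_proxy_http are loaded and that a Listen 443 directive exists elsewhere in the server configuration.

```apache
<VirtualHost *:443>
    ServerName www.canonical.test

    # mod_ssl: terminate SSL here with this domain's own certificate.
    # Both paths are placeholders.
    SSLEngine on
    SSLCertificateFile    /path/to/www.canonical.test.crt
    SSLCertificateKeyFile /path/to/www.canonical.test.key

    # Reverse proxy only; never act as an open forward proxy.
    ProxyRequests Off
    <Proxy *>
        Order deny,allow
        Allow from all
    </Proxy>

    # Forward every path to the back end, and rewrite the Location
    # headers it sends back so redirects stay on the public hostname.
    ProxyPass        / http://appserver.internal/
    ProxyPassReverse / http://appserver.internal/
</VirtualHost>
```

Serving three different certificates from name-based vhosts on a single address relies on SNI support in both the server and the client; without it, each certificate needs its own IP address.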