A Strategy for Domain Name Canonicalization

Posted by sam Thu, 18 Mar 2010 17:47:00 GMT

I recently had to solve a problem for a client at quite short notice. It is a twist on the normal domain canonicalization steps that we might use.

The client had three major requirements:

  • The client had 20 publicly accessible domains, of which 3 were “canonical” and represented their presence in three different territories. The rest were legacy, commercial variants, or “catch-alls”. They had 4 internally resolvable domains to access various management aspects of the application (all ACL protected too, of course!).
  • The client required that if a user visits the site with a plain domain name, it should be canonicalized to the www. version with a 301 redirect.
  • The site must be SSL protected.

SSL certificates only existed for the three main domains.

Usually, within this part of the infrastructure, we would move the SSL burden to the load balancer and have it issue a 302 redirect to the https-protected site automatically, preserving any path portion of the URL that had been supplied.

As SSL certificates existed only for the three canonical domains, this wasn’t an option: we can’t issue a redirect to https until we have canonicalized the domain (the certificates were not wildcards, even for the three that existed). Canonicalization would take place only after the connection had hit the load balancer, so it just would not work.

I took the decision to use the Apache instance fronting-up the application server to handle all of these tasks, as it was the natural place for them to sit. The load balancers were relegated to fail-over, session persistence and NATing only.

A new Apache was recompiled that included:

Now, to tackle domain name canonicalization.

When using name-based virtual hosts, Apache serves content from the first-defined (default) vhost whenever the requested domain does not match the ServerName or ServerAlias of any of the other named vhosts.

I took advantage of this and had the default vhost configured with an internally resolvable hostname, thereby creating a catch-all for any requests not to canonical domains.
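As a rough sketch of that layout, the vhost ordering might look like the fragment below. It assumes Apache 2.2-style name-based virtual hosting, and catchall.internal.test is a hypothetical stand-in for the internally resolvable hostname, which the original setup does not name.

```apache
NameVirtualHost *:80

# Defined first, so it becomes the default for any Host header
# that no later vhost claims via ServerName or ServerAlias.
<VirtualHost *:80>
    # Hypothetical internal hostname, for illustration only
    ServerName catchall.internal.test
    # The canonicalization rewrite rules live in this container
</VirtualHost>

# Later vhosts claim only the www. canonical names
<VirtualHost *:80>
    ServerName www.canonical.test
</VirtualHost>
```

Because nothing public resolves to the default vhost's own name, it only ever sees the traffic that fell through from everything else.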

The following configuration directives were used:

RewriteEngine On

# Alternative hostname, with or without a www. prefix
# Repeat for all non-canonical domains
RewriteCond %{HTTP_HOST}    ^.*notcanonical\.test$ [NC,OR]
RewriteCond %{HTTP_HOST}    ^.*alsonotcanonical\.test$ [NC,OR]

# Canonical hostname without www. prefix
# Repeat for all canonical domains
# The last RewriteCond before each RewriteRule carries no OR flag;
# a trailing OR would let the rule fire for every host
RewriteCond %{HTTP_HOST}    ^canonical\.test$ [NC]
RewriteRule ^/(.*)          https://www.canonical.test/$1 [L,R=301]

RewriteCond %{HTTP_HOST}    ^anothercanonical\.test$ [NC]
RewriteRule ^/(.*)          https://www.anothercanonical.test/$1 [L,R=301]

There are some important details in here.

  • By redirecting straight to the desired canonical domain with https:// prefix the request cycle is shortened to only one possible 301 redirect. It would be possible to simply let the URL iterate over the conditions multiple times, perhaps for ease of maintenance and DRYness, but you risk increasing the response cycle length.
  • The ‘NC’ flag ensures that case is ignored, and we add ‘OR’ to override the implicit AND between consecutive RewriteConds.
  • When we issue the rewrite, the ‘L’ flag is included to stop any further rules being processed, and we override the default 302 (temporary) redirect with a 301 (permanent).
  • The brackets within the RewriteRule regex capture the match into a numbered backreference ($1). We use this in the 301 target URL to preserve the path portion of the URL. The hostname and port are not part of what a RewriteRule pattern matches, which is why they are tested with RewriteCond instead.
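For comparison, the "iterate" alternative mentioned in the first point might look something like the sketch below. This is an illustration of the trade-off only, not the configuration that was deployed, and the (www\.)? host pattern is an assumption made for brevity.

```apache
RewriteEngine On

# Hop 1: any alias, with or without www., goes to the bare canonical name
RewriteCond %{HTTP_HOST}    ^(www\.)?notcanonical\.test$ [NC,OR]
RewriteCond %{HTTP_HOST}    ^(www\.)?alsonotcanonical\.test$ [NC]
RewriteRule ^/(.*)          http://canonical.test/$1 [L,R=301]

# Hop 2: the bare canonical name goes to the https www. version
RewriteCond %{HTTP_HOST}    ^canonical\.test$ [NC]
RewriteRule ^/(.*)          https://www.canonical.test/$1 [L,R=301]
```

Each target hostname is now written in only one place, but a visitor arriving on an alias pays for two 301 round trips before reaching the site, which is the cost the single-hop rules avoid.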

The final part of the puzzle is configuring two named virtual hosts for each canonical domain. The first listens on port 80 for plain http and is there to handle the situation whereby the user visits the canonical domain, but does so via plain http. Within the vhost container I used the RedirectMatch directive of mod_alias to issue a 301 redirect to the https:// version of the URL.

RedirectMatch 301 (.*)$ https://www.canonical.test$1
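In context, the port 80 container for one canonical domain might look roughly like this. It is a minimal sketch: it assumes mod_alias is loaded, and any logging or other directives the real vhost carried are omitted.

```apache
<VirtualHost *:80>
    ServerName www.canonical.test

    # Send every plain-http request to the same path on https
    RedirectMatch 301 ^(.*)$ https://www.canonical.test$1
</VirtualHost>
```

One such container would be repeated for each of the three canonical domains, each pointing at its own https:// hostname.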

The second named host listens on port 443 and is configured to handle SSL in the usual manner. Within the vhost container we proxy the back-end application with something like:

ProxyRequests Off
<Proxy *>
    Order deny,allow
    Allow from all
</Proxy>
ProxyPass / http://appserver.internal/
ProxyPassReverse / http://appserver.internal/
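Putting the SSL and proxy pieces together, the port 443 container might be sketched as follows. The certificate and key paths are placeholders rather than the real locations, and the sketch assumes mod_ssl and mod_proxy_http are loaded and that a Listen 443 directive exists elsewhere.

```apache
<VirtualHost *:443>
    ServerName www.canonical.test

    SSLEngine on
    # Placeholder paths; substitute the real certificate and key
    SSLCertificateFile    /path/to/www.canonical.test.crt
    SSLCertificateKeyFile /path/to/www.canonical.test.key

    # Reverse proxy only; never act as a forward proxy
    ProxyRequests Off
    ProxyPass        / http://appserver.internal/
    ProxyPassReverse / http://appserver.internal/
</VirtualHost>
```

With three separate, non-wildcard certificates on one server, each SSL vhost generally needs either its own IP address or SNI support in both Apache and the visiting browser.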

Building and installing mod_proxy_html and mod_xml2enc

Posted by sam Tue, 17 Nov 2009 16:01:00 GMT

Introduction

The mod_proxy_html module from Webthing is to the body of an HTTP response what mod_proxy is to the headers. It is especially useful for frigging with the output of a site or application that is being proxied and has made the interesting decision to use absolute rather than relative URLs in its output and links. Of course, if you are proxying, say, an internal application to the outside world, your internal DNS namespace is not going to be resolvable to visitors, and the links will not work.

For the uninitiated, the mod_proxy_html homepage doesn’t give much of a clue about how to build and install the module. Here is a quick, platform-independent guide.

Assumptions

  • You have a working build environment (`which gcc`, `which make`, etc.)
  • You have libxml2 and libxml2-devel installed, and the includes are present in /usr/include/libxml2
  • You are using Apache 2.x and its base directory is /usr/local/apache
  • You have unpacked the source into /usr/local/src/mod_proxy_html, and the two mod_xml2enc source files into /usr/local/src/mod_xml2enc [1][2].

Satisfying Dependencies

The module mod_xml2enc is required to build mod_proxy_html. You don’t have to use this module if you only want to parse ASCII, but it’ll log an annoying message on every restart if you don’t. Given both of these things you may as well build it and use it.

Building mod_xml2enc

Within /usr/local/apache/bin execute the command:

./apxs -aic -I/usr/include/libxml2 /usr/local/src/mod_xml2enc/mod_xml2enc.c

The ‘c’ option actually does the compiling, whilst ‘a’ activates the module by placing the configuration directive into your httpd.conf file, and ‘i’ installs the compiled DSO into your Apache’s ‘modules’ subdirectory. As we’ll see in a minute, just loading the modules with ‘LoadModule’ (as added by the ‘a’ option) isn’t enough to get going.

Providing you see no error messages you’re good to go.

Building mod_proxy_html

Within /usr/local/apache/bin execute the command:

./apxs -aic -I/usr/include/libxml2 -I/usr/local/src/mod_xml2enc /usr/local/src/mod_proxy_html/mod_proxy_html.c

Again, all being well, you’ll have the module compiled, installed and activated. Notice that we need to include the mod_xml2enc directory too; it’s not a typo. However, do not restart your Apache instance just yet.

Loading libxml2 into Apache

Before you restart your Apache instance, as is good practice, test the syntax with the httpd -t command:

# /usr/local/apache/bin/httpd -t
httpd: Syntax error on line 56 of /usr/local/www1/conf/httpd.conf: Cannot load /usr/local/www1/modules/mod_proxy_html.so into server: ld.so.1: httpd: fatal: relocation error: file /usr/local/www1/modules/mod_proxy_html.so: symbol htmlFreeParserCtxt: referenced symbol not found

This is because you need to also load the libxml2.so file into Apache. To do this, open up /usr/local/apache/conf/httpd.conf and locate the two LoadModule lines added by apxs:

LoadModule proxy_html_module modules/mod_proxy_html.so
LoadModule xml2enc_module modules/mod_xml2enc.so

Immediately before the first of these two lines, add:

LoadFile /usr/lib/libxml2.so

Test as before:

/usr/local/apache/bin/httpd -t
Syntax OK

You can now restart your Apache and follow the guides on the site for using the module.

Nothing works! - "No links configured: nothing for proxy-html filter to do"

If nothing seems to be rewritten on your site, try altering the LogLevel statement in your httpd.conf to ‘info’. If you then start seeing the above message in your logs you’d do well to take a look at this site, which will explain exactly why. Suffice to say, documentation in general is lacking for this module.
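As far as I understand it, this message usually means that no ProxyHTMLLinks directives are in effect, so the filter has no idea which attributes contain URLs. The fragment below is a minimal sketch under that assumption, for mod_proxy_html 3.1 or later. The list of elements is a small sample rather than the full set shipped in the module's example proxy_html.conf, and the URL map reuses the appserver.internal back-end name purely for illustration.

```apache
# Tell the filter which element attributes hold URLs
ProxyHTMLLinks a      href
ProxyHTMLLinks link   href
ProxyHTMLLinks img    src longdesc usemap
ProxyHTMLLinks form   action
ProxyHTMLLinks script src for

<Location />
    # Switch the filter on for proxied HTML responses
    ProxyHTMLEnable On
    # Rewrite absolute back-end links to paths on the public site
    ProxyHTMLURLMap http://appserver.internal/ /
</Location>
```

Including the full example proxy_html.conf from the source distribution is probably the easier route in practice, since it also covers event attributes and the less common elements.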