The longest word that can’t be expressed as a URL hack

Colin and I were playing Domainr, which is a pretty fun and fantastic site for generating domain hacks. Of course, I already own the best one (wait what), but there’s still a ton of joy in finding out, among other things, that quiteajo.lt is available! (although I’m sure it won’t be five seconds after I hit “Publish”)

Actually, Domainr also produces quite a few things that aren’t technically domain hacks, because a domain hack, by virtue of the first word in the phrase, is a domain name. Domainr searches for top-level domains anywhere in the input string, and if there’s spillover text after the TLD, Domainr will slop the rest of the text over into a first-level directory. For example, in addition to quiteajo.lt, Domainr also produces quitea.jo/lt and qu.it/eajolt. As far as I know, there isn’t a term for this, and I’m not really comfortable with calling it a domain hack, so I’m just going to call it a “URL hack” because the word or phrase forms the whole URL (without the scheme, but I think we can just assume it’s HTTP).
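To make the splitting concrete, here is a minimal sketch of how a word might be carved into domain and directory. It is only my reading of the behavior described above, not Domainr's actual code; the function name `url_hacks` and the three-element TLD list are made up for illustration.

```python
def url_hacks(word, tlds):
    """Return every 'sld.tld/path' split of word for the given TLDs."""
    hacks = []
    # Start at 1 so at least one character is left over for the domain name.
    for start in range(1, len(word)):
        for tld in tlds:
            if word.startswith(tld, start):
                sld = word[:start]
                path = word[start + len(tld):]
                # Only add a directory when text spills over past the TLD.
                hacks.append(f"{sld}.{tld}" + (f"/{path}" if path else ""))
    return hacks


# Hypothetical TLD list, limited to the three that matter for this example.
print(url_hacks("quiteajolt", ["lt", "jo", "it"]))
# ['qu.it/eajolt', 'quitea.jo/lt', 'quiteajo.lt']
```

Only the last result is a domain hack in the strict sense; the other two need the directory part to finish the word.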

Finding the longest word that can’t be a domain hack is boring: just find the longest word that doesn’t end in a top-level domain. So I’m kind of curious about what’s the longest word that can’t be expressed as a URL hack.

Some top-level domains have structural restrictions on what domains can be registered, mostly in the form of country code TLDs where direct second-level registration isn’t allowed. Registrations for .al can only occur under five second-level domains, for instance. I’ll try to take into account as many of these as possible.

Many of the registrars for the country code top-level domains also restrict registrations to entities that meet certain conditions, like “having a presence in the country” or “being a government agency.” I’ll just assume you somehow have the ability to meet these requirements, no matter how unrealistic they are (you will never, ever get a .mil domain). Some domains also have no active registrar or do not accept any public registrations, so I’ve left them off when I’ve actually noticed them.

Anyway, let’s code. I could go for some Python right about now.

# Truncated here; the full lists are in the text files mentioned below.
unrestrictedTLDs = ["biz", "info", ...]
restrictedTLDs = ["com", "net", "org"]
unrestrictedSubdomains = ["goval", "edual", ...]

def isURLHack(word):
    # Open TLDs can start anywhere after the first character.
    if any(tld in word[1:] for tld in unrestrictedTLDs):
        return True
    # These three need an SLD of at least two characters.
    if any(tld in word[2:] for tld in restrictedTLDs):
        return True
    # Open subdomains such as com.al are matched with the dot removed.
    if any(subdomain in word[1:] for subdomain in unrestrictedSubdomains):
        return True
    return False

# Track every word tied for the longest that is not a URL hack.
longestWords = [""]
with open("/usr/share/dict/words") as words:
    for word in words:
        trimmedWord = word.rstrip()
        if len(longestWords[0]) <= len(trimmedWord) and not isURLHack(trimmedWord):
            if len(longestWords[0]) < len(trimmedWord):
                longestWords = []
            longestWords.append(trimmedWord)

print(longestWords)

The algorithm here is conceptually the same as the one Domainr uses to find URL hacks: find out whether any top-level domain is a substring of the input word. If one is, the characters before the TLD become the second-level domain, and the text after it becomes the directory. This program should apply this test to all words in /usr/share/dict/words and print out the longest word that doesn't pass it, or words if there's a tie for first place.

Explanations about the three arrays at the top of the code:

The unrestricted TLDs are top-level domains like .biz, where registration of second-level domains is wide open, and so in a URL hack they could conceivably appear anywhere except the very beginning. I said earlier for a URL hack that "the word or phrase forms the whole URL"; the domain name in a URL has to include a second-level domain, so at least one character must appear before the TLD. I've truncated the list above due to length (191 elements), but I've created a text file with the contents.

The restricted TLDs are the three top-level domains where single-letter second-level domains are not allowed. Any SLDs for these domains must contain at least two characters, so we start searching for the TLD two characters into the word instead of one.

The unrestricted subdomains are the open subdomains of top-level domains that restrict registration in some fashion. You can't register a domain under .al, for instance, but you can under .com.al, so "comal" appears on the unrestricted subdomain list. Under the subdomain, we can register whatever we want, so for our purposes we can just treat the whole subdomain as a TLD. Again, this list is way too long to show above, so I've made another text file with the 518 elements.
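The three rules can be tried out without the full lists. The sketch below uses tiny stand-in lists (nowhere near the real 191 and 518 elements), and the helper names `flatten`, `contains_suffix` and `is_url_hack` are mine rather than anything from the script above. The subdomains are written with their dots and flattened on the fly, which I'm assuming is equivalent to storing "comal" directly.

```python
# Stand-in lists for illustration only; the real ones are far longer.
UNRESTRICTED_TLDS = ["biz", "info"]
RESTRICTED_TLDS = ["com", "net", "org"]          # SLD must be 2+ characters
OPEN_SUBDOMAINS = ["gov.al", "edu.al", "com.al"]


def flatten(subdomain):
    """Drop the dots so 'com.al' can be matched as the substring 'comal'."""
    return subdomain.replace(".", "")


def contains_suffix(word, suffix, min_sld_length=1):
    """True if suffix occurs with at least min_sld_length characters before it."""
    return suffix in word[min_sld_length:]


def is_url_hack(word):
    return (
        any(contains_suffix(word, tld) for tld in UNRESTRICTED_TLDS)
        or any(contains_suffix(word, tld, 2) for tld in RESTRICTED_TLDS)
        or any(contains_suffix(word, flatten(sub)) for sub in OPEN_SUBDOMAINS)
    )


print(is_url_hack("showbiz"))        # True:  show.biz
print(is_url_hack("bizarre"))        # False: nothing in front of "biz"
print(is_url_hack("excommunicate"))  # True:  ex.com/municate
print(is_url_hack("xcom"))           # False: "x" is too short an SLD for .com
```

Slicing the word before the substring test is what enforces the minimum SLD length: anything that only matches inside the first one or two characters is simply never seen.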

Enough about that, results! Or rather, result! Our winner, at seventeen characters, is:

chzbook:~ CHz$ python urlhack.py
['plenipotentiarize']

A quick stop at Domainr confirms that this "word" (I think this may be the first time anyone has ever used it) cannot be expressed as a full URL in any way. It starts with .pl (but is not a URL because the domain name has to have an SLD) and contains both .ni and .ar (but is not a URL because both TLDs restrict registrations to third-level domains, none of which are matched by the word).
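For the curious, a quick sketch shows where those three TLDs sit inside the word. The helper `tld_positions` is a made-up name, and it only looks at the three TLDs mentioned above rather than the full lists.

```python
def tld_positions(word, tlds):
    """Map each TLD to the index of its first occurrence in word (-1 if absent)."""
    return {tld: word.find(tld) for tld in tlds}


word = "plenipotentiarize"
print(len(word))                                # 17
print(tld_positions(word, ["pl", "ni", "ar"]))  # {'pl': 0, 'ni': 3, 'ar': 12}
```

An index of 0 for "pl" is exactly the case the script rules out by searching from the second character onward.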

I made a quickie adjustment to the script to find all words in /usr/share/dict/words that are fifteen characters or longer:

17 characters
plenipotentiarize
16 characters
counterartillery, interconvertible
15 characters
counterparallel, pentahexahedral, Plenipotentiary, plenipotentiary, spondylodidymia, spondylotherapy, subcontraoctave, tetrahexahedral, ventripotential, vertebrarterial
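The adjusted script isn't shown here, so the following is just one way the adjustment might look: collect every non-hack word of at least fifteen characters and group the survivors by length. The name `group_long_words` and the `is_hack` parameter are hypothetical; in practice `is_hack` would be the URL hack test from the main script.

```python
from collections import defaultdict


def group_long_words(words, is_hack, min_length=15):
    """Group words that fail the is_hack test by length, longest first."""
    groups = defaultdict(list)
    for word in words:
        word = word.rstrip()  # dictionary lines end with a newline
        if len(word) >= min_length and not is_hack(word):
            groups[len(word)].append(word)
    return dict(sorted(groups.items(), reverse=True))


# Stand-in input and a stand-in test that rejects nothing, for illustration.
sample = ["plenipotentiarize\n", "counterartillery\n", "cat\n", "spondylodidymia\n"]
for length, found in group_long_words(sample, lambda w: False).items():
    print(length, "characters:", ", ".join(found))
```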

Sorry to disappoint the legions of people who were hoping for a URL hack for "spondylodidymia." Guess we'll all have to wait until some top-level domains are added. :(
