Thursday

Invalid email address

Has this ever happened to you? A form on a website requires you to enter an email address, but you don't want to enter your real one, so you put something like "NO" in the email address field. When you try to submit the form, the clever developer has found a way to detect your deception! You're told "invalid email address" or something similar, and the site makes you fill it in again. So you put something like "no@no.com", and this time the developer is outwitted and the form accepts your entry.

Well, the tool that developers typically use to decide whether your text is a valid email address is the regular expression, or regex. The idea in this case is that you come up with a coded pattern that describes every possible valid email address. In theory, it shouldn't be that hard. There's a specification for email addresses (RFC 5322), and you can just convert that spec to a regex.

And people have done that... but wait. The actual specification allows email addresses like "Bob Jones"@[4.2.2.1] that, while technically valid, probably won't be sent properly by a lot of mail hosts. And the fully compliant regex is... pretty long.
(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
I'm not even sure that's correct, but I'll take this guy's word for it. Actually, go ahead and check out that link. They cover this topic in pretty good detail.

Anyway, implementing the full regex is a bit awkward, and it's an interesting challenge, so a lot of people make up their own. Poorly. I think I've probably done it before. Check out this one, which I came across in an ugly piece of software I tried to adapt to my uses recently:
^([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)$
This matches "user@[999.999.999.999]" and "user@google.4]" and "...---@0.0.0.0.0.0" (not valid email addresses) but not "user@travel.travel" (a valid address) or "user@localhost" (which relies on local name resolution but isn't technically invalid; I've sent mail to it before). Anyway, it's not a great regular expression. It's okay, but a quick Google search would have saved this developer a lot of time writing this thing.
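If you'd rather not take my word for it, here is a rough Python sketch that runs the pattern quoted above against strings of this kind. The helper name accepts and the sample list are my own, made up for illustration; "user+tag@example.com" is an extra sub-addressing sample, and the bracketed address uses four out-of-range octets.

```python
import re

# The home-made pattern quoted above, copied verbatim into raw strings.
PATTERN = re.compile(
    r"^([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|"
    r"(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)$"
)


def accepts(address: str) -> bool:
    """Hypothetical helper: True if the quoted pattern accepts the address."""
    return PATTERN.match(address) is not None


# Sample addresses chosen for illustration.
SAMPLES = [
    "no@no.com",               # throwaway address
    "user@[999.999.999.999]",  # octets out of range
    "user@google.4]",          # stray closing bracket
    "...---@0.0.0.0.0.0",      # nonsense local part and domain
    "user@travel.travel",      # real TLD longer than four letters
    "user@localhost",          # no dot in the domain
    "user+tag@example.com",    # sub-addressing
]

for address in SAMPLES:
    verdict = "accepted" if accepts(address) else "rejected"
    print(f"{address!r}: {verdict}")
```

The first four samples come out accepted and the last three rejected, which is roughly the opposite of what you'd want.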

And really, developers, isn't this entire process almost totally futile? The user could always enter "aaaaa@yahoo.com", and your regex would accept this misdirection as a valid email address. Email-address regexes only inconvenience legitimate users (for instance, the ones like me who want to use sub-addressing, as in "user+tag@example.com"), while users who want to dodge the form can keep circumventing it simply by learning the new rules.

The only potentially useful case that I can envision for this sort of thing is the kind Jan Goyvaerts mentions (in the authoritative post that I linked to above): if a user mistypes their email address as, for example, "user@yahoo", they obviously forgot to type the TLD at the end. But in this case, it's not necessary to totally block the user from continuing with the form. Developers, please: just alert your user that you've detected what seems to be an invalid email address, but let them continue if they don't want to change it. Allow your user to shoot themselves in the foot, if they insist on doing it. The alternative, blocking users who don't comply with your vision, only promotes the kind of arrogant smarter-than-thou paternalistic attitude that characterizes negative stereotypes of software developers. Don't be that guy (or girl).
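To make the "warn, don't block" idea concrete, here is a minimal Python sketch. Everything in it is hypothetical: the function names email_warning and submit_form, the user_confirmed flag, and the two crude checks are assumptions of mine, not anyone's real form-handling API. The point is only that the check produces a hint and the user gets the final say.

```python
from typing import Optional


def email_warning(address: str) -> Optional[str]:
    """Return a hint if the address looks mistyped, else None.

    Hypothetical heuristic: it only looks for an "@" and for a dot in
    the domain. It never declares an address invalid.
    """
    local, at, domain = address.rpartition("@")
    if not at or not local or not domain:
        return "That doesn't look like an email address. Use it anyway?"
    if "." not in domain and not domain.startswith("["):
        return f'"{domain}" has no top-level domain. Use it anyway?'
    return None


def submit_form(address: str, user_confirmed: bool = False) -> bool:
    """Return True if the form goes through.

    A suspicious address triggers one warning; once the user confirms,
    the form is accepted no matter what they typed.
    """
    warning = email_warning(address)
    if warning and not user_confirmed:
        print(warning)
        return False  # show the form again, with a "yes, really" option
    return True
```

With this, submit_form("user@yahoo") shows the warning once, and submit_form("user@yahoo", user_confirmed=True) goes through. The same goes for "user@localhost", which gets a warning but is never blocked.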


This has probably all been said before, more eloquently, elsewhere. It's just been building up for years, and I wanted to vent my frustration in a constructive way. Thanks for listening. Oh yeah, and if you have good counter-examples, of course I'd like to hear them.
