foureleven.org

Email Address Validation (done right)

$Revision: 1.3 $
$Date: 2008/08/27 20:20:53 $
Thanks to Skeeve for the correction!

1.0 Introduction

How many naive validation routines have we seen that look like:

/[\w\-\.]+\@(\w\.)+\w{2,3}/

or an equivalent procedural approach? Regardless of the approach, they are horribly broken, often written and used without regard to standards, input checking, or modularity. The above example fails both lexical and syntactic analysis, allowing mailbox@-my_site.example.com while disallowing the legitimate "%20 + \"\\ + at@at"@example.arpa and name+mailbox@example.com.

1.1 Paul Warren's Mail::RFC822::Address

Argh... little did I realize that Paul Warren had already implemented an RFC 822 email address validation module (member of CPAN since March 2002).

Maybe I'll submit a diff to bring it up to RFC 2822, which obsoletes RFC 822.

2.1 Implementation

RFC 2822's grammar and lexical structure are surprisingly complex, but they can be implemented rather easily with Perl regular expressions, as shown below. I should admit that this code closely resembles this, which is almost "correct." The differences are the addition of folding white space (approximated here with \s*), completion of the addr-spec specification, and fixes for two typos (wrong hex encodings).

$no_ws_ctl      = qr/[\x01-\x08\x0b\x0c\x0e-\x1f\x7f]/;
$text           = qr/[\x01-\x09\x0b\x0c\x0e-\x7f]/; # |$obs-text)/;
$specials       = qr/[\(\)\<\>\[\]\:\;\@\\\,\.\"]/;

$quoted_pair    = qr/\\$text/; # |$obs-qp)/;

$atext          = qr/[A-Za-z0-9\!\#\$\%\&\'\*\+\-\/\=\?\^\_\`\{\|\}\~]/;
$dot_atom_text  = qr/$atext+(\.$atext+)*/; # 1*atext *("." 1*atext)
$dot_atom       = qr/\s*$dot_atom_text\s*/; 
$atom           = qr/\s*$atext+\s*/; # 1*atext, one or more characters

$qtext          = qr/($no_ws_ctl|[\x21\x23-\x5b\x5d-\x7e])/;
$qcontent       = qr/($qtext|$quoted_pair)/; 
$quoted_string  = qr/\s*\"(\s*$qcontent)*\s*\"\s*/;

$dtext          = qr/($no_ws_ctl|[\x21-\x5a\x5e-\x7e])/;
$dcontent       = qr/($dtext|$quoted_pair)/; 
$domain_literal = qr/\s*\[(\s*$dcontent)*\s*\]\s*/;
$domain         = qr/($dot_atom|$domain_literal)/; # |$obs-domain)/;
$local_part     = qr/($dot_atom|$quoted_string)/; # |$obs-local-part)/;
$addr_spec      = qr/$local_part\@$domain/;
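
As a usage sketch, the patterns above can be anchored so that a whole string must be a single addr-spec. The listing is repeated with "my" so the example runs on its own. The helper name is_addr_spec and the sample addresses (apart from the two taken from the introduction) are assumptions made for illustration, not part of the original code.

```perl
use strict;
use warnings;

# Patterns repeated from the listing above so the sketch is self-contained.
my $no_ws_ctl      = qr/[\x01-\x08\x0b\x0c\x0e-\x1f\x7f]/;
my $text           = qr/[\x01-\x09\x0b\x0c\x0e-\x7f]/;
my $quoted_pair    = qr/\\$text/;
my $atext          = qr/[A-Za-z0-9\!\#\$\%\&\'\*\+\-\/\=\?\^\_\`\{\|\}\~]/;
my $dot_atom_text  = qr/$atext+(\.$atext+)*/;
my $dot_atom       = qr/\s*$dot_atom_text\s*/;
my $qtext          = qr/($no_ws_ctl|[\x21\x23-\x5b\x5d-\x7e])/;
my $qcontent       = qr/($qtext|$quoted_pair)/;
my $quoted_string  = qr/\s*\"(\s*$qcontent)*\s*\"\s*/;
my $dtext          = qr/($no_ws_ctl|[\x21-\x5a\x5e-\x7e])/;
my $dcontent       = qr/($dtext|$quoted_pair)/;
my $domain_literal = qr/\s*\[(\s*$dcontent)*\s*\]\s*/;
my $domain         = qr/($dot_atom|$domain_literal)/;
my $local_part     = qr/($dot_atom|$quoted_string)/;
my $addr_spec      = qr/$local_part\@$domain/;

# Hypothetical helper (not in the original): returns 1 only when the whole
# string matches addr-spec, 0 otherwise.  \A and \z anchor the match so a
# valid substring inside a longer string is not enough.
sub is_addr_spec {
    my ($candidate) = @_;
    return $candidate =~ /\A$addr_spec\z/ ? 1 : 0;
}

for my $candidate (
    'name+mailbox@example.com',             # "+" is a legal atext character
    '"%20 + \"\\\\ + at@at"@example.arpa',  # quoted-string local part
    'user@[192.0.2.1]',                     # domain-literal
    'mailbox@example..com',                 # empty label between dots
    '.leading@example.com',                 # dot-atom cannot start with "."
    'no-at-sign.example.com',               # no "@" at all
) {
    printf "%-40s %s\n", $candidate,
        is_addr_spec($candidate) ? 'valid' : 'invalid';
}
```

With these patterns the first three candidates should be reported as valid and the last three as invalid. Without the \A and \z anchors, a string that merely contains an addr-spec somewhere inside it would also pass.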

The declarations for the implementation's variables follow. Under "use strict" they must precede the assignments:

# RFC 2822 3.2.1 Lexical Tokens: Primitive Tokens
# NO-WS-CTL: US-ASCII control characters that do not include the
#            carriage return, line feed, and white space characters
# text:      Characters excluding CR and LF
my($no_ws_ctl, $text, $specials);

# RFC 2822 3.2.2 Lexical Tokens: Quoted Characters
my($quoted_pair);

# RFC 2822 3.2.3 Lexical Tokens: Folding white space and comments
# FWS:       Folding White Space
# ctext:     Non white space controls.  The rest of the US-ASCII
#            characters not including "(", ")", or "\"
my($FWS, $ctext, $ccontent, $comment, $CFWS);

# RFC 2822 3.2.4 Lexical Tokens: Atom
# atext:     Any character except controls, SP, and specials.
my($atext, $atom, $dot_atom, $dot_atom_text);

# RFC 2822 3.2.5 Lexical Tokens: Quoted Strings
# qtext:     Non white space controls.  The rest of the US-ASCII
#            characters not including "\" or the quote character
my($qtext, $qcontent, $quoted_string);

# RFC 2822 3.4.1 Addr-spec specification
# dtext:     Non white space controls.  The rest of the US-ASCII
#            characters not including "[", "]", or "\"
my($addr_spec, $local_part, $domain, $domain_literal, $dcontent, $dtext);
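
The declarations name $FWS, $ctext, $ccontent, $comment, and $CFWS, but the implementation uses \s* in their place. The following is one possible sketch of those tokens, written from the RFC 2822 section 3.2.3 grammar with the obs- forms left out. It is not the author's code: the use of (??{ ... }) for nested comments and the helper names is_comment and is_cfws are assumptions made for illustration.

```perl
use strict;
use warnings;

# Primitive tokens repeated from the implementation so the sketch runs alone.
my $no_ws_ctl   = qr/[\x01-\x08\x0b\x0c\x0e-\x1f\x7f]/;
my $text        = qr/[\x01-\x09\x0b\x0c\x0e-\x7f]/;
my $quoted_pair = qr/\\$text/;

# FWS = ([*WSP CRLF] 1*WSP)
my $FWS   = qr/(?:[ \t]*\r\n)?[ \t]+/;

# ctext = NO-WS-CTL / %d33-39 / %d42-91 / %d93-126
# (everything printable except "(", ")", and "\")
my $ctext = qr/(?:$no_ws_ctl|[\x21-\x27\x2a-\x5b\x5d-\x7e])/;

# comment = "(" *([FWS] ccontent) [FWS] ")", and ccontent may itself be a
# comment.  (??{ ... }) looks up $comment at match time, which lets the
# pattern refer to itself; a package variable is used for that lookup.
our $comment;
my $ccontent = qr/(?:$ctext|$quoted_pair|(??{ $comment }))/;
$comment     = qr/\((?:$FWS?$ccontent)*$FWS?\)/;

# CFWS = *([FWS] comment) (([FWS] comment) / FWS)
my $CFWS = qr/(?:$FWS?$comment)*(?:$FWS?$comment|$FWS)/;

# Hypothetical helpers: whole-string matches against comment and CFWS.
sub is_comment { my ($s) = @_; return $s =~ /\A$comment\z/ ? 1 : 0; }
sub is_cfws    { my ($s) = @_; return $s =~ /\A$CFWS\z/    ? 1 : 0; }

for my $sample ('(a comment)', '(outer (nested) comment)', '(unbalanced') {
    printf "%-28s %s\n", $sample,
        is_comment($sample) ? 'comment' : 'not a comment';
}
```

Substituting $CFWS for the \s* runs in $dot_atom, $quoted_string, and $domain_literal would be the next step toward the full grammar, at the cost of a noticeably slower and harder to read pattern.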

3.1 References