PERL BEGINNERS 14 REGEX PATTERN
Subject: Regex Pattern
Date: Thu, 7 Aug 2003 18:04:15 -0600

From: demo@no-spam (Trevor Morrison)

Hi,

I am trying to use a regex to extract the city, state and zip out of a file.
Now, the problem is the city can have more than one word to it, like
San Antonio
San Francisco
etc.

I have also bumped into cases of 4 or 5 words in the city name! I am looking for a regex that will take this into account, knowing that the form of the line is:

city[space]state(2 letters)[space]zip (could be 5 or 9 digits)

An example:

Loves Park IL 61111
or APO AE NY 09012
or St. George UT 84770
or Columbus OH 43202
or Salt Lake City UT 84118

TIA
Trevor

Date: Thu, 7 Aug 2003 20:19:22 -0400 (EDT)

Subject: Re: Regex Pattern
From: japhy@no-spam (Jeff 'Japhy' Pinyan)
On Aug 7, Trevor Morrison said:

>I am trying to use regex to extract the city, state and zip out of a file.
>Now, the problem is the city can have more than one word to it like
>
>San Antonio
>San Francisco
>
>I have also bumped into cases of 4 or 5 words in the city name! I am
>looking for a regex that will take this into account, knowing that
>the form of the line is:
>
>city[space]state(2 letters)[space]zip (could be 5 or 9 digits)
>
>Loves Park IL 61111
>APO AE NY 09012
>St. George UT 84770
>Columbus OH 43202
>Salt Lake City UT 84118

The simplest way is to ignore what form the city might take, and deal with the knowns -- the state will be two uppercase letters, and the zip code will be five to nine digits.

my ($city, $state, $zip) = $line =~ /^(.*) ([A-Z]{2}) (\d{5,9})$/;

This assumes the fields are separated by a single space. It also only checks for AT LEAST 5 and AT MOST 9 digits, so it would let a 7-digit zip code through. If you want to be more robust, the last part of the regex could be
(\d{5}(?:\d{4})?)

which ensures 5 digits, and then optionally matches 4 more. Then again,
maybe just
(\d{9}|\d{5})

is simpler on the eyes and brain.
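
To see how the three zip alternatives above differ, a small test script along these lines might help. It is only a sketch: the %zip_pattern hash, its labels and the zip_ok() helper are names made up for illustration, and each pattern is anchored with \A and \z so that it is tested on its own rather than as the tail of the full regex.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hash name and labels are illustrative only; the patterns are the three
# zip alternatives discussed above, anchored so each is tested in isolation.
my %zip_pattern = (
    range       => qr/\A(\d{5,9})\z/,          # 5 to 9 digits
    optional    => qr/\A(\d{5}(?:\d{4})?)\z/,  # 5 digits, then optionally 4 more
    alternation => qr/\A(\d{9}|\d{5})\z/,      # exactly 9 or exactly 5 digits
);

# Hypothetical helper: returns 1 if $zip matches the named pattern, else 0.
sub zip_ok {
    my ($label, $zip) = @_;
    return $zip =~ $zip_pattern{$label} ? 1 : 0;
}

for my $zip ('61111', '6111122', '611112222') {
    for my $label (sort keys %zip_pattern) {
        printf "%-12s %-10s %s\n", $label, $zip,
            zip_ok($label, $zip) ? 'match' : 'no match';
    }
}
```

Only the "range" pattern should accept the 7-digit string; the other two should accept just the 5- and 9-digit ones.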

Another approach, if you don't really care about the format of the lines,
and you just want to extract the fields regardless of HOW they look, is to reverse the line, split it into three pieces, and then reverse each of those pieces. A line like "Salt Lake City UT 54321" would be reversed to "12345 TU ytiC ekaL tlaS", split into three pieces ("12345", "TU", and "ytiC ekaL tlaS"), and then each of those pieces would be reversed again,
to give back "54321", "UT", and "Salt Lake City".

my ($zip, $state, $city) =
  map { scalar reverse }          # reverse (in scalar context) (it's important)
  split ' ', (reverse $line), 3;

But maybe that's more than you need.
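
For anyone who wants to try the reverse-and-split idea on the sample lines, here is a minimal sketch. The parse_reversed() name is hypothetical, and the explicit scalar() calls and $_ are spelled out only to make the context obvious; the logic is meant to be the same as the snippet above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# parse_reversed() is a hypothetical wrapper around the reverse/split snippet.
sub parse_reversed {
    my ($line) = @_;

    # Reverse the whole line, split it on whitespace into at most three
    # pieces (zip, state, everything else), then reverse each piece back.
    # scalar() forces string reversal rather than list reversal.
    my ($zip, $state, $city) =
        map { scalar reverse $_ }
        split ' ', scalar(reverse $line), 3;

    return ($city, $state, $zip);
}

for my $line ('Salt Lake City UT 84118', 'APO AE NY 09012') {
    my ($city, $state, $zip) = parse_reversed($line);
    print "$city | $state | $zip\n";
}
```

Because the split limit is 3, everything left over after the zip and the state ends up in the city, however many words it has.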

-- Jeff "japhy" Pinyan japhy@no-spam http://www.pobox.com/~japhy/
RPI Acacia brother #734 http://www.perlmonks.org/ http://www.cpan.org/
<stu> what does y/// stand for? <tenderpuss> why, yansliterate of course.
[ I'm looking for programming work. If you like my work, let me know. ]


Subject: Re: Regex Pattern
Date: Fri, 08 Aug 2003 00:48:42 +0000

From: nomail@no-spam (Freddo)
Hello Trevor,

On Friday 08 August 2003 00:04, Trevor Morrison wrote:
> Hi,
>
> I am trying to use regex to extract the city, state and zip out of a file.
> Now, the problem is the city can have more than one word to it like
> ...
> I have also bumped into cases of 4 or 5 words in the city name! I am
> looking for a regex that will take this into account, knowing
> that the form of the line is:

Just a try:

--- SoF testcities.pl
#!/usr/bin/perl
while (<DATA>) {
    chomp;
    # reverse the line so the zip comes first and the city last
    $revString = reverse;
    ($revZip, $revCode, $revName) = ( $revString =~ /(\d+)\s+(.*?)\s+(.*)$/ );

    # reverse each captured piece back into reading order
    $zip  = reverse $revZip;
    $code = reverse $revCode;
    $name = reverse $revName;

    print "$zip - $name ($code)\n";
}

__DATA__
Loves Park IL 61111
APO AE NY 09012
St. George UT 84770
Columbus OH 43202
Salt Lake City UT 84118
--- EoF testcities.pl
Prints:

61111 - Loves Park (IL)
09012 - APO AE (NY)
84770 - St. George (UT)
43202 - Columbus (OH)
84118 - Salt Lake City (UT)

FYI, there's a very nice text from japhy about "sexeger" (or Reverse Regular Expressions) here: http://japhy.perlmonk.org/sexeger/sexeger.html
I hope this helps...
freddo

Subject: Re: Regex Pattern
Date: Thu, 07 Aug 2003 17:53:56 -0700

From: dzhuo@no-spam (David)
Trevor Morrison wrote:

> Hi,
>
> I am trying to use regex to extract the city, state and zip out of a file.
> Now, the problem is the city can have more than one word to it like
>
> San Antonio
> San Francisco
> etc.
>
> I have also bumped into cases of 4 or 5 words in the city name! I am
> looking for a regex that will take this into account, knowing
> that the form of the line is:
>
> city[space]state(2 letters)[space]zip (could be 5 or 9 digits)
>
> An example:
>
> Loves Park IL 61111
> or APO AE NY 09012
> or St. George UT 84770
> or Columbus OH 43202
> or Salt Lake City UT 84118
>
> TIA
> Trevor

try:

#!/usr/bin/perl -w
use strict;

while(<>){
    chomp;
    # greedy (.+) takes the city; the last two whitespace-separated
    # tokens become state and zip
    my($city,$state,$zip) = /(.+)\s+(\S+)\s+(\S+)/;
    print pack("A7","City:"),$city,"\n";
    print pack("A7","State:"),$state,"\n";
    print pack("A7","Zip:"),$zip,"\n";
}

__END__

it doesn't really meet your "requirement" since it also matches:

Salt Lake City Ut. 84118-12345

but you might consider that to be a valid US address.

david
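
To check what the loose pattern in David's script actually captures, something like the following can be run against both a clean line and the looser one he mentions. This is only a sketch; loose_parse() is a hypothetical name, and the pattern itself is copied from the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# loose_parse() is a hypothetical wrapper around the pattern from the script
# above. In list context it returns the three captures, or an empty list if
# the line does not match.
sub loose_parse {
    my ($line) = @_;
    return $line =~ /(.+)\s+(\S+)\s+(\S+)/;
}

for my $line ('Salt Lake City UT 84118', 'Salt Lake City Ut. 84118-12345') {
    my ($city, $state, $zip) = loose_parse($line);
    print "[$city] [$state] [$zip]\n";
}
```

Since \S+ accepts any run of non-whitespace, "Ut." and "84118-12345" pass as state and zip; whether that is a bug or a feature depends on how clean the input file is.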

Subject: Re: Regex Pattern
Date: Sat, 09 Aug 2003 15:56:55 +0200

From: bigj@no-spam (Janek Schleicher)
Jeff 'Japhy' Pinyan wrote at Thu, 07 Aug 2003 20:19:22 -0400:

> my ($city, $state, $zip) = $line =~ /^(.*) ([A-Z]{2}) (\d{5,9})$/;
>
> This assumes the fields are separated by a space. It also only checks for
> AT LEAST 5 and AT MOST 9 digits, so it would let a 7-digit zip code
> through. If you want to be more robust, the last part of the regex could
> be
>
> (\d{5}(?:\d{4})?)
>
> which ensures 5 digits, and then optionally matches 4 more. Then again,
> maybe just
>
> (\d{9}|\d{5})
>
> is simpler on the eyes and brain.
>
> Another approach, if you don't really care about the format of the lines,

Why not then the very simple
my ($city, $state, $zip) = $line =~ /(.*) (\w+) (\d+)/;

The ^ and $ aren't necessary, as the .* is greedy, and the \w+ and \d+ are enough to get the data (a .* would also be enough).
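
Whether the anchors matter depends on the input. The sketch below runs both patterns from this thread on a well-formed line and on a line with trailing text; the try_match() helper and the second sample line are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Both patterns are quoted from the thread; the variable names are illustrative.
my $anchored   = qr/^(.*) ([A-Z]{2}) (\d{5,9})$/;
my $unanchored = qr/(.*) (\w+) (\d+)/;

# Hypothetical helper: returns the captures joined by '|', or 'no match'.
sub try_match {
    my ($re, $line) = @_;
    my @fields = $line =~ $re;
    return @fields ? join('|', @fields) : 'no match';
}

for my $line ('Columbus OH 43202', 'Columbus OH 43202 (moved)') {
    print "$line\n";
    print "  anchored:   ", try_match($anchored,   $line), "\n";
    print "  unanchored: ", try_match($unanchored, $line), "\n";
}
```

On the clean line both patterns give the same three fields. On the second line the anchored pattern refuses to match, while the unanchored one still pulls out the fields and silently ignores the trailing text.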

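And otherwise, instead of a cryptic reverse-of-reverse solution,
I would still prefer to write something like
# NB: the array name and slices were mangled by the archive's address
# filter; @fields and the slices below are a reconstruction
my @fields = split ' ', $line;
my $city = join ' ', @fields[0 .. $#fields - 2];
my ($state, $zip) = @fields[-2, -1];
which could also be shortened to 2 lines :-)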

Greetings,
Janek