Hi,
I am trying to use a regex to extract the city, state and zip out of a file.
Now, the problem is the city can have more than one word in it, like
San Antonio
San Francisco
etc.
I have also bumped into cases of 4 or 5 words in the city name! I am
looking for a regular expression that will take this into account, knowing that
the form of the line is:
city[space]state (2 letters)[space]zip (could be 5 or 9 digits)
An example:
Loves Park IL 61111
or
APO AE NY 09012
or
St. George UT 84770
or
Columbus OH 43202
or
Salt Lake City UT 84118
TIA
Trevor
On Aug 7, Trevor Morrison said:
>I am trying to use regex to extract the city,state and zip out of a file.
>Now, the problem is the city can have more then one word to it like
>
>San Antonio
>San Francisco
>
>I have also, bumped into case of 4 or 5 words in the city name! I am
>looking for a regex expression that will take this into account knowing that
>the form of the line is:
>
>city[space]state(2 letters)[space] zip (could be 5 or 9 digits)
>
>Loves Park IL 61111
>APO AE NY 09012
>St. George UT 84770
>Columbus OH 43202
>Salt Lake City UT 84118
The simplest way is to ignore what form the city might take, and deal with
the knowns -- the state will be two uppercase letters, and the zip code
will be five to nine digits.
my ($city, $state, $zip) = $line =~ /^(.*) ([A-Z]{2}) (\d{5,9})$/;
This assumes the fields are separated by a single space. It also only checks for
AT LEAST 5 and AT MOST 9 digits, so it would let a 7-digit zip code
through. If you want to be more robust, the last part of the regex could
be
(\d{5}(?:\d{4})?)
which ensures 5 digits, and then optionally matches 4 more. Then again,
maybe just
(\d{9}|\d{5})
is simpler on the eyes and brain.
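As a rough sketch of the stricter pattern in use (the sub name parse_line and the 7-digit test line are made up for illustration, not part of the original code):

```perl
use strict;
use warnings;

# Hypothetical helper: returns (city, state, zip), or an empty list
# when the line does not fit the expected shape.
sub parse_line {
    my ($line) = @_;
    # Zip is exactly 5 digits, optionally followed by exactly 4 more.
    return $line =~ /^(.*) ([A-Z]{2}) (\d{5}(?:\d{4})?)$/;
}

for my $line ('Salt Lake City UT 84118', 'APO AE NY 09012', 'Columbus OH 4320212') {
    my ($city, $state, $zip) = parse_line($line);
    if (defined $zip) {
        print "$city | $state | $zip\n";
    } else {
        print "no match: $line\n";   # the 7-digit zip ends up here
    }
}
```

The greedy (.*) grabs as much of the city as it can, so "APO AE NY 09012" still splits into "APO AE" and "NY".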
Another approach, if you don't really care about the format of the lines,
and you just want to extract the fields regardless of HOW they look, is to
reverse the line, split it into three pieces, and then reverse each of
those pieces. A line like "Salt Lake City UT 54321" would be reversed to
"12345 TU ytiC ekaL tlaS", split into three pieces ("12345", "TU", and
"ytiC ekaL tlaS"), and then each of those pieces would be reversed again,
to give back "54321", "UT", and "Salt Lake City".
my ($zip, $state, $city) =
  map { scalar reverse }   # reverse (in scalar context) (it's important)
  split ' ', (reverse $line), 3;
But maybe that's more than you need.
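For anyone who wants to try the reverse/split/reverse idea on its own, here is a minimal sketch wrapped in a sub. The name split_from_right is hypothetical, and the sub assumes the line has at least three whitespace-separated pieces:

```perl
use strict;
use warnings;

# Hypothetical wrapper around the reverse/split/reverse idea.
sub split_from_right {
    my ($line) = @_;
    # Reverse the whole string, peel off at most three whitespace-separated
    # pieces (zip, state, everything else), then reverse each piece back.
    my ($zip, $state, $city) =
        map { scalar reverse $_ }
        split ' ', scalar(reverse $line), 3;
    return ($city, $state, $zip);
}

my ($city, $state, $zip) = split_from_right('Salt Lake City UT 84118');
print "$city / $state / $zip\n";
```

The limit of 3 is what keeps a multi-word city together: after the first two pieces are taken, the rest of the reversed string is left as one piece.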
--
Jeff "japhy" Pinyan japhy@no-spam http://www.pobox.com/~japhy/
RPI Acacia brother #734 http://www.perlmonks.org/ http://www.cpan.org/
<stu> what does y/// stand for? <tenderpuss> why, yansliterate of course.
[ I'm looking for programming work. If you like my work, let me know. ]
Hello Trevor,
On Friday 08 August 2003 00:04, Trevor Morrison wrote:
> HI,
>
> I am trying to use regex to extract the city,state and zip out of a file.
> Now, the problem is the city can have more then one word to it like
>
...
> I have also, bumped into case of 4 or 5 words in the city name! I am
> looking for a regex expression that will take this into account knowing
> that the form of the line is:
Just a try:
--- SoF testcities.pl
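#!/usr/bin/perl
use strict;
use warnings;

while (<DATA>) {
    chomp;
    # Reverse the line so the zip comes first and the city name last.
    my $revString = reverse;
    my ($revZip, $revCode, $revName) = ( $revString =~ /(\d+)\s+(.*?)\s+(.*)$/ );
    # Reverse each captured piece back into reading order.
    my $zip  = reverse $revZip;
    my $code = reverse $revCode;
    my $name = reverse $revName;
    print "$zip - $name ($code)\n";
}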
__DATA__
Loves Park IL 61111
APO AE NY 09012
St. George UT 84770
Columbus OH 43202
Salt Lake City UT 84118
--- EoF testcities.pl
Prints:
61111 - Loves Park (IL)
09012 - APO AE (NY)
84770 - St. George (UT)
43202 - Columbus (OH)
84118 - Salt Lake City (UT)
FYI, there's a very nice text from japhy about 'sexeger' (or Reverse Regular
Expressions) here: http://japhy.perlmonk.org/sexeger/sexeger.html
I hope this helps...
freddo
Trevor Morrison wrote:
> HI,
>
> I am trying to use regex to extract the city,state and zip out of a file.
> Now, the problem is the city can have more then one word to it like
>
> San Antonio
>
> San Francisco
>
> etc,
>
> I have also, bumped into case of 4 or 5 words in the city name! I am
> looking for a regex expression that will take this into account knowing
> that the form of the line is:
>
> city[space]state(2 letters)[space] zip (could be 5 or 9 digits)
>
> An example:
>
> Loves Park IL 61111
> or
> APO AE NY 09012
> or
> St. George UT 84770
> or
> Columbus OH 43202
> or
> Salt Lake City UT 84118
>
>
> TIA
>
> Trevor
try:
#!/usr/bin/perl -w
use strict;
while (<>) {
    chomp;
    my ($city, $state, $zip) = /(.+)\s+(\S+)\s+(\S+)/;
    print pack("A7", "City:"),  $city,  "\n";
    print pack("A7", "State:"), $state, "\n";
    print pack("A7", "Zip:"),   $zip,   "\n";
}
__END__
it doesn't really meet your "requirement" since it also matches:
Salt Lake City Ut. 84118-12345
but you might consider that to be a valid US address.
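If you like the loose match but still want to flag lines such as the one above, one option is to extract first and check the fields afterwards. This is only a sketch; extract_loose and looks_valid are hypothetical names, and the validity rule (two uppercase letters, 5 or 9 digits) is taken from the original question:

```perl
use strict;
use warnings;

# Hypothetical two-step approach: extract loosely, then validate.
sub extract_loose {
    my ($line) = @_;
    return $line =~ /(.+)\s+(\S+)\s+(\S+)/;
}

# Returns 1 if state is two uppercase letters and zip is 5 or 9 digits.
sub looks_valid {
    my ($state, $zip) = @_;
    return $state =~ /^[A-Z]{2}$/ && $zip =~ /^(?:\d{5}|\d{9})$/ ? 1 : 0;
}

for my $line ('Salt Lake City UT 84118', 'Salt Lake City Ut. 84118-12345') {
    my ($city, $state, $zip) = extract_loose($line);
    my $verdict = looks_valid($state, $zip) ? 'ok' : 'suspicious';
    print "$city | $state | $zip | $verdict\n";
}
```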
david
Jeff 'Japhy' Pinyan wrote at Thu, 07 Aug 2003 20:19:22 -0400:
> my ($city, $state, $zip) = $line =~ /^(.*) ([A-Z]{2}) (\d{5,9})$/;
>
> This assumes the fields are separated by a space. It also only checks for
> AT LEAST 5 and AT MOST 9 digits, so it would let a 7-digit zip code
> through. If you want to be more robust, the last part of the regex could
> be
>
> (\d{5}(?:\d{4})?)
>
> which ensures 5 digits, and then optionally matches 4 more. Then again,
> maybe just
>
> (\d{9}|\d{5})
>
> is simpler on the eyes and brain.
>
> Another approach, if you don't really care about the format of the lines,
Why not then the very simple
my ($city, $state, $zip) = $line =~ /(.*) (\w+) (\d+)/;
The ^ and $ aren't necessary, as Perl's quantifiers are greedy and the \w+ and
\d+ are enough to get the data (a .* would also be enough).
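A small sketch comparing the two patterns may help show where the anchors do and don't matter. The sub names and the second test line (with a trailing "USA") are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helpers comparing the unanchored and anchored patterns.
sub match_unanchored {
    my ($line) = @_;
    return $line =~ /(.*) (\w+) (\d+)/;
}

sub match_anchored {
    my ($line) = @_;
    return $line =~ /^(.*) ([A-Z]{2}) (\d{5,9})$/;
}

for my $line ('Salt Lake City UT 84118', 'Columbus OH 43202 USA') {
    my @loose  = match_unanchored($line);
    my @strict = match_anchored($line);
    print "$line\n";
    print "  unanchored: ", (@loose  ? join('|', @loose)  : 'no match'), "\n";
    print "  anchored:   ", (@strict ? join('|', @strict) : 'no match'), "\n";
}
```

On well-formed lines both give the same fields. On the second line the unanchored pattern quietly ignores the trailing text, while the anchored one refuses to match, so the choice depends on whether unexpected input should be accepted or flagged.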
And otherwise, instead of a "cryptic" reverse-of-reverse solution,
I would still prefer to write something like
my @parts = split ' ', $line;                    # array name is a stand-in
my $city = join ' ', @parts[0 .. $#parts - 2];   # everything but the last two words
my ($state, $zip) = @parts[-2, -1];              # the last two words
which could also be shortened to 2 lines :-)
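One possible shortening of the split idea, as a sketch only: the sub name split_fields is hypothetical, and the guard for short lines is an addition that the original suggestion does not mention.

```perl
use strict;
use warnings;

# Hypothetical helper: split on whitespace, take the last two words as
# state and zip, and join whatever is left as the city.
sub split_fields {
    my ($line) = @_;
    my @words = split ' ', $line;
    return unless @words >= 3;               # need city, state and zip
    my ($state, $zip) = splice @words, -2;   # removes and returns the last two
    return (join(' ', @words), $state, $zip);
}

my ($city, $state, $zip) = split_fields('St. George UT 84770');
print "$city, $state $zip\n";
```

Note that joining with a single space normalizes any runs of whitespace inside the city name, which the regex solutions leave untouched.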
Greetings,
Janek