PERL PERL5 PORTERS 1 RE TRIE OPTIMISATION OF SIMPLE ALTERNATIONS FOR BLEAD PERL
Subject: Re: Trie optimisation of simple alternations for blead perl.
Date: Tue, 15 Feb 2005 13:11:21 -0800

From: jm@no-spam (Justin Mason)


Date: Tue, 15 Feb 2005 23:17:16 +0000

Subject: Re: Trie optimisation of simple alternations for blead perl.






















From: nick@no-spam (Nicholas Clark)
On Tue, Feb 15, 2005 at 01:11:21PM -0800, Justin Mason wrote:

> Plus (as far as I know) none of us have ever built bleadperl ;)
> I can try it out if there's a tarball though, and give some > rough ideas of the speedup without our code changes?

Giving an idea of speedup [or not :-(] would be useful. I don't think that there are any recent tarballs - everyone tends to get the current source from the rsync server. Something like
rsync --delete -avz rsync://ftp.linux.activestate.com/perl-current/ ~/Perl/bleadperl-src/


Nicholas Clark

Date: Wed, 16 Feb 2005 16:25:57 +0100

Subject: Re: Trie optimisation of simple alternations for blead perl. (Updated
From: david@no-spam (David Landgren)
demerphq wrote:
> On Tue, 15 Feb 2005 13:11:21 -0800, Justin Mason <jm@no-spam> wrote:
> >>-----BEGIN PGP SIGNED MESSAGE-----
>>Hash: SHA1
>>
>>Yitzchak Scott-Thoennes writes:
>>
>>>On Tue, Feb 15, 2005 at 12:40:50PM -0800, Justin Mason wrote:
>>>
>>>>FWIW, this looks like it'd be excellent for SpamAssassin ;)
>>>
>>>Maybe somebody with SpamAssassin could do some benchmarks?
>>

[...]

> I did however do some benchmarks on the perfomance of one particular > regex in the SA suite. Its from SA::Bayes and looks like this:
> > ($token =~ /^(?:a(?:nd|ny|ble|ll|re)|
> m(?:uch|ost|ade|ore|ail|ake|ailing|any|ailto)|
> t(?:his|he|ime|hrough|hat)|
> w(?:hy|here|ork|orld|ith|ithout|eb)|
> f(?:rom|or|ew)| e(?:ach|ven|mail)|
> o(?:ne|ff|nly|wn|ut)| n(?:ow|ot|eed)|
> s(?:uch|ame)| l(?:ook|ike|ong)|
> y(?:ou|our|ou're)|
> The|has|have|into|using|http|see|It's|it's|
> number|just|both|come|years|right|know|already|
> people|place|first|because|
> And|give|year|information|can)$/x);
> > As you can see this regex has been PreSuf'ed for a minor speedup. Run > times of this regex being used to search regcomp.c (a large file) in a > while /\b(RE)\b/ loop were prepatch: 21/s postpatch: 51/s. When the > pre-suf in the pattern is unrolled the times were prepatched:10/s > postpatched:100/s. (Note this is the number of times to extract all > of the matches from the file.)

If you're benching, can you compare how your trie version (out)?performs regexps produced by Regexp::Optimizer and Regexp::Assemble on the above list of words?

Regexp::Assemble produces the following RE on the list of above words:

(?:m(?:a(?:il(?:ing|to)?|[dk]e)|o(?:re|st)|uch)|
w(?:ith(?:out)?|h(?:ere|y)|or(?:ld|k)|eb)|
a(?:l(?:ready|l)|(?:bl|r)e|n[dy])|t(?:h(?:rough|at|is|e)|ime)|
i(?:n(?:formation|to)|t's)|o(?:n(?:ly|e)|ff|ut|wn)|
y(?:ou(?:'re|r)?|ears?)|(?:p(?:eopl|lac)|giv)e|
n(?:o[tw]|umber|eed)|f(?:irst|rom|ew|or)|l(?:o(?:ng|ok)|ike)|
h(?:a(?:ve|s)|ttp)|s(?:(?:am|e)e|uch)|e(?:mail|ach|ven)|
b(?:ecause|oth)|(?:righ|jus)t|c(?:ome|an)|using|It's|know|And)

Regexp::Optimizer produces:

(?:a(?:n[dy]|l(?:l|ready)|(?:bl|r)e)|m(?:o(?:st|re)|
a(?:il(?:ing|to)?|[dk]e)|uch)|t(?:h(?:is|e|rough|at)|ime)|
w(?:h(?:y|ere)|or(?:k|ld)|ith(?:out)?|eb)|f(?:rom|or|ew|irst)|
e(?:ach|ven|mail)|o(?:n(?:e|ly)|ff|wn|ut)|n(?:o[wt]|eed|umber)|
s(?:uch|(?:am|e)e)|l(?:o(?:ok|ng)|ike)|y(?:ou(?:r|'re)?|ears?)|
h(?:a(?:s|ve)|ttp)|i(?:n(?:to|formation)|t's)|b(?:oth|ecause)|
c(?:ome|an)|p(?:eopl|lac)e|using|It's|(?:jus|righ)t|know|And|give)

(word-wrapped to avoid mangling, should be on one line).

David

Date: Thu, 17 Feb 2005 19:49:59 +0100

Subject: Re: Trie optimisation of simple alternations for blead perl. (Updated patch, supports /i modifier)
From: demerphq@no-spam (Demerphq)
On Wed, 16 Feb 2005 16:25:57 +0100, David Landgren <david@no-spam> wrote:
> demerphq wrote:
> > On Tue, 15 Feb 2005 13:11:21 -0800, Justin Mason <jm@no-spam> wrote:
> >
> >>-----BEGIN PGP SIGNED MESSAGE-----
> >>Hash: SHA1
> >>
> >>Yitzchak Scott-Thoennes writes:
> >>
> >>>On Tue, Feb 15, 2005 at 12:40:50PM -0800, Justin Mason wrote:
> >>>
> >>>>FWIW, this looks like it'd be excellent for SpamAssassin ;)
> >>>
> >>>Maybe somebody with SpamAssassin could do some benchmarks?
> >>
> > [...]
> > > I did however do some benchmarks on the perfomance of one particular > > regex in the SA suite. Its from SA::Bayes and looks like this:
> >
[...]
> > If you're benching, can you compare how your trie version (out)?performs > regexps produced by Regexp::Optimizer and Regexp::Assemble on the > above list of words?

Yeah ive put together a benchmark program to test my current build against blead. It adds whitespace and the /x modifier to the regexes to make them easier to read. It tests the same list of words (sorted however) as was in the SA module with and without \b wrappers as constructed by a bunch of the Regexp modules. If anybody is interested in seeing the code they should ask off list, but the underlying benchmark is:
my $found=0; $found++ while $text=~/$re{$name}/g;

Where $text is my local regcomp.c (just because its a large file)

Here is the output:
----------------------------------------------------
used the following Regex modules: Regexp::List, Regexp::Assemble,
Regexp::Optimizer, Regex::PreSuf Read 181168 chars from '\blead_trie\regcomp.c'
Got 74 words to match Testing regexes:
Regexp 'orig' :
(?x-ism:(
(?: a (?: nd | ny | ble | ll | re )| m (?: uch | ost | ade | ore |
ail | ake | ailing | any | ailto )| t (?: his | he | ime | hrough |
hat )| w (?: hy | here | ork | orld | ith | ithout | eb )| f (?:
rom | or | ew )| e (?: ach | ven | mail )| o (?: ne | ff | nly | wn | ut )| n (?: ow | ot | eed )| s (?: uch | ame )| l (?: ook | ike |
ong )| y (?: ou | our | ou're )| The | has | have | into | using |
http | see | It's | it's | number | just | both | come | years |
right | know | already | people | place | first | because | And |
give | year | information | can )
))
Regexp 'RA' [Regexp::Assemble] :
(?x-ism:(
(?: m (?: a (?: il (?: ing | to )?| [dk]e | ny )| o (?: re | st )|
uch )| w (?: ith (?: out )?| h (?: ere | y )| or (?: ld | k )| eb )| a (?: l (?: ready | l )|(?: bl | r ) e | n[dy] )| t (?: h (?:
rough | at | is | e )| ime )| i (?: n (?: formation | to )| t's )|(?: p (?: eopl | lac )| giv | Th ) e | o (?: n (?: ly | e )| ff |
ut | wn )| y (?: ou (?: 're | r )?| ears ?)| n (?: o[tw] | umber |
eed )| f (?: irst | rom | ew | or )| l (?: o (?: ng | ok )| ike )|
h (?: a (?: ve | s )| ttp )| s (?:(?: am | e ) e | uch )| e (?:
mail | ach | ven )| b (?: ecause | oth )|(?: righ | jus ) t | c (?:
ome | an )| using | It's | know | And )
))
Regexp 'RL' [Regexp::List] :
(?x-ism:(
(?= [AITabcefghijklmnoprstuwy] )(?: a (?: l (?: l | ready )| n[dy]
|(?: bl | r ) e )| b (?: ecause | oth )| c (?: an | ome )| e (?:
ach | mail | ven )| f (?: ew | irst | or | rom )| h (?: a (?: s |
ve )| ttp )| i (?: n (?: formation | to )| t\'s )| l (?: o (?: ng |
ok )| ike )| m (?: a (?: il (?: ing | to )?| [dk]e | ny )| o (?: re | st )| uch )| n (?: o[tw] | eed | umber )| o (?: n (?: e | ly )|
ff | ut | wn )| p (?: eopl | lac ) e | s (?:(?: am | e ) e | uch )|
t (?: h (?: at | e | is | rough )| ime )| w (?: h (?: ere | y )|
ith (?: out )?| or (?: k | ld )| eb )| y (?: ears ?| ou (?: \'re |
r )?)| And | It\'s |(?: Th | giv ) e |(?: jus | righ ) t | know |
using )
))
Regexp 'RO' [Regexp::Optimizer] :
(?x-ism:(
(?: a (?: l (?: l | ready )| n[dy] |(?: bl | r ) e )| b (?: ecause | oth )| c (?: an | ome )| e (?: ach | mail | ven )| f (?: ew |
irst | or | rom )| h (?: a (?: s | ve )| ttp )| i (?: n (?:
formation | to )| t's )| l (?: o (?: ng | ok )| ike )| m (?: a (?:
il (?: ing | to )?| [dk]e | ny )| o (?: re | st )| uch )| n (?:
o[tw] | eed | umber )| o (?: n (?: e | ly )| ff | ut | wn )| p (?:
eopl | lac ) e | s (?:(?: am | e ) e | uch )| t (?: h (?: at | e |
is | rough )| ime )| w (?: h (?: ere | y )| ith (?: out )?| or (?:
k | ld )| eb )| y (?: ears ?| ou (?: 're | r )?)| And | It's |(?:
Th | giv ) e |(?: jus | righ ) t | know | using )
))
Regexp 'RPS' [Regex::PreSuf] :
(?x-ism:(
(?: And | It\'s | The | a (?: ble | l (?: ready | l )| n[dy] | re )| b (?: ecause | oth )| c (?: an | ome )| e (?: ach | mail | ven )| f (?: ew | irst | or | rom )| give | h (?: a (?: ve | s )| ttp )| i (?: n (?: formation | to )| t\'s )| just | know | l (?: ike |
o (?: ng | ok ))| m (?: a (?: de | il (?: ing | to )?| ke | ny )| o (?: re | st )| uch )| n (?: eed | o[tw] | umber )| o (?: ff | n (?:
ly | e )| ut | wn )| p (?: eopl | lac ) e | right | s (?: ame | ee | uch )| t (?: h (?: at | is | rough | e )| ime )| using | w (?: eb | h (?: ere | y )| ith (?: out )?| or (?: ld | k ))| y (?: ears ?|
ou (?: \'re | r )?))
))
Regexp 'smpl' :
(?x-ism:(
And | It's | The | able | all | already | and | any | are | because | both | can | come | each | email | even | few | first | for |
from | give | has | have | http | information | into | it's | just | know | like | long | look | made | mail | mailing | mailto | make | many | more | most | much | need | not | now | number | off | one | only | out | own | people | place | right | same | see | such |
that | the | this | through | time | using | web | where | why |
with | without | work | world | year | years | you | you're | your ))
Running tests for 2 seconds on 2 perls:
Pfx |Perl ----+--------------------------
A |g:\perl_trie\bin\perl.exe B |g:\perl_blead\bin\perl.exe ----+--------------------------
Rate B_smpl B_orig B_RO B_RA B_RPS A_RA B_RL A_RO A_RPS A_orig A_RL A_smpl B_smpl 2.87/s -- -38% -43% -44% -58% -58% -61% -64% -65% -75% -77% -89%
B_orig 4.64/s 62% -- -7% -10% -31% -33% -37% -42% -43% -59% -62% -82%
B_RO 4.99/s 74% 7% -- -3% -26% -28% -32% -37% -39% -56% -60% -81%
B_RA 5.15/s 80% 11% 3% -- -24% -25% -30% -35% -37% -55% -58% -80%
B_RPS 6.75/s 136% 45% 35% 31% -- -2% -8% -15% -17% -41% -45% -74%
A_RA 6.89/s 140% 48% 38% 34% 2% -- -6% -13% -15% -39% -44% -73%
B_RL 7.34/s 156% 58% 47% 42% 9% 7% -- -8% -10% -35% -41% -71%
A_RO 7.95/s 177% 71% 59% 54% 18% 15% 8% -- -2% -30% -36% -69%
A_RPS 8.13/s 184% 75% 63% 58% 20% 18% 11% 2% -- -28% -34% -68%
A_orig 11.4/s 296% 145% 128% 120% 68% 65% 55% 43% 40% -- -8% -56%
A_RL 12.4/s 331% 166% 148% 140% 83% 80% 68% 56% 52% 9% -- -52%
A_smpl 25.7/s 795% 453% 414% 398% 280% 273% 250% 223% 216% 126% 108% --

Testing with /\b(PAT)\b/
Pfx |Perl ----+--------------------------
A |g:\perl_trie\bin\perl.exe B |g:\perl_blead\bin\perl.exe ----+--------------------------
Rate B_smpl B_orig B_RO B_RA A_RPS B_RPS A_RA A_RO B_RL A_orig A_RL A_smpl B_smpl 11.9/s -- -38% -43% -43% -48% -56% -58% -63% -69% -74% -76% -88%
B_orig 19.3/s 62% -- -7% -7% -15% -28% -32% -40% -50% -57% -62% -80%
B_RO 20.7/s 74% 7% -- -0% -9% -23% -27% -36% -46% -54% -59% -79%
B_RA 20.8/s 75% 8% 0% -- -9% -23% -27% -36% -46% -54% -59% -79%
A_RPS 22.7/s 91% 18% 10% 10% -- -15% -20% -30% -41% -50% -55% -77%
B_RPS 26.8/s 126% 39% 29% 29% 18% -- -6% -17% -30% -41% -47% -73%
A_RA 28.5/s 139% 47% 37% 37% 25% 6% -- -12% -26% -37% -43% -71%
A_RO 32.4/s 172% 68% 56% 56% 42% 21% 14% -- -15% -28% -36% -67%
B_RL 38.2/s 222% 98% 84% 84% 68% 42% 34% 18% -- -16% -24% -61%
A_orig 45.3/s 281% 135% 118% 118% 99% 69% 59% 40% 18% -- -10% -54%
A_RL 50.4/s 324% 161% 143% 143% 122% 88% 77% 56% 32% 11% -- -49%
A_smpl 98.9/s 732% 413% 377% 376% 335% 269% 248% 206% 159% 119% 96% --

-- First they ignore you, then they laugh at you, then they fight you,
then you win.
+Gandhi