PERL PERL5 PORTERS 2 RE TRIE OPTIMISATION OF SIMPLE ALTERNATIONS FOR BLEAD PERL
Date: Tue, 15 Feb 2005 21:50:13 +0000

Subject: Re: Trie optimisation of simple alternations for blead perl.
From: davem@no-spam (Dave Mitchell)

On Tue, Feb 15, 2005 at 08:19:10PM +0000, Nicholas Clark wrote:
> I think that the problem is that the two people best qualified to comment on > this (Dave and Hugo) are both busy at the moment.

Oy, I seem to have been promoted in my absence!
I know practically nothing about the internals of the regex engine - I'm just tinkering with the outside edges of it as regards the (?{...}) stuff.

Dave
(who's just finished lighting Othello, and who will be lighting Carpe Jugulum in three weeks' time, and who is still wondering how he was volunteered for both)

-- "Emacs isn't a bad OS once you get used to it.
It just lacks a decent editor."


Date: Tue, 15 Feb 2005 21:59:05 +0000

Subject: Re: Trie optimisation of simple alternations for blead perl.





From: nick@no-spam (Nicholas Clark)
On Tue, Feb 15, 2005 at 09:50:13PM +0000, Dave Mitchell wrote:

> (who's just finished lighting Othello, and who will be lighting Carpe > Jugulum in three weeks' time, and who is still wondering how he was > volunteered for both)

By not following Rincewind's advice? (oh. you said Carpe Jugulum, not Mort)

Nicholas Clark

Date: Wed, 16 Feb 2005 16:30:47 +0100

Subject: Re: Trie optimisation of simple alternations for blead perl.
From: rgarciasuarez@no-spam (Rafael Garcia-Suarez)
Justin Mason wrote:
> FWIW, this looks like it'd be excellent for SpamAssassin ;)
> > I haven't had much time to look over the implementation, and I'm not > really any use for reviewing it from a p5p POV due to lack of familiarity > with perl internals, but the benchmark figures look fantastic and the > implementation details sound good.

I was waiting for Hugo's advice on this patch.
Also, it would be nice to have something that works with /i, if that can be done reliably.

> I'd love to see this get into perl, even if just as an option enabled > through a "use" pragma. (in my opinion, if your regexp will benefit > from a trie, you will know that in advance.)

or a new end-of-regexp switch. But if it works well I don't see why this shouldn't be optimized by default (although avoiding slowdowns in some cases would even be better -- do they affect regexp compilation or execution ?)


Date: Wed, 16 Feb 2005 17:30:12 +0100

Subject: Re: Trie optimisation of simple alternations for blead perl.
From: rgarciasuarez@no-spam (Rafael Garcia-Suarez)
[re cc'ing p5p]
demerphq wrote:
> On Wed, 16 Feb 2005 16:30:47 +0100, Rafael Garcia-Suarez > > I was waiting for Hugo's advice on this patch.
> > Also, it would be nice to have something that works with /i, if that can > > be done reliably.
> > Its done. I hope reliably too, but there is this annoying segfault > with SA that i need to figure out. While it does appear to be related > to my code im a bit pushed to come up with an explanation for why > slight changes to code around the regex could change the segfault > behaviour. (Note i cant get SA to segfault when i add debug text, and > SA is the only thing ive tested against that has problems, and the > code that has problems is evaled closures (possibly inside of a fork > on win32) so it could be almost anything.)

That's what I was alluding to.

> To be honest i would really love it if someone with stronger perl guts > and stonger C skills than me could give the code a good once over.
> Even if whoever does it knows nothing about the regex engines > particulars a sanity check would be really appreciated.

/me has to look at it more thoroughly then :)

> > > I'd love to see this get into perl, even if just as an option enabled > > > through a "use" pragma. (in my opinion, if your regexp will benefit > > > from a trie, you will know that in advance.)
> > > > or a new end-of-regexp switch. But if it works well I don't see why > > this shouldn't be optimized by default (although avoiding slowdowns > > in some cases would even be better -- do they affect regexp compilation > > or execution ?)
> > Ok, in comp sci terms the cost of construction the trie is > proportional to the number of characters in the trie. So IMO the cost > is negligable. In what is the common case (a short list of short > words of relatively few unique characters) you probably wouldnt even > notice the slowdown as the overall match time would improve more than > the construct time degrades.

Cool.

> In human terms: there is a slight slowdown of compilation, that is > more than made up with the execution time differences. IMO most times > the optimisation will kick in the size of the alternation being > optimized will be so small as to have negligable construction costs.
> The only time I could see the construction costs being an issue would > be with things like:
> > "a"=~/a|very|extremely|long|list|of|words|that|could|match|but|the|first|one|will|always|match|so|the|regex|is|a|bit|of|a|waste|anyway/;

> > The trouble is the only way we can "opt-out" of the optimisation > sensibly is if we know what string we are matching against at the > compile time of the regex, since the regex engine has no access to > such knowledge we cant really do this.
> > Anyway, the results I show are that even simple subpatterns benefit > from conversion to a Trie and it happens automatically so i dont see > the need of a switch except possibly to force the disabling of the > optimisation for some reason. Im not really sure how that would work > yet tho.

Agreeing.

> I really hope that some of you guys try the patch out. Some hands on > feedback would be appreciated, especially as im on Win32 and have no > real access to *nix right now.

At least on Linux, all tests pass.


Date: Wed, 16 Feb 2005 17:36:22 +0100

Subject: Re: Trie optimisation of simple alternations for blead perl.
From: demerphq@no-spam (Demerphq)
On Wed, 16 Feb 2005 17:30:12 +0100, Rafael Garcia-Suarez <rgarciasuarez@no-spam> wrote:
> [re cc'ing p5p]

Dangit, sorry about that. Still getting used to writing mails in gmail.

> demerphq wrote:
> > On Wed, 16 Feb 2005 16:30:47 +0100, Rafael Garcia-Suarez > > > I was waiting for Hugo's advice on this patch.
> > > Also, it would be nice to have something that works with /i, if that can > > > be done reliably.
> >
> > Its done. I hope reliably too, but there is this annoying segfault > > with SA that i need to figure out. While it does appear to be related > > to my code im a bit pushed to come up with an explanation for why > > slight changes to code around the regex could change the segfault > > behaviour. (Note i cant get SA to segfault when i add debug text, and > > SA is the only thing ive tested against that has problems, and the > > code that has problems is evaled closures (possibly inside of a fork > > on win32) so it could be almost anything.)
> > That's what I was alluding to.

Yeah fair enough. Im working on it though. :-)
> > > To be honest i would really love it if someone with stronger perl guts > > and stonger C skills than me could give the code a good once over.
> > Even if whoever does it knows nothing about the regex engines > > particulars a sanity check would be really appreciated.
> > /me has to look at it more thoroughly then :)

Thanks that would be really appreciated.

> > I really hope that some of you guys try the patch out. Some hands on > > feedback would be appreciated, especially as im on Win32 and have no > > real access to *nix right now.
> > At least on Linux, all tests pass.

Is that the second patch i posted or the first Raphael?
(trie_5.patch.gz is the latest)

Cheers,
yves
-- First they ignore you, then they laugh at you, then they fight you,
then you win.
+Gandhi

Date: Wed, 16 Feb 2005 17:40:09 +0100

Subject: Re: Trie optimisation of simple alternations for blead perl.
From: rgarciasuarez@no-spam (Rafael Garcia-Suarez)
demerphq wrote:
> > [re cc'ing p5p]
> > Dangit, sorry about that. Still getting used to writing mails in gmail.

Damn automatic gmail reply-to, as well.

> > At least on Linux, all tests pass.
> > Is that the second patch i posted or the first Raphael?
> (trie_5.patch.gz is the latest)

The second.