On Tue, Feb 15, 2005 at 08:19:10PM +0000, Nicholas Clark wrote:
> I think that the problem is that the two people best qualified to comment on
> this (Dave and Hugo) are both busy at the moment.
Oy, I seem to have been promoted in my absence!
I know practically nothing about the internals of the regex engine - I'm
just tinkering with the outside edges of it as regards the (?{...}) stuff.
Dave
(who's just finished lighting Othello, and who will be lighting Carpe
Jugulum in three weeks' time, and who is still wondering how he was
volunteered for both)
--
"Emacs isn't a bad OS once you get used to it.
It just lacks a decent editor."
On Tue, Feb 15, 2005 at 09:50:13PM +0000, Dave Mitchell wrote:
> (who's just finished lighting Othello, and who will be lighting Carpe
> Jugulum in three weeks' time, and who is still wondering how he was
> volunteered for both)
By not following Rincewind's advice? (oh. you said Carpe Jugulum, not Mort)
Nicholas Clark
Justin Mason wrote:
> FWIW, this looks like it'd be excellent for SpamAssassin ;)
>
> I haven't had much time to look over the implementation, and I'm not
> really any use for reviewing it from a p5p POV due to lack of familiarity
> with perl internals, but the benchmark figures look fantastic and the
> implementation details sound good.
I was waiting for Hugo's advice on this patch.
Also, it would be nice to have something that works with /i, if that can
be done reliably.
> I'd love to see this get into perl, even if just as an option enabled
> through a "use" pragma. (in my opinion, if your regexp will benefit
> from a trie, you will know that in advance.)
or a new end-of-regexp switch. But if it works well I don't see why
this shouldn't be optimized by default (although avoiding slowdowns
in some cases would even be better -- do they affect regexp compilation
or execution ?)
[re cc'ing p5p]
demerphq wrote:
> On Wed, 16 Feb 2005 16:30:47 +0100, Rafael Garcia-Suarez
> > I was waiting for Hugo's advice on this patch.
> > Also, it would be nice to have something that works with /i, if that can
> > be done reliably.
>
> Its done. I hope reliably too, but there is this annoying segfault
> with SA that i need to figure out. While it does appear to be related
> to my code im a bit pushed to come up with an explanation for why
> slight changes to code around the regex could change the segfault
> behaviour. (Note i cant get SA to segfault when i add debug text, and
> SA is the only thing ive tested against that has problems, and the
> code that has problems is evaled closures (possibly inside of a fork
> on win32) so it could be almost anything.)
That's what I was alluding to.
> To be honest i would really love it if someone with stronger perl guts
> and stonger C skills than me could give the code a good once over.
> Even if whoever does it knows nothing about the regex engines
> particulars a sanity check would be really appreciated.
/me has to look at it more thoroughly then :)
> > > I'd love to see this get into perl, even if just as an option enabled
> > > through a "use" pragma. (in my opinion, if your regexp will benefit
> > > from a trie, you will know that in advance.)
> >
> > or a new end-of-regexp switch. But if it works well I don't see why
> > this shouldn't be optimized by default (although avoiding slowdowns
> > in some cases would even be better -- do they affect regexp compilation
> > or execution ?)
>
> Ok, in comp sci terms the cost of construction the trie is
> proportional to the number of characters in the trie. So IMO the cost
> is negligable. In what is the common case (a short list of short
> words of relatively few unique characters) you probably wouldnt even
> notice the slowdown as the overall match time would improve more than
> the construct time degrades.
Cool.
> In human terms: there is a slight slowdown of compilation, that is
> more than made up with the execution time differences. IMO most times
> the optimisation will kick in the size of the alternation being
> optimized will be so small as to have negligable construction costs.
> The only time I could see the construction costs being an issue would
> be with things like:
>
> "a"=~/a|very|extremely|long|list|of|words|that|could|match|but|the|first|one|will|always|match|so|the|regex|is|a|bit|of|a|waste|anyway/;
>
> The trouble is the only way we can "opt-out" of the optimisation
> sensibly is if we know what string we are matching against at the
> compile time of the regex, since the regex engine has no access to
> such knowledge we cant really do this.
>
> Anyway, the results I show are that even simple subpatterns benefit
> from conversion to a Trie and it happens automatically so i dont see
> the need of a switch except possibly to force the disabling of the
> optimisation for some reason. Im not really sure how that would work
> yet tho.
Agreeing.
> I really hope that some of you guys try the patch out. Some hands on
> feedback would be appreciated, especially as im on Win32 and have no
> real access to *nix right now.
At least on Linux, all tests pass.
On Wed, 16 Feb 2005 17:30:12 +0100, Rafael Garcia-Suarez
<rgarciasuarez@no-spam> wrote:
> [re cc'ing p5p]
Dangit, sorry about that. Still getting used to writing mails in gmail.
> demerphq wrote:
> > On Wed, 16 Feb 2005 16:30:47 +0100, Rafael Garcia-Suarez
> > > I was waiting for Hugo's advice on this patch.
> > > Also, it would be nice to have something that works with /i, if that can
> > > be done reliably.
> >
> > Its done. I hope reliably too, but there is this annoying segfault
> > with SA that i need to figure out. While it does appear to be related
> > to my code im a bit pushed to come up with an explanation for why
> > slight changes to code around the regex could change the segfault
> > behaviour. (Note i cant get SA to segfault when i add debug text, and
> > SA is the only thing ive tested against that has problems, and the
> > code that has problems is evaled closures (possibly inside of a fork
> > on win32) so it could be almost anything.)
>
> That's what I was alluding to.
Yeah fair enough. Im working on it though. :-)
>
> > To be honest i would really love it if someone with stronger perl guts
> > and stonger C skills than me could give the code a good once over.
> > Even if whoever does it knows nothing about the regex engines
> > particulars a sanity check would be really appreciated.
>
> /me has to look at it more thoroughly then :)
Thanks that would be really appreciated.
> > I really hope that some of you guys try the patch out. Some hands on
> > feedback would be appreciated, especially as im on Win32 and have no
> > real access to *nix right now.
>
> At least on Linux, all tests pass.
Is that the second patch i posted or the first Raphael?
(trie_5.patch.gz is the latest)
Cheers,
yves
--
First they ignore you, then they laugh at you, then they fight you,
then you win.
+Gandhi
demerphq wrote:
> > [re cc'ing p5p]
>
> Dangit, sorry about that. Still getting used to writing mails in gmail.
Damn automatic gmail reply-to, as well.
> > At least on Linux, all tests pass.
>
> Is that the second patch i posted or the first Raphael?
> (trie_5.patch.gz is the latest)
The second.