Previous Thread
Next Thread
Print Thread
INSTR regex bug #36342 11 Jul 23 02:32 PM
Joined: Nov 2006
Posts: 2,192
S
Stephen Funkhouser Online Content OP
Member
OP Online Content
Member
S
Joined: Nov 2006
Posts: 2,192
The following is INSTR() is returns -94 in 6.5.1734.3, which works as expected in 6.5.1731.2 returning 3

It also does not work with 6.5.1733.0. I assume this was broken with the compiler templating changes.

Code
FUNCTION fn'prtinvadvo_get'options(link'printer'name as s0,active as s1,raryblob as x0) as f8
  active ="Y"
  Trace.Print " active=["+active+"] instr=["+INSTR(1,"[yY]",active,0)+"]"


Last edited by Stephen Funkhouser; 11 Jul 23 02:38 PM.

Stephen Funkhouser
Diversified Data Solutions
Re: INSTR regex bug [Re: Stephen Funkhouser] #36345 11 Jul 23 04:47 PM
Joined: Jun 2001
Posts: 11,645
J
Jack McGregor Online Content
Member
Online Content
Member
J
Joined: Jun 2001
Posts: 11,645
Right - there was a shake up in the regex code. I'll figure it out shortly. (On a side note, there's actually a new version of the PCRE library implementation, but it's not upward compatible with the version we're using, making the conversion a bit messy. I started down that path anyway but then ran into some complications matching the Windows and Linux implementations and decided to put that off.)

Re: INSTR regex bug [Re: Stephen Funkhouser] #36346 11 Jul 23 07:01 PM
Joined: Jun 2001
Posts: 11,645
J
Jack McGregor Online Content
Member
Online Content
Member
J
Joined: Jun 2001
Posts: 11,645
The problem here is more of a design flaw than a simple bug. It has actually been around forever, but got much worse when the maximum number of precompiled patterns was increased from 20 to 128 (in order to accommodate more expansive transcopy operations). The issue is that in REGEX mode, a pattern consisting of a single byte was taken as a reference to the pre-compiled pattern indexed to that byte's ascii value. In your example, the pattern "Y" is being confused with precompiled pattern #59, which doesn't exist (hence the error -94 for no such precompiled pattern). Previously the confusion wouldn't have occurred because 59 is greater than the prior limit of 20.

It's tempting to just roll back the limit to 20 where it was before, but that doesn't truly solve the problem, since it's quite reasonable to search for a single control character, e.g. instr(1,s$,chr(13)).

The reason why this never came up before is a combination of: 1) searching for a control character is a lot less common than for a character like "Y"; and more importantly: 2) it doesn't really make sense to use REGEX search mode with a single character pattern. So it would be unlikely / unusual / unhelpful to specify the flags argument to INSTR() in such a case. (Nearly all regex patterns are multi-character, but things would really get confusing if you were searching for ".", which would match any character if the flags argument was specified.)

And unfortunately, 0, is simply the code for the default regex option set, as opposed to disabling it.

The simple fix for your particular example is to remove the hard-coded flags argument of 0, since presumably you don't need or want regex matching with a single character pattern, But in general, a user-defined function wrapper for instr() that allows the flags argument to be passed in is going to have this problem, since there is no explicit flag to disable regex mode. I suppose we could add one, but it's unclear how practical that would be, since it would require changing existing code to take advantage of it, and if you're going to do that, you could just as easily work around the issue in another way, such as by adding your own boolean parameter to control whether to pass 3 or 4 args to instr().

If you have any suggestions on the best way to climb out of this hole, I'm all ears. (or eyes).



Last edited by Jack McGregor; 11 Jul 23 07:04 PM.
Re: INSTR regex bug [Re: Stephen Funkhouser] #36347 11 Jul 23 07:22 PM
Joined: Nov 2006
Posts: 2,192
S
Stephen Funkhouser Online Content OP
Member
OP Online Content
Member
S
Joined: Nov 2006
Posts: 2,192
The pattern is "[yY]" to match either lower-case or upper-case. Not a single character partner; hence, the regex flag is enabled.

I'm not sure I understand how/why the INSTR() with regex is using any of the precompiled pattern slots? Maybe it's necessary because of how the PCRE library is designed.

Without further insight I don't have any ideas.

Last edited by Stephen Funkhouser; 11 Jul 23 07:22 PM.

Stephen Funkhouser
Diversified Data Solutions
Re: INSTR regex bug [Re: Stephen Funkhouser] #36348 11 Jul 23 09:07 PM
Joined: Jun 2001
Posts: 11,645
J
Jack McGregor Online Content
Member
Online Content
Member
J
Joined: Jun 2001
Posts: 11,645
Code
FUNCTION fn'prtinvadvo_get'options(link'printer'name as s0,active as s1,raryblob as x0) as f8
  active ="Y"
  Trace.Print " active=["+active+"] instr=["+INSTR(1,"[yY]",active,0)+"]"

Actually, at least in terms of the doc ...
Quote

INSTR (spos, subject, pattern, {flags})
...I think "[yY]" is the subject here and active is the pattern. And it's a single character. It may be that you got the two parameters transposed, in which case the problem goes away. Here at least, but not in general.

Admittedly, it was probably an unnecessary 'feature' to allow INSTR() to tap into precompiled REGEX patterns. It ended up that way mainly because it plugs directly into the REGEX.SBR interface, which, for better or worse, uses this shortcut to distinguish new patterns from precompiled pattern numbers.

It's a bit of a hack, but as a practical matter, I could add some logic to to the four-parameter version if INSTR, i.e. the REGEX version, to revert back to the standard non-REGEX version if the pattern is a single byte, and instead require the programmer to specify a new flag to force the single-byte pattern to be interpreted as a precompiled pattern index.

The downside is that it would break any existing properly written and functioning program that for whatever reason, is taking advantage of pre-compiled patterns with INSTR.

Re: INSTR regex bug [Re: Stephen Funkhouser] #36349 11 Jul 23 09:36 PM
Joined: Nov 2006
Posts: 2,192
S
Stephen Funkhouser Online Content OP
Member
OP Online Content
Member
S
Joined: Nov 2006
Posts: 2,192
I transposed those 12 years ago...Whoops!

That will resolve this in this function, but I'm guessing this issue might come back to bite someone. Tricky one to solve I'm sure.

Is it possible to not use precompiled patterns? If so, would that fix this?

Last edited by Stephen Funkhouser; 11 Jul 23 09:42 PM.

Stephen Funkhouser
Diversified Data Solutions
Re: INSTR regex bug [Re: Stephen Funkhouser] #36350 11 Jul 23 09:46 PM
Joined: Jun 2001
Posts: 11,645
J
Jack McGregor Online Content
Member
Online Content
Member
J
Joined: Jun 2001
Posts: 11,645
That's funny (I didn't start making mistakes like that until maybe 11 years ago) laugh

Oddly enough, it works both ways as long as your active (pattern) parameter was just one character. (Except that it would allow "[" and "]", but that probably wasn't a very likely use case).

So, maybe the error was actually a good thing.

Still, I should probably add a flag or at least a warning to the documentation to clear up the ambiguity. But maybe it's not quite urgent.

Re: INSTR regex bug [Re: Stephen Funkhouser] #36351 11 Jul 23 10:15 PM
Joined: Nov 2006
Posts: 2,192
S
Stephen Funkhouser Online Content OP
Member
OP Online Content
Member
S
Joined: Nov 2006
Posts: 2,192
I can only find this one function in our code affected by this. So, probably not urgent.

Thanks for the followup.

Last edited by Stephen Funkhouser; 11 Jul 23 10:15 PM.

Stephen Funkhouser
Diversified Data Solutions
Re: INSTR regex bug [Re: Stephen Funkhouser] #36865 08 Dec 23 10:43 PM
Joined: Nov 2006
Posts: 2,192
S
Stephen Funkhouser Online Content OP
Member
OP Online Content
Member
S
Joined: Nov 2006
Posts: 2,192
I came across this issue again. We have a boolean search string parsing module, and we found that using it with a string like "t post red" causes the first token "t' to fail when compared to a possible match string.

When token$ = "t"
Using INSTR(spos, subject$, token$, PCRE_CASELESS) returns -94.

I have no idea what to suggest, as I don't fully understand the precompile pattern issue. Just wanted to make sure this doesn't get forgotten.

Changing to .fn = INSTR(spos, UCS(subject$), UCS(token$)), will probably be fine in this case because the use of REGEX within a user provided pattern token is exceedingly rare.

Last edited by Stephen Funkhouser; 08 Dec 23 10:45 PM.

Stephen Funkhouser
Diversified Data Solutions
Re: INSTR regex bug [Re: Stephen Funkhouser] #36866 08 Dec 23 11:40 PM
Joined: Jun 2001
Posts: 11,645
J
Jack McGregor Online Content
Member
Online Content
Member
J
Joined: Jun 2001
Posts: 11,645
I added the PCREX_NOT_PRECOMPILED symbol to regex.def with the idea that it could be used when specifying a single-char pattern to distinguish between a true pattern and a numeric reference to a precompiled pattern #. But I failed to follow through and automatically set it for INSTR() because I was afraid to break some existing code.

And unfortunately, the documentation for INSTR explicitly talks about not only using precompiled patterns but even compiling them...
Quote
Precompiling Patterns: Although the 20 numbered precompiled patterns (1-20) may be shared between REGEX.SBR and INSTR(), and thus you could use REGEX.SBR to pre-compile patterns later used with INSTR(), if you prefer to use INSTR() itself to pre-compile patterns, set the PCREX_PRECOMPILE (&h80000000) bit in the FLAGS parameter and set STPOS to the pattern number (like PATNO in REGEX). The SUBJECT string will be ignored (can be ""). On success, the return value of the function will equal STPOS; else refer to the STATUS codes listed under REGEX.)

Using Precompiled Patterns: As with REGEX.SBR, you can specify a precompiled pattern by setting PATTERN = CHR$(n) where n is the pattern number (1-20). As with REGEX.SBR, whenever the same pattern is used in consecutive calls, the previously compiled version is automatically used (making pre-compilation unnecessary except when alternating between multiple patterns).

But as it stands, you'd have to set that flag in all your INSTR() calls to make sure that any single-char pattern you pass isn't treated as a precompiled pattern. I agree that it isn't an ideal solution, since it would require you tracking down all your INSTR() references and adding the flag. (Or, as long as you used the PCRE_xxx symbols and not literal flag values, I suppose you could edit your regex.def file to add that bit to all the flags, in which case you'd just need to recompile. But that would then make it difficult to actually use precompiled patterns with XCALL REGEX.

The only other fix that wouldn't involve changing code and wouldn't risk breaking existing code would be to introduce a MIAME.INI flag (OPTIONS=NOINSTRPC?) that forced INSTR() to apply the new PCREX_NOT_PRECOMPILED flag. That seems a bit over the top though. Yet another option would be to accept a small risk of breaking code and limit the use of precompiled patterns in INSTR() to just, say, chr(1) thru chr(5). Or, maybe someone has another suggestion?

Re: INSTR regex bug [Re: Stephen Funkhouser] #36867 11 Dec 23 04:16 PM
Joined: Nov 2006
Posts: 2,192
S
Stephen Funkhouser Online Content OP
Member
OP Online Content
Member
S
Joined: Nov 2006
Posts: 2,192
We haven't used pre-compiled patterns with REGEX or INSTR(), so automatically disabling them would be fine for us. Can't speak for other developers though.

Most of our code does not report any INSTR() errors, not sure why that is other than it's quite verbose to add that code everywhere.

This does feel like a pretty big trap to fall into with what could be no message as to why the INSTR() isn't matching.

On a side note, how much of a performance improvement would pre-compiled patterns be?


Stephen Funkhouser
Diversified Data Solutions
Re: INSTR regex bug [Re: Stephen Funkhouser] #36868 11 Dec 23 06:29 PM
Joined: Jun 2001
Posts: 11,645
J
Jack McGregor Online Content
Member
Online Content
Member
J
Joined: Jun 2001
Posts: 11,645
OK, here's the plan:

1) I've already limited (as of 1752.4) the interpretation of single-character patterns in INSTR() as being pre-compiled pattern identifiers (rather than normal patterns) to just ASCII 1 thru 5. That allows up to 5 pre-compiled patterns (probably several times more than any one uses) and limits the possible confusion to just cases where you really were looking for chr(1) thru chr(5). (Note that in those cases, I can't think of any advantage to Regex, but obviously if you're using a common wrapper for all your INSTR() searches, you might be invoking Regex indiscriminately.)

2) I'll put on the to-do list to add a MIAME.INI OPTION to disable pre-compiled patterns entirely. (Unfortunately that will require inventing yet another error code for an attempt to precompile a pattern when they've been disabled, and opens up the possibility of ignoring and mis-diagnosing the problem way down the road.)

As for the benefits of pre-compilation, that's a good question. I guess I thought it was important, since it is a significant aspect of the PCRE library. In fact, pre-compilation is required, but it's handled internally by INSTR() and REGEX.SBR. Compiling a pattern is like compiling a program -- it reduces it to a format that is easier and more efficient to process at runtime, and also gets most of the error-handling logic out of the way. But I've never tried to quantify it. And obviously it would depend on the complexity. And the most common situation of a using a single pattern against a lot of subject strings is automatically covered -- both INSTR and REGEX automatically remember and re-use the pre-compiled pattern if the actual pattern doesn't change from one call to the next. So the situation would only arise if you were running a bunch of subject strings against multiple patterns.

I guess this would be a good test for RUNPROF. I'll see if I can come up with a couple good patterns and gets some numbers for you.


Moderated by  Jack McGregor, Ty Griffin 

Powered by UBB.threads™ PHP Forum Software 7.7.3