Module:User:Erutuon/patterns
- The following documentation is located at Module:User:Erutuon/patterns/documentation. [edit] Categories were auto-generated by Module:documentation. [edit]
- Useful links: root page • root page’s subpages • links • transclusions • testcases • sandbox
Contains a function that determines if a pattern will behave in exactly the same way in both the basic Lua string functions (string.find, string.match, string.gsub, string.gmatch
) and the Ustring functions (mw.ustring.find, mw.ustring.match, mw.ustring.gsub, mw.ustring.gmatch
). This assumes text is validly encoded in UTF-8, the encoding used by MediaWiki.
Beware: the function tells you that a pattern requires the Ustring functions if it contains character classes ("%w, %s, %d, ..."
. But this is not always true. It depends what characters you actually need the character class to match. For example, if you're matching language codes, which only contain ASCII, then the character class "%a"
only needs to match ASCII alphabetic characters ("[A-Za-z]"
(not, for instance, Greek alphabetic characters "αβγδεζηθ..."
), and the basic string function can be used.
Usage
[edit]Don't use this in an actual template-invoked function to decide between string
and mw.ustring
on the fly. That would be very inefficient. If you want, use it to test a pattern and decide whether string
will work just as well as mw.ustring
in a given instance. Switching to string
saves a noticeable amount of processing time if the function involves a lot of pattern-matching.
local UstringNotNeeded = require("Module:User:Erutuon/patterns").canUseString(pattern_to_check)
--> true if pattern will behave in string functions, false if it requires Ustring functions
The function log
sends a message to the log if a mw.ustring
function can be replaced by the corresponding string
function.
local export = {}
-- Non-ASCII bytes. Only \128-\191 and \194-\244 are actually used in UTF-8.
local nonASCIIByte = "[\128-\255]"
local nonASCIIChar = "[\194-\244][\128-\191]+"
-- Character classes that will match non-ASCII codepoints if they are used in
-- Ustring pattern-matching functions.
local ustringClass = "%%[aAcCdDlLpPsSuUwWxX]"
local function isPattInPatt(str, patt1, patt2)
for match in string.gmatch(str, patt1) do
if string.find(match, patt2) then
return true
end
end
return false
end
-- Function to determine whether a pattern will behave exactly the same in the
-- basic string functions as it does in the Ustring functions.
-- The Lua-implemented version of Ustring has a function that supposedly makes
-- this determination, but it's overly conservative (disqualifying a pattern
-- if it contains non-ASCII bytes). Not sure about the PHP version of Ustring
-- that is actually used by Scribunto.
-- This does not check that the string is well-formed UTF-8.
function export.canUseString(pattern)
assert(type(pattern) == "string", ("argument #1 to canUseString should be string, but is " .. type(pattern)))
-- Remove percent sign followed by anything besides a letter.
pattern = string.gsub(pattern, "[%%]%A", "")
-- If non-ASCII inside a set: "[αειου]", "[^αειου]"
if isPattInPatt(pattern, "%b[]", nonASCIIByte) then
return false
end
-- In Ustring, the classes listed in the pattern all contain multi-byte characters.
if string.find(pattern, ustringClass) then
return false
end
-- Quantifier following multi-byte character:
-- in basic string function, quantifiers act on bytes, not on UTF-8 characters.
if string.find(pattern, nonASCIIChar .. "[?*%-+]") then
return false
end
-- In basic string function, dot matches a single byte, not a UTF-8 character.
if string.find(pattern, "%.") then
return false
end
return true
end
function export.log()
for _, funcname in ipairs{ "find", "match", "gmatch", "gsub" } do
local old_func = mw.ustring[funcname]
mw.ustring[funcname] = function (str, patt, ...)
if export.canUseString(patt) then
mw.log(funcname, str, patt, "can use string")
end
return old_func(str, patt, ...)
end
end
end
return export