Module talk:bg-common/testcases

From Wiktionary, the free dictionary

Monosyllabic stress for prefixes and suffixes


Hi @Benwing2,

I was hoping you could help me understand a couple of implementation details in Module:bg-common. I've been adding test cases for the module, and I've noticed that:

  • add_monosyllabic_stress doesn't add stress to monosyllabic suffixes, but it does to monosyllabic prefixes;
  • remove_monosyllabic_stress doesn't remove stress from monosyllabic suffixes, but it does remove it from monosyllabic prefixes.

Since you authored the module, I was hoping you could tell me why that is. Also, looking at the code, I'd say it assumes inputs to be in NFD form, so that the acute accent is decomposed as a separate character in the string - is that correct?

Thanks,

Chernorizets (talk) 05:03, 29 December 2023 (UTC)

@Chernorizets I am not actually sure why those functions add and remove stress to/from prefixes. Not doing it for suffixes is correct, but they probably shouldn't do it for prefixes either. This module was based on Module:ru-common and inherited this behavior from that module. I think maybe the reason I didn't add a special case for prefixes is that prefixes don't normally pass through these functions; the functions are used during conjugation and declension, and it doesn't really make sense to conjugate or decline a prefix. But I think you'll find that similar Romance-language modules do have special cases for prefixes, and I think we should do so here too. As for NFD form, it turns out that there are no precomposed Cyrillic characters involving acute accents, so it doesn't really matter. The corresponding Russian module Module:ru-common has a comment at the top that explicitly indicates that transliterations must be in NFD form, but Cyrillic need not be, and the functions in that module take care to correctly handle ѐ and ѝ, which have precomposed versions. It looks like this doesn't happen in the Bulgarian module because grave accents aren't touched at all. Benwing2 (talk) 05:19, 29 December 2023 (UTC)
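The point about precomposed characters can be checked outside the wiki. The following is a minimal Python sketch, not code from Module:bg-common (which is Lua); the names BG_VOWELS and composes are made up for illustration. It asks which Bulgarian vowels merge with a combining acute or grave under NFC.

```python
import unicodedata

ACUTE = "\u0301"  # combining acute accent, used here as the stress mark
GRAVE = "\u0300"  # combining grave accent
BG_VOWELS = "аеиоуъюя"  # hypothetical constant: the Bulgarian vowel letters


def composes(base, mark):
    """Return True if NFC merges base + combining mark into one code point."""
    return len(unicodedata.normalize("NFC", base + mark)) == 1


# No Bulgarian vowel has a precomposed form with an acute accent ...
acute_precomposed = [v for v in BG_VOWELS if composes(v, ACUTE)]
# ... but two of them have a precomposed form with a grave accent.
grave_precomposed = [v for v in BG_VOWELS if composes(v, GRAVE)]

print(acute_precomposed)  # []
print(grave_precomposed)  # ['е', 'и']
```

So for the vowels that can carry a stress mark, the acute always stays a separate code point, while е and и with a grave collapse into ѐ and ѝ. Outside the vowels, the Macedonian letters ѓ and ќ do decompose into г or к plus an acute, which should not matter for stress marking.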
@Benwing2 thanks for the context. Yeah I'm not sure why the ѝ test case for add_monosyllabic_stress works the way it does - I would've expected и́̀ or ѝ́, but neither of those is the outcome, even when I write it as и + combining grave accent rather than the precomposed character. Good to know about the acute accents - I was going by the fact that one of the first things we do in Module:bg-pronunciation is get the NFD form of the input. From your comments, it sounds like that step might be of little utility, since we end up having to re-compose breves (e.g. й), and the acute accent we use to indicate stress is not composed anyway. The only other "pronunciation" characters are the grave accent and the dot under, the latter of which I really wish we could ditch in favor of just using |endschwa=.
RE: expected callers - I'd feel a lot better if code that goes in a "common" module makes no assumptions about who's calling it; otherwise, I have to challenge whether it's truly common. I'm happy to make any reasonable changes to common utility methods to make sure they are caller-independent.
RE: "not doing it for suffixes is correct" - can you elaborate? I want to write documentation for the functions in this module (and, over time, the other BG modules), so I want to capture things like that.
Thanks,
Chernorizets (talk) 08:44, 29 December 2023 (UTC)
@Chernorizets The test case "works" because MediaWiki converts all text to NFC form upon saving the page, so ѝ gets turned into a single precomposed character even if you type it as two characters, and that precomposed character isn't listed among the set of vowels, so it doesn't get an accent added. The converting to NFD followed by recomposing is modeled after the corresponding step in Russian, which is done to handle acutes and graves in both Cyrillic and (especially) the Latin transliteration. See Module:ru-common#L-236 and the comment below it. This is of less utility for Bulgarian because we don't currently support manual transliteration in Bulgarian (and maybe never will, since it's probably not necessary). Since the only grave-accented Cyrillic characters that are precomposed are ѐ and ѝ, it might make sense to rewrite the decompose/recompose functionality to just handle those two characters. As for treating suffixes specially, at least in Russian, some suffixes are habitually stressed, e.g. -но́й (-nój), and some are never stressed, e.g. -ный (-nyj), and some can be either, e.g. -чатый (-čatyj) or -ча́тый (-čátyj). This means that the stress mark is significant in suffixes, including monosyllabic ones, so we don't want to remove it or add it automatically. See also this comment from Module:ru-common:
-- Remove acute and grave accents in monosyllabic words; don't affect
-- diaeresis (composed or uncomposed) because it indicates a change in vowel
-- quality, which still applies to monosyllabic words. Don't change suffixes,
-- where a "monosyllabic" stress is still significant (e.g. -ча́т short
-- masculine of -ча́тый, vs. -́чат short masculine of -́чатый).
-- NOTE: Translit must already be decomposed! See comment at top.
I'm not sure whether the same thing applies to Bulgarian, but I expect it does.
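To make the affix question concrete, here is a rough Python sketch of the behavior proposed in this thread, where both prefixes and suffixes are left alone. It is not the module's actual Lua code: the function names mirror the ones discussed above, but the bodies, the VOWELS constant, and the rule "an affix is a form that starts or ends with a hyphen" are assumptions made for illustration.

```python
import re

ACUTE = "\u0301"  # combining acute accent (stress mark)
VOWELS = "аеиоуъюяАЕИОУЪЮЯ"  # assumed vowel set; precomposed ѝ is NOT in it
VOWEL_RE = re.compile(f"[{VOWELS}]")


def is_monosyllabic(word):
    """A word counts as monosyllabic if it contains exactly one vowel letter."""
    return len(VOWEL_RE.findall(word)) == 1


def is_affix(word):
    """Assumption: prefixes end with a hyphen, suffixes start with one."""
    return word.startswith("-") or word.endswith("-")


def add_monosyllabic_stress(word):
    """Add an acute after the only vowel, unless the form is an affix."""
    if is_affix(word) or not is_monosyllabic(word) or ACUTE in word:
        return word
    return VOWEL_RE.sub(lambda m: m.group(0) + ACUTE, word, count=1)


def remove_monosyllabic_stress(word):
    """Drop the acute from a monosyllabic word, unless the form is an affix."""
    if is_affix(word) or not is_monosyllabic(word):
        return word
    return word.replace(ACUTE, "")
```

Under these assumptions, add_monosyllabic_stress("дом") returns до́м, while "-ен" and "по-" come back unchanged. The precomposed ѝ (U+045D) also comes back unchanged, because it is a single code point that is not in the vowel set, which matches the explanation of the ѝ test case above.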
As for dot-under vs. |endschwa=, generally I actually prefer the former because the latter doesn't work well in multiword expressions, where a word somewhere in the middle may require |endschwa= and it gets awkward trying to specify which word using a separate param. This issue comes up in Russian, for example, where we have a separate |gem= param to indicate whether geminate consonants are pronounced geminate, non-geminate or optionally geminate, but it's problematic with multiple words or even with single words containing multiple geminate consonants.
As for expectations of callers, in Module:ru-common I expect callers to pass in decomposed text because otherwise every function would have to do the decomposition itself at the beginning, which would be expensive and wasteful (and it would also get quite complex if we want to be truly caller-independent and handle both decomposed and pre-composed text, because we'd have to detect whether the text is decomposed or composed, and in the former case do nothing but in the latter case we'd have to decompose, do the operation and recompose at the end). Many of the operations in Module:ru-common are complicated because they have to be able to do things like applying reduction, dereduction, destressing (which also converts ё to е), etc. not only to a Cyrillic word but to a combination of a Cyrillic word and its manual transliteration; for various reasons this often requires splitting up a word into its syllables, doing operations syllable-by-syllable in parallel with the Cyrillic and its transliteration, and pasting the results back together. These sorts of operations need to be done on decomposed text because they need to check for accents separately from the vowel they're attached to. These operations are simpler in Bulgarian without manual transliteration to worry about, but still I'd be wary of trying to make everything work with both decomposed and pre-composed text; easier just to require that callers decompose their text at the very beginning, and recompose if necessary at the end (however, since MediaWiki automatically converts text to NFC upon saving, it's often not necessary to recompose; this is only needed if you allow the operation in question to be used as part of some other operation that might do some additional processing on the text and expect it to be composed). Benwing2 (talk) 09:19, 29 December 2023 (UTC)
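The "decompose once at the start, recompose once at the end" convention can be illustrated as follows. This is a Python sketch with hypothetical function names (strip_acute, process), not the structure of Module:ru-common or Module:bg-common: the inner helper assumes NFD input, and only the entry point normalizes.

```python
import unicodedata

ACUTE = "\u0301"  # combining acute accent


def strip_acute(decomposed_text):
    """Inner helper: assumes the caller already passed NFD text."""
    return decomposed_text.replace(ACUTE, "")


def process(text):
    """Entry point: decompose once, run the helper, recompose once."""
    decomposed = unicodedata.normalize("NFD", text)
    result = strip_acute(decomposed)
    return unicodedata.normalize("NFC", result)


# Precomposed Latin á (U+00E1), as it might appear in a transliteration:
print(strip_acute("\u00e1") == "\u00e1")  # True: the helper sees no separate acute
print(process("\u00e1"))                  # a
```

The helper alone silently does nothing on precomposed input, which is why the contract has to be stated for callers. On Cyrillic, process("ма\u0301й") returns май, with the й recomposed by the final NFC step.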
@Benwing2 thanks for the exhaustive answer! Manual transliteration is not needed for standard Bulgarian, although it may come up one day if we decide to give proper transliterations for e.g. pre-reform spellings of words. Prior to 1945, for example, all Bulgarian words with a word-final consonant would be written with a silent ъ (e.g. домъ instead of modern дом), and the current algorithm would transliterate the silent ъ as its fully pronounced counterpart.
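One way the pre-reform case could be handled is to drop the silent word-final ъ before transliterating. The following Python sketch is purely hypothetical: nothing like drop_silent_final_hard_sign exists in the current modules as far as this thread shows, the consonant list is an assumption, and only lowercase text and the ъ rule mentioned above are covered.

```python
import re

CONSONANTS = "бвгджзйклмнпрстфхцчшщ"  # assumed lowercase consonant letters

# A ъ that directly follows a consonant and ends the word.
FINAL_SILENT_HARD_SIGN = re.compile(f"(?<=[{CONSONANTS}])ъ\\b")


def drop_silent_final_hard_sign(text):
    """Remove word-final ъ after a consonant; leave every other ъ alone."""
    return FINAL_SILENT_HARD_SIGN.sub("", text)


print(drop_silent_final_hard_sign("домъ"))  # дом
print(drop_silent_final_hard_sign("ъгълъ"))  # ъгъл
```

A ъ at the start or in the middle of a word is kept, since there it is a real vowel.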
RE: caller expectations - definitely agree on things like "assume the input is already decomposed"; I was referring more to things like "this function is (normally) only called by module X".
RE: suffixes - what I gather from your explanation is that the monosyllabic stress functions should be meaning-preserving and grammatical, i.e. their only effect is to (optionally) indicate stress that's already inherently there, vs introduce it where it doesn't belong, thereby either breaking a grammatical rule, or affecting semantics by adding it. Is that understanding correct? As for Bulgarian suffixes, I know that some could be either stressed or unstressed depending on the word, and it's possible that some could be just one but not the other, although I don't have good examples off the top of my head. Either way, the same is likely true for prefixes, which bolsters the case that prefixes should probably also be excluded.
RE: MediaWiki doing a NFC step behind the covers - sneaky! :-) Never would've guessed that.
RE: grave accents - I'll likely take a closer look at how we handle orthographic grave accents (i.e. instances where it's required in Bulgarian text). In addition to ѝ, there is the version of по̀ (more) when used by itself, and not as the comparative prefix for adjectives, i.e. "той е по̀ човек от теб" (he's more of a human being than you are). I don't think the current translit module would handle this correctly, which I will verify when I get to its tests; fortunately it would be an easy fix. I'll likely also centralize our various regexes for "is a vowel", "is a consonant" etc for Bulgarian in the common module, along with a few other constants (notably pronunciation notation) we tend to copy-paste. Chernorizets (talk) 09:56, 29 December 2023 (UTC)
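A centralized set of constants and predicates might look roughly like this. It is a Python sketch rather than Lua, and every name in it is an assumption; in particular, DOT_BELOW being U+0323 is a guess at the "dot under" notation. The has_grave helper shows why a grave check has to cope with both the precomposed ѝ and the always-decomposed по̀.

```python
import unicodedata

VOWELS = "аеиоуъюя"                    # assumed Bulgarian vowel letters
CONSONANTS = "бвгджзйклмнпрстфхцчшщ"  # assumed Bulgarian consonant letters
ACUTE = "\u0301"      # stress mark
GRAVE = "\u0300"      # orthographic grave, as in ѝ and по̀
DOT_BELOW = "\u0323"  # assumption: the "dot under" pronunciation mark


def is_vowel(ch):
    """True for a single Bulgarian vowel letter, in either case."""
    return len(ch) == 1 and ch.lower() in VOWELS


def is_consonant(ch):
    """True for a single Bulgarian consonant letter, in either case."""
    return len(ch) == 1 and ch.lower() in CONSONANTS


def has_grave(text):
    """Detect a grave accent whether it is precomposed or combining."""
    return GRAVE in unicodedata.normalize("NFD", text)


print(has_grave("\u045d"))     # True: precomposed ѝ
print(has_grave("по\u0300"))  # True: о + combining grave, never precomposed
print(has_grave("по"))        # False
```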