Module talk:pi-decl/noun
Nouns in -ar
Fixed Bug for Latin
@Octahedron80 The page bhātar is behaving strangely – the "b" is being dropped. —Aryamanarora (मुझसे बात करो) 20:26, 23 April 2016 (UTC)
- Fixed. By the way, I was not the one who added the 'ar' case. Please check the forms too. --Octahedron80 (talk) 00:04, 24 April 2016 (UTC)
- Thank you very much! I added the "ar" case, I was just wondering why it wasn't working. I used a Pali-language textbook, so I think that it's correct. —Aryamanarora (मुझसे बात करो) 03:49, 24 April 2016 (UTC)
Other Scripts
The "ar" case is implemented for Latin only! Whoever deals with irregular nouns in general might want to sweep the other scripts up for -ar. - RichardW57 (talk) 18:50, 15 September 2018 (UTC)
I implemented it for other scripts anyway. RichardW57 (talk) 20:17, 2 October 2018 (UTC)
Thoughts on Irregular Nouns
My view is that we would do better, from the coder's viewpoint at least, to define the endings in the Latin script, and then convert. There are some subtleties to consider for the Burmic scripts on the choice between round and tall AA - the choice between ဗာ and ဗါ has changed in recent centuries, and Christian literature (presumably not in Pali!) uses the old style. Lan Na is in flux. I would expect a word to be internally consistent. The commonest example in the Tipitaka is pabbājetabbā, and I would expect both instances of 'bbā' to be written the same.
There may be some script-dependent differences in declension; there are in conjugation.
I'm just sharing my thoughts for whoever picks up the inflections. I'll ping people or go to the grease pit when action is imminent. - RichardW57 (talk) 18:50, 15 September 2018 (UTC)
Correct Design
I believe the correct design is to set up a database of irregular forms so that they can be shared across scripts (inflection in different scripts is shown on different pages). This, however, requires modules, which would be rejected on the basis that they are too scary to be readily edited. RichardW57 (talk) 12:36, 5 October 2018 (UTC)
Proposed Design for Irregular Nouns
@AryamanA, @Wyang, @Octahedron80, @Erutuon:
As at present, a noun or pronoun will be assigned to a declension using the current 'ending' parameter - either default or explicit. Any set of case/number forms may be overridden by parameter names noms, nomp, ... or vocp. These parameter names are composed of the 3-letter case abbreviation and one-letter number abbreviation used by {{inflection of}}. The value will be the complete case form, including the stem. (Several irregular forms include modification of the stem.) If there is more than one case form, additional case forms can be supplied using, for example, inss2 to give a noun two forms for a case, e.g.:
{{pi-decl-noun|bala|g=n|inss=balena|inss2=balasā}}
We already have this mechanism in place for template {{pi-alt}} for alternative forms by script.
Parameters such as novoc can be used to suppress a case, in this case the vocative. This would be appropriate for many pronominals, for which a vocative is inappropriate and thus unattested. RichardW57 (talk) 12:36, 5 October 2018 (UTC)
I believe this design is consistent with adding footnotes; some forms may be limited in time or location. It is also compatible with bulk modifications to regular affixes, such as round v. tall AA, and shortening of the stem vowel. RichardW57 (talk) 12:36, 5 October 2018 (UTC)
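The parameter scheme proposed above can be sketched in plain Lua as follows. This is only an illustration, not the module's code: the function names, the full list of case abbreviations, and the treatment of `novoc=y` as the "true" value are all assumptions, and `novoc=y` is added to the bala example purely to exercise the suppression path.

```lua
-- Hypothetical sketch; names and structure are illustrative, not the module's code.
-- Assumed list of 3-letter case abbreviations (only some appear in the discussion).
local CASES = { "nom", "acc", "ins", "dat", "abl", "gen", "loc", "voc" }
local NUMBERS = { "s", "p" }

-- Gather explicit forms for one slot: args.inss, args.inss2, args.inss3, ...
local function collect_overrides(args, slot)
    local forms = {}
    if args[slot] then
        forms[#forms + 1] = args[slot]
    end
    local i = 2
    while args[slot .. i] do
        forms[#forms + 1] = args[slot .. i]
        i = i + 1
    end
    return forms
end

-- Build slot -> list of explicit forms, plus the set of suppressed cases.
local function read_args(args)
    local overrides, suppressed = {}, {}
    for _, case in ipairs(CASES) do
        -- Assumption: "y" is the value that switches a "no<case>" flag on.
        if args["no" .. case] == "y" then
            suppressed[case] = true
        end
        for _, number in ipairs(NUMBERS) do
            local slot = case .. number
            local forms = collect_overrides(args, slot)
            if #forms > 0 then
                overrides[slot] = forms
            end
        end
    end
    return overrides, suppressed
end

local overrides, suppressed = read_args({
    "bala", g = "n", inss = "balena", inss2 = "balasā", novoc = "y",
})
print(table.concat(overrides.inss, ", "))
```

Keeping the explicit forms in a list per slot means that a later step can decide how to combine them with the regular forms without caring how many were supplied.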
The idea of only being able to completely replace the form for a case-number combination is motivated by simplicity. However, should we allow a family of parameters such as <abls_mod> to extend the list of forms? Thus we could have
{{pi-decl-noun|vagga|abls=vaggaso|abls_mod=after|g=m}}
to add vaggaso to the three ablative singular forms we would normally display. The values for abls_mod would be before, after and the default, replace. ('Replace' would only take effect if a replacement was provided.) Should 'after' be the default? RichardW57 (talk) 12:36, 5 October 2018 (UTC)
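The three proposed values of abls_mod could be handled by one small helper. The sketch below is hypothetical: `merge_forms` is an invented name, and the three regular ablative singular forms are placeholders chosen for illustration, not checked against the module's tables or their display order.

```lua
-- Hypothetical helper: combine the regular forms with explicitly supplied ones.
-- mode is "before", "after" or "replace" (the default in the proposal above).
local function merge_forms(regular, extra, mode)
    mode = mode or "replace"
    if #extra == 0 then
        -- Nothing supplied: 'replace' only takes effect if a replacement exists.
        return regular
    end
    if mode == "replace" then
        return extra
    end
    local first, second = regular, extra
    if mode == "before" then
        first, second = extra, regular
    elseif mode ~= "after" then
        error("unknown mode: " .. tostring(mode))
    end
    local merged = {}
    for _, form in ipairs(first) do merged[#merged + 1] = form end
    for _, form in ipairs(second) do merged[#merged + 1] = form end
    return merged
end

-- Placeholder regular forms, for illustration only.
local regular = { "vaggasmā", "vaggamhā", "vaggā" }
local result = merge_forms(regular, { "vaggaso" }, "after")
print(table.concat(result, ", "))
```

Changing the default is then a one-line change to the `mode = mode or ...` fallback, so the open question about which default to use does not affect the rest of the logic.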
Parameters like novoc would naturally be treated as boolean parameters. Is novoc=y sufficiently clear? An explicit double negative such as novoc=n would not normally be called for. RichardW57 (talk) 12:36, 5 October 2018 (UTC)
@AryamanA, @Wyang, @Octahedron80, @Erutuon: I've now got a candidate new version:
Module: Module:User:RichardW57/pi-decl/noun
Invoking template: Template:User:RichardW57/pi-decl-noun
Examples: Module:User:RichardW57/pi-decl/noun/testcases/documentation
I intend to deploy it this week.
I've gone for the default being to append the alternative case forms added. This makes for more work for the pronouns and pronominals, but is easier for stray ablatives in -so and -to and stray pronominal case forms, and for Magadhisms in the nominative. RichardW57 (talk) 00:49, 28 October 2018 (UTC)
The changes have now been deployed. RichardW57 (talk) 00:12, 2 November 2018 (UTC)
Bug in Syncopation in Feminine Nouns in -i/-ī
The oblique singular of ratti should have both 'rattiyā' and the syncopated 'ratjā' as an alternative. We only give the syncopated form for one case, and misgenerate it as 'rattyā', which my textbooks tell me is impossible. The pattern of IDAGL singular najjā for nadī and jacca for jāti is nowhere to be seen. Perhaps we need to handle them in the invocation of pi-decl-noun - I think a flag would be simplest given that we have so many scripts. Some of the corrections may have to be deferred to the fixing of the irregular nouns. - RichardW57 (talk) 18:50, 15 September 2018 (UTC)
Correction: 'ratyā' not 'ratjā' and jaccā not jacca. RichardW57 (talk) 20:33, 2 October 2018 (UTC)
Thai-Lao Explicit /a/
The syncopated form of มะติ (mati) is coming out as มะตยา etc, which I think is wrong. I intend to correct the combination as a result of inflection so that we get มัตยา etc. I will probably also fix the 1s middle in aCve, though the benefit of the latter will be pretty low. I plan to release the changes along with input-enhancements.
- Belated signature --RichardW57m (talk) 09:55, 15 May 2023 (UTC)
Consonant Adjustments
@Octahedron80: Consonant adjustments have no place in *this* module. In particular:
term = gsub(term, "([ᨷ-ᨾ])᩠[ᨷᨸ]", "%1ᩛ")
is wrong. It appears to be a mistake for
term = gsub(term, "([ᨷ-ᨾ])᩠ᨻ", "%1ᩛ")
However, that would break the declension of ᩈᨾ᩠ᨻᩩᨴ᩠ᨵ! (That declension already has problems with round AA v. tall AA - I suspect round AA never actually occurs in it.)
The place of such adjustments is in transliteration. For declension, the stem is a given, not something to be calculated.
With rare exceptions - perhaps better captured by another mechanism - the only spelling adjustments should be those at the juncture of stem and affix. - RichardW57 (talk) 01:20, 30 September 2018 (UTC)
Nouns in -as and -an
Added them. As with nouns in -ar, the table is defined for Latin, and then converted to the other scripts on the fly. We may have to look for evidence of the inflections of nouns in -as - manas may be a bad model, as it is reported to be masculine in some occurrences, and, according to Duroiselle, it is only *used* in the singular, although the grammarians *mention* it in the plural. Duroiselle reports that there is, in some sense, an occasional plural manāni. RichardW57 (talk) 19:52, 3 October 2018 (UTC)
Supplementary Plane Problem - ustring bug?
@AryamanA, @Octahedron80 + @Wyang: The Unicode pattern handling in Lua seems to have bugs in handling characters in the supplementary planes. It may just be in handling patterns anchored at the end of the string. Does anyone know anything about such a bug? I think I'm not the first to have met it - the code already showed special handling for Brahmi, which I now think is not for readability. RichardW57 (talk) 20:01, 3 October 2018 (UTC)
- I already know that Lua ustring/regex is only usable for the Basic plane. Planes higher than that must be handled with surrogate codes, which will be harder to write and debug. --Octahedron80 (talk) 02:06, 4 October 2018 (UTC)
- @Octahedron80: Is there a bug report on it? I couldn't find one in Phabricator. I don't think it is working with surrogate codepoints; the misbehaving call also failed for Khmer and Sinhalese, apparently because of the presence of Brahmi in the same string. I think it is working in UTF-8, and someone forgot that a codepoint could be 4 bytes long. Fortunately, substrings are working; declension relies on that, and declension seems to be working for Brahmi. Perhaps we should recast the code to work by indexing a table or tables rather than by potentially slow regexes. RichardW57 (talk) 08:00, 4 October 2018 (UTC)
The actual problem is regex range beyond U+FFFF. For example, we cannot use [𑀓-𑀗], Lua will interpret it into surrogates [{D804}{DC13}-{D804}{DC17}] that fail the logic. (But we could adapt it into {D804}[{DC13}-{DC17}].) Other functions still work as well as binary code matches. --Octahedron80 (talk) 08:59, 4 October 2018 (UTC)
- I rewrote detectEnding not to use regex. --Octahedron80 (talk) 03:32, 5 October 2018 (UTC)
- I doubt the ustring library converts non-BMP characters into surrogates. It does translate Lua patterns into PHP regex (see UstringLibrary.php), so maybe there's a bug there. (I've found a bug myself.) But several of the script patterns in Module:scripts/data contain non-BMP characters, and script recognition, which uses the ustring library, works fine. — Eru·tuon 04:22, 5 October 2018 (UTC)
- Perhaps PHP is not updated to know all valid newest Unicode 11 scripts, making native regex drop/adapt some characters. I do not rely on it when dealing with non-BMP. --Octahedron80 (talk) 04:33, 5 October 2018 (UTC)
- If that happens, it's a bug. But I think the only place where ustring patterns use Unicode data is in character classes like %a, %s. Otherwise everything is based on codepoints. We should be able to match unassigned codepoints and codepoints that have been newly assigned in Unicode 11. — Eru·tuon 04:57, 5 October 2018 (UTC)
- If I look at the pattern that doesn't work, "[ᩈសස𑀲][ᨠ᩺ᨠ᩼គ៑ක්𑀓𑁆]$", it contains several full letter–combining character sequences, ᨠ᩺, ᨠ᩼, គ៑, ක්, 𑀓𑁆. I guess you want the set to match a full character followed by a combining character, but what actually happens is the set matches one of the codepoints in the set; that is, one of the full characters or one of the combining characters. To match one of several sequences, you need alternation (|) as in regex, but Lua patterns don't have that. But you could match one of the full characters followed by one of the combining characters thus: "[ᩈសස𑀲][ᨠᨠគක𑀓][᩺᩼៑්𑁆]$". (It would be nice if the ustring library had a grapheme-matching option – then you could put those grapheme clusters in a set and it would treat them correctly!) — Eru·tuon 04:43, 5 October 2018 (UTC)
- The k's were dummies that would be wiped out by the function "sc" he wrote. The main objective was to match two characters (letter & mark) at a time but he did not know how to write regex without pipe. (It was gonna be a lot of "or" conditions.) However, I already have a new approach that is easier to edit. --Octahedron80 (talk) 04:58, 5 October 2018 (UTC)
- In this case, I did want the combinations of s + mark, so at present a grapheme-matching option would work in this case. However, Indic grapheme clusters are currently not safe for Indic scripts; there's a Gangetic push to make complete aksharas into grapheme clusters. I had a different approach with tables in mind - the matching sequences would have been the keys, and the declension ID would have been the value. I think my method would have been quicker, but @Octahedron80's is prettier. However, note that Brahmi SIGNs AA, E and VIRAMA can be practically indistinguishable without a base consonant; Firefox is rendering the plain mark without a dotted circle. RichardW57 (talk) 10:46, 5 October 2018 (UTC)
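The table-based alternative mentioned in this thread (matching sequences as keys, declension ID as value) might look roughly like the sketch below. It is an assumption-laden illustration, not the rewritten detectEnding: the table contents are taken from the "s + mark" sequences implied by the pattern under discussion, and `ENDINGS` and `detect_ending` are invented names. It uses only plain byte-wise suffix comparison, which is safe for UTF-8 because a well-formed key can only match at a character boundary, so neither character sets nor the ustring pattern machinery are involved.

```lua
-- Hypothetical lookup table: key = final sequence of the stem (base letter plus
-- mark where the script needs one), value = declension ID.
local ENDINGS = {
    ["as"] = "as",   -- Latin
    ["ᩈ᩺"] = "as",   -- Tai Tham, first mark from the pattern above
    ["ᩈ᩼"] = "as",   -- Tai Tham, second mark from the pattern above
    ["ស៑"] = "as",   -- Khmer
    ["ස්"] = "as",   -- Sinhala
    ["𑀲𑁆"] = "as",   -- Brahmi (supplementary plane, 4 bytes per codepoint)
}

-- Return the declension ID of the longest key that the term ends with, or nil.
local function detect_ending(term)
    local best_key, best_id
    for key, id in pairs(ENDINGS) do
        if #key <= #term and term:sub(-#key) == key then
            if not best_key or #key > #best_key then
                best_key, best_id = key, id
            end
        end
    end
    return best_id
end

print(detect_ending("manas"))
```

Because every key is a whole sequence, adding another script is a matter of adding table rows rather than widening a character set.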
Substantives in -nt
I've come up with a scheme adding these to the regular declensions. It is not as obvious as I would like, so I am providing an opportunity to object to it before I start using it. I am still coding it in my sandbox.
There are three major patterns of declension:
1) Adjectives in -mant and -vant. The key distinguishing feature is that for masculine forms, the nominative singulars end in -mā and -vā. I will declare these to have 'endings' 'mant' and 'vant'. They can share declension tables.
2) Participles in -ant. The key distinguishing feature is that the nominative singular ends in -aṃ or -anto for masculines and -antaṃ for neuters. I will declare these to have the ending 'ant'.
3) Participles in -ent and -ont. Their distinguishing feature is that the nominative singular can only be an a-stem form (-ento / -entaṃ / -onto / -ontaṃ according to gender and verbal stem vowel). I will declare these to have the endings 'ent' and 'ont'. They can share declension tables.
Participles (and their derived nouns) will have to be manually tagged as having ending 'ant' as opposed to 'mant' or 'vant' if the substantive stem is ambiguous. Distinguishing Thai script 'ent' and 'ont' from 'ant' will be either complex or unreliable. The issue is that some or all of the onset of the final syllable of the stem comes between the preposed vowel and the 'n'; it won't always be just a single consonant that does, though in the vast majority of cases it will be a single consonant. RichardW57 (talk) 14:12, 5 October 2018 (UTC)
I have now implemented this. RichardW57 (talk) 13:39, 6 October 2018 (UTC)
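For Latin-script stems, the classification described above can be sketched as a suffix check with a manual override. This is a hypothetical illustration rather than the implemented code: `nt_ending` is an invented name, and the sample stems are chosen only to show the ambiguity, where a participle whose stem happens to end in -vant is picked up as 'vant' unless it is tagged explicitly.

```lua
-- Hypothetical sketch for Latin-script stems only.
-- 'explicit' is a manually supplied ending, which always wins.
local function nt_ending(stem, explicit)
    if explicit then
        return explicit
    end
    -- 'mant' and 'vant' must be tried before the shorter 'ant'.
    for _, ending in ipairs({ "mant", "vant", "ent", "ont", "ant" }) do
        if stem:sub(-#ending) == ending then
            return ending
        end
    end
    return nil
end

print(nt_ending("guṇavant"))        -- adjective in -vant
print(nt_ending("gacchant"))        -- participle in -ant
print(nt_ending("jīvant"))          -- ambiguous: detected as 'vant'
print(nt_ending("jīvant", "ant"))   -- manual tag resolves the ambiguity
```

The Thai-script case described above would need more than a suffix check, since part of the final syllable's onset sits between the preposed vowel and the 'n'.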
Hi @RichardW57, I think your recent edits have introduced a Lua error on maṅgala. Can you please have a look? —Internoob 04:33, 2 November 2018 (UTC)
Sorry about that, @Internoob. I've now fixed the module code; the simpler fix would have been to delete the parameter, which is now completely redundant for most words, but there are a few other words the bug will have been affecting. RichardW57m (talk) 10:01, 2 November 2018 (UTC)
Transliteration Issues
I think there is an issue with the method used to provide transliterations. For Thai and Lao, the current standard method of invoking transliteration does not adequately identify the writing system; we solve the problem by using a specialised exported function trwo() instead of the standard tr(). Unfortunately, this causes the page to be listed in Category:Terms with redundant transliterations/pi (TBC) or Category:Terms with manual transliterations different from the automated ones/pi.
I think I had solved the problem with the following code:
if (item.tr and item.tr ~= '-') then
    en_lang = en_lang or require("Module:languages").getByCode("en")
    item.lang = en_lang -- Suppress transliteration by links.full_link
    item.term = item.term .. '#pi'
else
    item.lang = lang
end
instead of just item.lang = lang. I (as RichardW57m) simplified that code thus yesterday, 14 March, and now the 'redundant' category has 369 pages and the 'different' category has 96 pages. I will therefore experimentally revert the change and see what happens. --RichardW57 (talk) 23:21, 15 March 2023 (UTC)
- And now the numbers are down to 8 and 19. I did have a hacky solution to the problem, which I have now restored. --RichardW57 (talk) 01:04, 16 March 2023 (UTC)
There is a fourth parameter to links.full_link() that seems to do what I want. I will implement it today. Before implementing the change, the numbers of 'redundant' and 'different' transliterations were 48 and 6. --RichardW57 (talk) 07:09, 29 May 2023 (UTC)
- The numbers are unchanged, so the change appears to have been successful. --RichardW57m (talk) 11:14, 31 May 2023 (UTC)
- Today I noticed that the numbers have shot up, to 377 and 83, with most of the entries being in the Thai script. I don't know when the change happened. To be investigated! --RichardW57 (talk) 16:48, 29 June 2023 (UTC)
- It turns out that Theknightwho Benwing2 had changed the interface, replacing the fourth parameter by a field no_check_redundant_translit in the first argument. I've fixed the code, and the numbers are now going down as the pages are regenerated. --RichardW57 (talk) 17:58, 29 June 2023 (UTC)
- And the numbers are now back down to 47 and 6. (I've removed some of the redundancies, and will be removing more over the next few days.) --RichardW57 (talk) 18:12, 29 June 2023 (UTC)
Error in Long Accusative Singular in Masculines in -in/-ī
We've been generating the consonantal accusative singular in -īnaṃ, whereas it should have been -inaṃ. I intend to correct it today or tomorrow, after I have updated the tests to check that we get the same tables for masculines in -in and masculines in -ī. We seem not to have any examples of this form in our Pali quotations at Wiktionary; we do have a few genitive/dative plurals in -īnaṃ. While masculines in -ī should be phased out, they're currently useful for Mon Pali, as can be seen in ကထိန် (kathin). --RichardW57m (talk) 11:53, 31 May 2023 (UTC)
- I've now made the changes on Wiktionary. --RichardW57 (talk) 06:35, 1 June 2023 (UTC)
- The enhanced testing revealed that the work-around is also needed for alphabetic Thai and all Lao script variants. I'll get round to fixing that soon. --RichardW57 (talk) 06:35, 1 June 2023 (UTC)
- The ending recognition wasn't working for abugidic Lao; that has now been fixed. There is an as yet unresolved bug in the testing for alphabetic Thai and Lao; the work-around was not actually needed for Thai or alphabetic Lao, and now it is not needed at all for Thai and Lao scripts. --RichardW57m (talk) 09:44, 1 June 2023 (UTC)
- Bug in Module:pi-decl/noun/testcases has now been fixed - all the tests for Module:pi-decl/noun are now being passed (and honestly). --RichardW57m (talk) 10:20, 1 June 2023 (UTC)
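The kind of check described in this thread, that a masculine entered in -ī and the same noun entered in -in yield the same long accusative singular in -inaṃ, can be sketched in plain Lua. Everything here is hypothetical: the function names are invented, hatthī/hatthin is used only as an illustrative pair, and the real module builds whole tables in several scripts rather than one Latin-script form.

```lua
-- Hypothetical sketch, Latin script only.
-- Normalise a citation form in -ī to the consonantal stem in -in.
local function in_stem(citation)
    -- "ī" is two bytes in UTF-8; a literal match anchored at the end is still safe.
    -- The parentheses drop gsub's second return value (the match count).
    return (citation:gsub("ī$", "in"))
end

-- Long (consonantal) accusative singular: stem in -in plus -aṃ, giving -inaṃ.
local function long_acc_sg(citation)
    return in_stem(citation) .. "aṃ"
end

print(long_acc_sg("hatthī"))
print(long_acc_sg("hatthin"))
```

Deriving both entry styles from one normalised stem is one way to make the "same tables" property hold by construction rather than by test alone.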
@Octahedron80: You've cited a Wikimedia Pali grammar in Thai before, but I couldn't find a reference to it. Is it correct for this form, or does it mistakenly have the accusative singular with a long vowel? --RichardW57m (talk) 11:53, 31 May 2023 (UTC)
- @Octahedron80: The grammar is w:th:wikibooks:ภาษาบาลี/นามนาม, and it does have the error. --RichardW57m (talk) 10:55, 16 June 2023 (UTC)
Notifying @Apisite, Mahagaja, ВМНС, AryamanA. --RichardW57m (talk) 11:53, 31 May 2023 (UTC)
Lots of Pali textbooks out there. See this actual page 46-47 [1] However I don't know about -in. Please check modules if they contain mis-data. (I ain't familiar with case names.) I will correct accusative singular @ thwikt too. --Octahedron80 (talk) 11:35, 16 June 2023 (UTC)
- So some Thais get that accusative singular right. Masculine -in and masculine -ī are two names for the same declension. --RichardW57 (talk) 17:21, 16 June 2023 (UTC)