Croatian ancient river naming conventions
January 20, 2024 at 10:35 am
(This post was last modified: January 20, 2024 at 10:47 am by arewethereyet.)
So, as I am sure many of you know, back in 2022 I published a paper called "Etimologija Karašica" in Valpovački Godišnjak and Regionalne Studije. My paper got mentioned by Glas Slavonije. In that paper, I attempted to use basic information theory (collision entropy and birthday calculations) to recover the ancient Croatian river naming conventions.
To summarize, I think I have found a way to measure the collision entropy of different parts of the grammar, and that those entropies make it possible to calculate the p-values of certain patterns in place names.
The entropy of the syntax can be measured by taking the entropy of a spell-checker word list, such as that of Aspell, and subtracting from it the entropy of a long text in the same language. (I was measuring only the consonants and ignoring the vowels, because vowels were not important for what I was trying to calculate.) I got that, for example, the entropy of the syntax of the Croatian language is log2(14)-log2(13)=0.107 bits per symbol, that the entropy of the syntax of the English language is log2(13)-log2(11)=0.241 bits per symbol, and that the entropy of the syntax of the German language is log2(15)-log2(12)=0.3219 bits per symbol. It was rather surprising to me that the entropy of the syntax of German is larger than that of English, given that German syntax seems simpler (German relies on morphology more than English does, which somewhat simplifies the syntax), but you cannot argue with the hard data. It looks as though the collision entropy of the syntax of a language and the complexity of that syntax are not strongly correlated.
The entropy of the phonotactics of a language can, I guess, be measured by measuring the entropy of consonant pairs (with or without a vowel between them) in a spell-checker word list, then measuring the entropy of single consonants in that same word list, and then subtracting the former from twice the latter. I measured that the entropy of the phonotactics of the Croatian language is 2*log2(14)-5.992=1.623 bits per consonant pair.
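The arithmetic behind those figures can be checked with a few lines of Python. This is only a sketch: the helper name `entropy_difference` is made up for illustration, and the inputs (14, 13, 11, 15, 12 and 5.992) are simply the values quoted above, not re-measured from any word list or text.

```python
import math

def entropy_difference(wordlist_symbols: float, text_symbols: float) -> float:
    """Difference of two entropies, each given as an effective alphabet size, in bits."""
    return math.log2(wordlist_symbols) - math.log2(text_symbols)

# Syntax entropy in bits per symbol: word-list entropy minus running-text entropy.
syntax = {
    "Croatian": entropy_difference(14, 13),
    "English": entropy_difference(13, 11),
    "German": entropy_difference(15, 12),
}

# Phonotactics: twice the single-consonant entropy minus the consonant-pair entropy.
phonotactics_croatian = 2 * math.log2(14) - 5.992

for language, bits in syntax.items():
    print(f"{language}: {bits:.4f} bits per symbol")
print(f"Croatian phonotactics: {phonotactics_croatian:.3f} bits per consonant pair")
```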
That 5.992 bits per consonant pair was calculated using a mathematically dubious method involving the Shannon entropy. Back then, I didn't know that the collision entropy can be calculated simply as the negative binary logarithm of the sum of the squares of the relative frequencies of the symbols, so I was measuring it using the Monte Carlo method. Now, I have taken the entropy of the phonotactics to be the lower bound of the entropy of the phonology, which is the only entropy that matters in ancient toponyms (the entropies of the syntax and the morphology do not matter there, because the toponym was created in a foreign language). Given that the Croatian language has 26 consonants, the upper bound of the entropy of the morphology, which does not matter when dealing with ancient toponyms, can be estimated as log2(26*26)-1.623-2*0.107-5.992=1.572 bits per pair of consonants.
So, to estimate the p-value of the pattern that many names of rivers in Croatia begin with the consonants 'k' and 'r' (Karašica, Krka, Korana, Krbavica, Krapina and Kravarščica), I did two birthday calculations: the first with the simulated entropy of phonology set to 1.623 bits per consonant pair, and the second with it set to 1.623+1.572=3.195 bits per consonant pair. In both of them, I assumed that there are 100 different river names in Croatia. The first birthday calculation gave the probability of the k-r pattern occurring by chance as 1/300, and the second gave 1/17. So the p-value of the k-r pattern is somewhere between 1/300 and 1/17. Mainstream linguistics considers the k-r pattern in Croatian river names to be a coincidence, but nobody before me (as far as I know) has even attempted to calculate how much of a coincidence it would have to be (the p-value).
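Here is a minimal Python sketch of the two ingredients mentioned above: the closed-form collision entropy and a birthday-style Monte Carlo estimate. It is not the procedure from the paper, whose details are not given in this post. The function names, the uniform distribution over 26*26 consonant pairs, the threshold of 6 names and the number of trials are all assumptions chosen for illustration, so the printed probability should not be expected to match 1/300 or 1/17.

```python
import math
import random
from collections import Counter

def collision_entropy(symbols) -> float:
    """Negative binary logarithm of the sum of squared relative frequencies."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -math.log2(sum((c / total) ** 2 for c in counts.values()))

def shared_pair_probability(weights, n_names, k, trials=2000, seed=0) -> float:
    """Monte Carlo estimate of the chance that at least k of n_names
    independently drawn names fall on the same consonant pair.
    `weights` is a hypothetical distribution over consonant pairs."""
    rng = random.Random(seed)  # fixed seed so the estimate is repeatable
    categories = list(range(len(weights)))
    hits = 0
    for _ in range(trials):
        draws = rng.choices(categories, weights=weights, k=n_names)
        if max(Counter(draws).values()) >= k:
            hits += 1
    return hits / trials

# Upper bound of the morphology entropy, using the figures quoted in the post.
morphology_upper_bound = math.log2(26 * 26) - 1.623 - 2 * 0.107 - 5.992
print(f"Morphology upper bound: {morphology_upper_bound:.3f} bits per consonant pair")

# Placeholder distribution: all 26*26 consonant pairs equally likely (an assumption).
uniform_pairs = [1.0] * (26 * 26)
estimate = shared_pair_probability(uniform_pairs, n_names=100, k=6)
print(f"Estimated chance of 6 of 100 names sharing a pair: {estimate}")
```

To imitate the two scenarios from the post, `weights` would have to be replaced by a distribution whose collision entropy is lowered by 1.623 or 3.195 bits relative to the uniform case; how exactly to construct such a distribution is a modelling choice that this sketch leaves open.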
So I concluded that the simplest explanation is that the river names Karašica, Krka, Korana, Krbavica, Krapina and Kravarščica are related and all come from the Indo-European root *kjers, meaning "horse" (in Germanic languages) or "to run" (in Celtic and Italic languages). I think the Illyrian word for "flow" came from that root, and that it was *karr or *kurr, the vowel difference 'a' to 'u' perhaps being dialectal variation. Compare the attested Illyrian toponyms Mursa and Marsonia: they almost certainly come from the same root, yet they show the same vowel difference 'a' to 'u'. Furthermore, based on the historical phonology of the Croatian language and what is known about the Illyrian language (for example, that there was a suffix -issia, as in Certissia, but no suffix -ussia), I reconstructed the Illyrian name for Karašica as either *Kurrurrissia or *Kurrirrissia, and the Illyrian name for Krapina as either *Karpona or *Kurrippuppona, with a preference for *Karpona. Do those arguments sound compelling to you?
On Internet forums, I have so far received two somewhat serious objections:
1. A Reddit user called neuralbeans thinks that my experiment is flawed because it doesn't take into account the possibility that the nouns in the Croatian language have a significantly lower collision entropy than the rest of the words in the Aspell word list. If so, the upper bound of the p-value could be higher than 1/17. I don't think that's a serious flaw; I think it's a blatant ad-hoc hypothesis. What magic would make the nouns in the Croatian language have a significantly lower collision entropy than the rest of the words in the Aspell word list? I can see how that could be true in the Swahili language, where, due to the noun classes, nouns cannot start with some prefixes that verbs can, but I fail to see how it could be true in the Croatian language. Furthermore, why would the collision entropy of nouns be lower, rather than higher? I don't think the burden of proof is on me to do a more complicated experiment because of somebody's ad-hoc hypothesis.
2. A forum.hr user called DarkDivider claims that Proto-Slavic phonotactics didn't allow four consecutive syllables with yers. If true, that would invalidate my etymology that Karašica comes from Illyrian *Kurrurrissia (via a Proto-Slavic form *Kъrъrьsьja), as *Kъrъrьsьja contains four consecutive syllables with yers. But I cannot find any reliable source claiming that or claiming the opposite. It sounds like a weird claim to me, because the phonotactics of various languages usually does the opposite (requiring vowel harmony...).
If I am right about the river names, that suggests I am probably also right about Croatian toponyms suggesting that the Croatian War of Independence didn't happen, right?
Administrator Notice
Wall O'Text hidden.