Recent blog entries

Searching through morphologically analyzed texts — 8 Nov 2011

I just finished the first draft of a tool that lets me search through a text that is morphologically analyzed on the fly (via XFST). It's rough for now, but quite awesome. Eventually I'll extend it to include morphologically disambiguated analyses such as those that vislcg would provide. For now, it will help me figure some things out. It was also another chance to learn, so it was quite worth it even if something like this exists already.

Here's a quick example, though you'll probably have to take my word for it (although the genitive is visibly marked with -ood). I searched for the pattern Num N+Fem+Sg+Indef+Gen, that is, a numeral followed by a (feminine) noun in the genitive:

(waayo, Yoo'aab iyo reer binu Israa'iil oo dhammu halkaasay iska joogeen intii lix bilood ah, ilaa uu wada jaray wixii lab ahaa ee Edom joogay oo dhan;)

(For six months did Joab remain there with all Israel, until he had cut off every male in Edom:)

--

Markaasaa Axiiyaah qabsaday dharkii cusbaa oo uu qabay, oo wuxuu u kala jeexjeexay laba iyo toban meelood.

And Ahijah caught the new garment that {was} on him, and rent it {in} twelve pieces:

--

Kolkaasuu Yaaraabcaam ku yidhi, Toban meelood qaado, waayo, Rabbiga Ilaaha reer binu Israa'iil ah wuxuu leeyahay, Boqortooyada waan ka xoogayaa Sulaymaan, oo toban qabiil ayaan ku siinayaa,

And he said to Jeroboam, Take thee ten pieces: for thus saith the LORD, the God of Israel, Behold, I will rend the kingdom from the hand of Solomon, and will give ten tribes to thee:

--

Oo wakhtigii Sulaymaan reer binu Israa'iil oo dhan Yeruusaalem boqorka ugu ahaa wuxuu ahaa afartan sannadood.

And the time that Solomon reigned in Jerusalem over all Israel {was} forty years.

The results all come from the Old Testament, which has been pretty useful as far as being a huge text (and providing some out of context and humorous sentences). The translations come from another tool I wrote for searching through aligned texts.
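For the curious, here is a minimal sketch of how a tag-pattern search like this could work. It is not the tool itself: the function names and the sample analyses are made up for illustration, and it assumes each token already comes with a list of analyzer readings in the lemma+Tag+Tag format.

```python
def parse_pattern(pattern):
    """Split 'Num N+Fem+Sg+Indef+Gen' into one set of required tags per token."""
    return [set(part.split("+")) for part in pattern.split()]


def analysis_tags(analysis):
    """Tags of a reading such as 'bil+N+Fem+Sg+Indef+Gen' (the lemma is dropped)."""
    return set(analysis.split("+")[1:])


def find_matches(tokens, pattern):
    """Yield (start_index, words) wherever consecutive tokens fit the pattern.

    tokens is a list of (word, [reading, ...]) pairs. A token fits a pattern
    position if at least one of its readings carries all the required tags.
    """
    wanted = parse_pattern(pattern)
    for start in range(len(tokens) - len(wanted) + 1):
        window = tokens[start:start + len(wanted)]
        if all(any(tags <= analysis_tags(reading) for reading in readings)
               for tags, (_, readings) in zip(wanted, window)):
            yield start, [word for word, _ in window]


# Hypothetical readings, invented for this sketch (not real analyzer output).
sample_tokens = [
    ("intii", ["in+N+Fem+Sg+Def+Abs+Dist"]),
    ("lix", ["lix+Num"]),
    ("bilood", ["bil+N+Fem+Sg+Indef+Gen", "bil+N+Fem+Sg+Indef+Abs"]),
    ("ah", ["ah+V"]),
]

matches = list(find_matches(sample_tokens, "Num N+Fem+Sg+Indef+Gen"))
```

Because the analysis is ambiguous at this stage, a token counts as a hit if any one of its readings fits, which is why disambiguated input would tighten the results.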

In any case, there are some things left to do before this is "complete", but it's a good start!


Inserting tones into a toneless text — 4 Nov 2011

As part of my master's thesis, I've been working on a Somali morphological analyzer and a syntactic disambiguator. A short introduction for anyone who doesn't know what these things are: software that can tell you what the function of a word is in a sentence and, when multiple possible functions exist, chooses the one that is correct in context. In English, for instance, the word 'can' can be both an auxiliary verb and a noun, but we English speakers know which is which when we hear the word in context.

In the case of Somali (and many other languages), some forms are ambiguous in text that would not be in speech, thanks to intonational and stress information. For Somali, this means that information on the number of nouns (éy 'dog' vs. eý 'dogs') and sometimes the gender of nouns (masculine vs. feminine) is marked via tone. It is easy to imagine, then, that generating better-sounding (and grammatically sound) Somali speech from text requires knowing where the tones are in that text. This is where these analytical tools come in handy... And conveniently, tonal patterns in Somali are mostly rule-based.

Naagta laybreeriga wax ku dhex qoraysa ayaa soo socota.
'The woman who is writing in the library will come.'

After the morphological analyzer runs, we end up with input like the following:

naagta  naag+N+Fem+Sg+Def+Abs+Prox

laybreeriga laybreeri+N+Masc+Sg+Def+Abs+Prox

wax wax+N+Masc+Sg+Indef+Nom
wax wax+N+Masc+Sg+Indef+Gen
wax wax+N+Masc+Sg+Indef+Abs
wax wax+Pron+Indef+Abs

ku  +Nom+Prox
ku  ku+Adp
ku  ku+Pron+Pers+2Sg+Obj

dhex    dhex+N+Fem+Sg+Indef+Gen
dhex    dhex+N+Fem+Sg+Indef+Abs

qoraysa qor+V+Prog+3SgF+Ind+Pres+Red+Abs

ayaa    ayaa+CS+Foc/L+Subj+Null

soo soo+PP+Deic

socota  soco+V+3SgF+Ind+Pres+Red+Abs

There are a couple of analyses that need to be removed here; this disambiguation is carried out by constraint grammar. Casting out the readings that don't fit the context rewards us with the following analysis:

"<naagta>"
    "naag" N Fem Sg Def Abs Prox 
"<laybreeriga>"
    "laybreeri" N Masc Sg Def Abs Prox 
"<wax>"
    "wax" Pron Indef Abs 
"<ku>"
    "ku" Adp 
"<dhex>"
    "dhex" N Fem Sg Indef Abs 
"<qoraysa>"
    "qor" V Prog 3SgF Ind Pres Red Abs 
"<ayaa>"
    "ayaa" CS Foc/L Subj Null
"<soo>"
    "soo" PP Deic 
"<socota>"
    "soco" V 3SgF Ind Pres Red Abs

... And these disambiguated forms can then be fed back into the morphological analyzer/generator to get the proper tone marking.

naágta laybreériga wax ku dhéx qóraysá ayaa soo socotá
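The glue between the two steps is mostly string handling. Below is a rough sketch, under my own assumptions, of turning disambiguated output in the format shown above back into lemma+Tag+Tag strings that a generator could take as input. The function name is hypothetical, and the actual lookup in the generator is not shown.

```python
def cg_to_generator_input(cg_text):
    """Convert cohorts like '"<naagta>"' followed by indented readings
    into (wordform, 'lemma+Tag+Tag') pairs, one pair per reading."""
    results = []
    wordform = None
    for line in cg_text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith('"<') and stripped.endswith('>"'):
            # Cohort header: remember the surface form.
            wordform = stripped[2:-2]
        elif stripped.startswith('"'):
            # Reading line: quoted lemma followed by space-separated tags.
            lemma, _, tags = stripped[1:].partition('"')
            results.append((wordform, "+".join([lemma] + tags.split())))
    return results


sample = '''"<naagta>"
    "naag" N Fem Sg Def Abs Prox
"<wax>"
    "wax" Pron Indef Abs
'''

pairs = cg_to_generator_input(sample)
```

Each resulting string (for example naag+N+Fem+Sg+Def+Abs+Prox) has the same shape as the analyzer output earlier in this post, so it can be looked up in the generation direction.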

I am a little unsure of the tone marking on dhéx (and in fact, ayaa should probably have a stress-tone on it too, as well as soo), but in any case, this was all carried out automatically, and these things can be fixed. Being able to provide input like this to a text-to-speech program would result in something a little less monotonous and more pleasing to the ear.

As the analysis progresses, it would even be possible to assign places where pauses are necessary, or where the ends of certain clauses are accompanied by boundary tones. ... There are also some other relevant phonological phenomena that could be processed in this manner and included in text-to-speech input.

Now that that's out of the way, does anyone know of some nice, open-source text-to-speech software that is open for use with any language and not just the largest ones?


A data format for Somali language tools...? — 18 Jul 2011

As previously mentioned, I've been working on a morphological analyzer for Somali in my own time as a means to learn how such things are done. As part of my job, however, I've been working with several applications that are the result of this kind of work, one of which is a language-learning website for Southern Sámi, a minority language spoken in Norway.

The website takes lexical data, stored in XML format, and combines it with a morphological analyzer/generator to produce learning exercises where students can practice inflecting words in the various forms necessary to speak South Sámi properly. Some exercises are a more rote form of learning, such as being presented with dictionary forms of verbs and being told to inflect them into a certain case; others work in context, where the user is presented with a sentence and told to fill in the blank. Since South Sámi has several cases, this is a really necessary exercise. One could compare it to English, where exercises might involve filling in the necessary preposition or pronoun form (I/me/my).

The use of a morphological analyzer in an external application provides a useful opportunity to improve and extend the tools. Quite often, when working on this South Sámi web-app, we find places where the morphological analyzer has a bug preventing words from being generated, and the bug might never have been found if it weren't for the need to generate tons of word forms for a learning application. In essence, every use of the tools leads to improvements for all applications.

Now for Somali

While working on the morphological analyzer for Somali, I had been recording word meanings in the hope that something more would come of it later, but just storing them in the HFST LEXC source files was a fairly messy way of collecting lexical data. Taking a page from Giellatekno, I've decided to collect information in a more easily parseable file format, but the twist here is that I'm also going to use some of these files as a means to compile parts of the morphological analyzer. Thus, I hope to store data in one place and use it for multiple things.

Giellatekno uses XML as their file format of choice for storing lexical data. XML is great and well supported, but I thought I might go with YAML instead, as it is much more human-readable: all you have to do is mind your indents and colons, much like Python (my language of choice). One of the other nice things about YAML is that it lets you refer to other parts of the data via references, which saves a lot of time if you have data that needs to be reproduced for each entry. Also, if it turns out in the long run that YAML is really a bad idea for what I want to do, it will be fairly easy to just convert it to XML; I'll just have to find a better way of storing notes on word entries and other repeated information. And if it doesn't work so well, maybe I'll at least be able to provide some good arguments for why XML or YAML is the better fit for the applications I've got, and what other people working on other languages may want to consider.

I'm still working on my YAML format, but it's mainly based off of the data structure of Giellatekno's XML files, so much so that it would be quite easy for me to use existing Giellatekno applications with my data after a little scripting to convert.

Here's an example entry to consider:

- lemma: "iibso"
  deriv: "iibs{o}"
  <<: *HFST_V3B
  syntax:
    pos: V
    val: TV
  translations:
    - eng: "buy"
      fin: "ostaa"
      syntax: 
        deic: "soo"
    - eng: "sell"
      fin: "myydä"
      syntax:
        deic: "sii"

- lemma: "joogso"
  ... etc

The dash marks this off as part of a list. Note that quotation marks aren't necessarily required in these instances in YAML, but I've taken to using them just so that I can mark strings as separate from other types of data. The line containing <<: *HFST_V3B refers back to some preset values that are required to mark this word as belonging to a specific inflectional class, specifically for use with the morphological analyzer.

Translations are broken up into meaning groups (a list, with dashes again), and this example provides a sort of puzzle. We have essentially two separate words, sii iibso 'sell' and soo iibso 'buy'; the differentiating factor is a directional particle that says whether the action involved a transfer from the speaker or to the speaker. The lemma is the same, so are these one entry? Or is it the meaning that differs, and that is what should decide?

For now, I'm taking the viewpoint of Somali, which seems to say that soo and sii are separate items that may vary to affect meaning in predictable ways. I expect that they may not always be predictable*, and that their presence is crucial to a specific meaning, so they must be included. English and Finnish, on the other hand, just use completely separate words for these concepts, so it's easier for us to decide that these items are really two separate words. Maybe I should do it a different way; until I figure something out or decide, I'm open to suggestions. In any case, it should be simple to find all of the words that I need to update if I end up relying on a different solution to the problem.

* The predictable meanings of the words are: soo 'motion towards', sii 'motion away', which makes complete sense with iibso
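Coming back to the claim that converting to XML would be fairly easy, here is a small sketch of what that could look like. The dictionary stands in for what a YAML loader would hand back for the iibso entry (with the keys merged in from HFST_V3B left out), and the element and attribute names are invented for illustration; they are not Giellatekno's actual schema.

```python
import xml.etree.ElementTree as ET

# Stand-in for the loaded YAML entry; the merged-in HFST_V3B keys are omitted.
entry = {
    "lemma": "iibso",
    "deriv": "iibs{o}",
    "syntax": {"pos": "V", "val": "TV"},
    "translations": [
        {"eng": "buy", "fin": "ostaa", "syntax": {"deic": "soo"}},
        {"eng": "sell", "fin": "myydä", "syntax": {"deic": "sii"}},
    ],
}


def entry_to_xml(entry):
    """Build an <entry> element from one lexicon entry (hypothetical layout)."""
    root = ET.Element("entry")
    lemma = ET.SubElement(root, "lemma", dict(entry.get("syntax", {})))
    lemma.text = entry["lemma"]
    for group in entry["translations"]:
        # One <mg> per meaning group; its syntax notes become attributes.
        mg = ET.SubElement(root, "mg", dict(group.get("syntax", {})))
        for lang, word in group.items():
            if lang == "syntax":
                continue
            t = ET.SubElement(mg, "t", {"lang": lang})
            t.text = word
    return root


xml_root = entry_to_xml(entry)
xml_text = ET.tostring(xml_root, encoding="unicode")
```

Keeping the directional particle as an attribute on the meaning group mirrors the choice above: one lemma, with soo and sii recorded per meaning rather than as separate entries.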

To dictionaries

The goal though is that all of this information is stored in one place. I want to produce separate applications, and potentially more down the line, but if I change one word translation or spelling, I want to make sure that I have the opportunity to reflect these changes in all places.

One of the main human-interest goals here is to produce a dictionary, available in several languages and Somali, which contains as much useful information about words as is relevant to learners, as well as the bare minimum necessary for native speakers. However, another way to look at it is that multilingual dictionaries are really always for learners, whether you're using the Somali->English side or the English->Somali side. I may well decide that I want to find ways to include example sentences from corpora for English words as well... I tend to find these things useful in languages I learn too, and am constantly googling words to see how they work in context.

Further down the line

- lemma: "boosto"
  deriv: "boost{o}"
  <<: *HFST_D6_F
  translations:
    - eng: "post office"
  semantics:
    - "PLACE"
    - "BUILDING"

One of the goals with documenting words like this is to include semantic information about the words in question, because down the line this will help produce other tools, such as machine translators; or perhaps even more learning applications such as Oahpa. Oahpa uses categories like these to construct fill-in-the-blank sentences, and machine translation applications may need these semantic categories for real grammatical uses: for example, verbs expressing states and emotions may have syntactic patterns from other words (they certainly do in Finnish and Guaraní, and many other languages). I may attempt to absorb some semantic classes from existing word banks to get myself started, but if anyone has some suggestions for this, I'd be happy to hear.

Having a fairly well-working morphological analyzer also means that there will be several possible analyses for many words. If I want to move down the line to syntactic analysis, which is the next stage in preparing for machine translation or even grammar correction (think of a word processor that corrects you if your verbs don't agree with the subject), the next step will be syntactic disambiguation. As I'm familiar with Constraint Grammar, this is the path I'll likely take.

Big undertaking

Things like this aren't easy, but the hope is that with a little additional planning from the beginning, I can at least make the difficult part of the work collecting the data, and not entering it or including it in all of the likely applications. It would also be easy to extend the existing data if I find new pieces of information I want to collect for each word, and do so hopefully without disrupting existing applications.

I always welcome ideas, so drop a comment or an email. Or if you have texts or information you want to contribute, I'm quite happy with that too! Eventually I'll try to find a good way to get more direct contributions from others. I'd like to open the source, but only once I've got it fairly cleaned up, and there is a small amount of data I'm using for research on my master's project that I want to make sure stays where it needs to be, undisturbed. Opening things up may be a little while down the road, but until then, I'm working on making some of the fruits of this labor available to the public in the form of an online dictionary. It's already up, but I want to do a little more cleanup before I broadcast its location and claim it to be a fully functioning product. So, watch this space...


Somali morphological analysis progress report — 9 Jul 2011

Over the past several months, I've been working on a Somali morphological analyzer. It's rule based, and built with HFST, so it takes a little bit of work to extend, but it runs quite quickly and smoothly. Following are some examples for varying forms of the word baabuur 'truck'.

baabuur
baabuur baabuur+N+Masc+Indef+Sg+Nom
baabuur baabuur+N+Masc+Indef+Sg+Abs
baabuur baabuur+N+Masc+Indef+Sg+Gen

baabuurro
baabuurro       baabuur+N+Fem+Pl+Indef*

baabuurradii
baabuurradii    baabuur+N+Fem+Def+Pl+Nom+Dist
baabuurradii    baabuur+N+Fem+Def+Pl+Abs+Dist

*Note: Somali has gender polarity for some words, which alternate between Fem. and Masc. in Sg. and Pl.

It can of course be turned around to generate word forms too, if you just input the analysis. This is one of the first stages of rule-based machine translation or text tagging, or what have you, for Somali. Once I'm far enough along with the analyzer, and have gone through and worked out the kinks, I'll probably work on disambiguating multiple analyses.

Along the way, I've been compiling a corpus of news articles with which to test the coverage of my analyzer and help extend it... One of the ways I'm working to extend the analyzer with words now is by providing a means to automatically guess which inflectional type a word belongs to. I'm happy to say this is on the way too, but not quite there yet. Either way, the plan is to dump a list of words into the (Python) program, and extend the analyzer with those that pass with flying colors. Of course, there aren't many word categories in the program yet, but I'm fairly confident that I can get decent results.

Following is an example. Each word goes through a list of simple tests, and tests are assessed by count of forms fitting into some phonetic category contained in the word categories.

aalad, aalado, aaladda, aaladdu, aaladdii, aaladaha, aaladuhu, aaladihii
  D1F: 8/8  <--
  D1M: 5/8 
  D2M: 4/8 
  D2F: 4/8 
--
geed, geedka, geedku, geedkii, geedo, geedaha, geedihii, geeduhu
  D1F: 5/8 
  D1M: 8/8  <--
  D2M: 7/8 
  D2F: 1/8 
--
baabuur, baabuurka, baabuurku, baabuurro, baabuurrada, baabuurradii
  D1F: 2/6 
  D1M: 4/6 
  D2M: 6/6  <--
  D2F: 4/6 
--
magac, magaca, magucu, magicii, magacyo, magacyada, magacyadii
  D1F: 2/7 
  D1M: 5/7 
  D2M: 7/7  <--
  D2F: 4/7 
--
subax, subaxda, subaxdu, subaxdii, subaxyo, subaxyada, subaxyadii
  D1F: 5/7 
  D1M: 2/7 
  D2M: 4/7 
  D2F: 7/7  <--
--
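To give a rough idea of the scoring, here is a simplified sketch. It is not the actual program: it uses plain suffix matching rather than phonetic categories, it assumes the first form in each list is the bare stem, and the ending lists are simply read off the aalad and geed paradigms above, so only two classes are covered.

```python
# Hypothetical ending sets, read off the two sample paradigms in this post.
CLASS_ENDINGS = {
    "D1F": ["", "o", "da", "du", "dii", "aha", "uhu", "ihii"],
    "D1M": ["", "o", "ka", "ku", "kii", "aha", "uhu", "ihii"],
}


def score_classes(forms):
    """Count, per class, how many attested forms equal stem + a class ending."""
    stem = forms[0]  # assumption: the first form is the uninflected stem
    scores = {}
    for name, endings in CLASS_ENDINGS.items():
        expected = {stem + ending for ending in endings}
        scores[name] = sum(1 for form in forms if form in expected)
    return scores


def guess_class(forms):
    """Return the best class only if every form fits it, otherwise None."""
    scores = score_classes(forms)
    best = max(scores, key=scores.get)
    return best if scores[best] == len(forms) else None


aalad = ["aalad", "aalado", "aaladda", "aaladdu", "aaladdii",
         "aaladaha", "aaladuhu", "aaladihii"]
geed = ["geed", "geedka", "geedku", "geedkii", "geedo",
        "geedaha", "geedihii", "geeduhu"]

aalad_guess = guess_class(aalad)  # "D1F"
geed_guess = guess_class(geed)    # "D1M"
```

Requiring a perfect score before accepting a guess is one way to read "pass with flying colors": anything less is left for manual checking rather than added to the analyzer.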

And of course, I'm planning on making the source available for these programs as I clean up the source, remove my notes, and provide more useful documentation...


Qglic recommendations — 28 Jun 2011

Every once in a while I get a Qglic bug in my system, and I start tweeting or Facebook-statusing in it. Maybe you know the feeling, maybe you don't.

Qglic (as previously blogged about) is an alternative orthography for English, which has a goal of providing a more phonemic writing system (for non-linguists, a more one-to-one correspondence of sounds to letters); but the trick to it is to do this without using any "special" characters, and sticking to A-Z.

For its part, Qglic is quite good at this, despite the many vowel and consonant sounds in English. General American English distinguishes between 24 separate consonants (maybe 25, if you include the 'wh' in which) and 14 vowels. Totaling those numbers, you can see that English has more contrastive sounds than it has letters to write them. Some languages, on the other hand, can easily get away with one letter for each contrastive sound without running out of letters in the alphabet.

One of the problems, raised to me on Twitter, was that Qglic works great, but only if you have the caught-cot merger. As it turns out, this is probably something that the creator of Qglic wasn't quite concerned about, perhaps because he doesn't have this distinction-- I think he might even be from western Canada, maybe that explains it? In any case, some words listed by Wikipedia as having separate vowels are the following:

bobble bauble bqbl
bock balk bqk
body bawdy bqdi
bot bought bqt
collar caller kqlr
chock chalk tcqk
hottie haughty hqti
odd awed qd
stock stalk stqk
tock talk tqk
wok walk wqk

For a good number of Americans, these sound the same, and in Qglic they would be written the same as well. Qglic instead uses <o> for the sound in 'smoke', /oʊ/. Following are the Qglic monophthongs, and some sample words.

Eng. Qgl.
tack tak
tech tek
tick tik
took tjk
toke tok
talk tqk
tuck tuk
teak tyk
Turk trk
tuque twk

For the more visual linguist, here's a crappy ASCII IPA vowel chart containing Qglic monophthongs and an attempt at IPA approximation of the sounds that they represent:

  y  i           j   w              i   ɪ                      ʊ   u
    e      u       o                   e/ɛ       ʌ/ə         oʊ/ɔ
           r                                     ɜ˞/ɚ
      a          q                        æ                ɑ/ɔ

Qglic also provides some diphthongs:

take     teyk
tyke     tuyk
boy     boy
sound     saond

Demerging

But, how might this work if we had to throw in one more distinction, that is, talk/tock? As it turns out, the solution is already hiding in Qglic itself with <e> and <ey>. Merely making sure to represent the vowel sound in 'toke' as a diphthong always would free up one symbol for use with the word tock, and in addition it makes the vowel system a bit more consistent.

take     teyk
tech     tek
toke     touk
tock     tok

And again, a table:

y                   w               i                   u
   i      u       j                    ɪ     ə/ʌ      ʊ
ey        r        ou               eɪ      ɜ˞/ɚ       oʊ
    e            o                      ɛ            ɔ
     a         q                         æ         ɑ

This of course, covers 12 of the vowels of the 14-vowel system of General American, however it merges the proposed distinction between /ǝ/ and /ʌ/ , and the rhotic equivalents of these.

Consonantal fun

One of the additional things that has kind of bothered me about Qglic is the use of the apostrophe for the sound /ŋ/ in 'sing'. One option I've come across, aside from writing it with is to use a number-- it might give Qglic a cool flavor, but of course, I see how it sort of goes beyond the goal of using A-Z: sin, si9.

Other challenges?

Maybe you know of some other potential distinctions in English that this should cover; if so, feel free to leave a comment. In any case, I've tested this kind of vowel system using RP and my own U.S.-ish English and found it to be quite suitable. For these texts I just found an RP transcription and retranscribed it, and then found the text which was apparently more acceptable for Americans (you'll notice some lexical differences) and transcribed my own speech.

RP

Xu nox wind un xu sun wu dispywti9 witc wuz xu strqngu, wen u travlu keim ulq9 rapt in u wom klujk. Xei ugryd xut xu wun hw fws suksydid in meiki9 xu travlu teik iz klujk qf cjb by kunsidud strqngu xun xy uxu. Xen xu nox wind blw uz hqd uz y kjd but xu mor y blw xu mo klujsly did xu travlu fujld hiz klujk urajnd him un ut lqs xu nox wind geiv up xy utemp xen xu sun cqn ajt womly un imydcutly xu travlu tjk qf iz klujk, un suj xu nox wind wuz ublqidc tu kunfes xut xu sun wuz xu strqngr uv xu tw.

'mercan

Xu norx wind an xu sun wr argywi9 wun dey ubeut witc uv xem wuz strqngr, wen u travlr keym ulong rapt up in an euvrkeut. Xey ugryd xat xu wun hw kjd meyk xu travlr teyk hiz keut qf wjd by kunsidrd strqngr xan xu uxr wun. Xen xu norx wind bliw az hqrd az hy kjd, but xu hqrdr hy bliw, xu tuydr xu travlr rapt hiz keut uraond him; an at last xu norx wind geyv up truying. Xen xu sun bigan tu cayn hqt, an ruyt uwey xu travlr tjk hiz keut qf. And seu xu norx wind had tw admit xat xu sun wuz strqngr xan hy wuz.


Ðer, they're/their/there — 26 Mar 2011

I realize people like to spell things with a mind to correctness, because having standards (or things near enough to standards) helps communication. However, one of the things I hear often is that things like they're/their/there need to be spelled correctly because otherwise it changes the meaning of sentences. There may be a couple rare cases where you can insert one of these items into the place of another and change the meaning, but they seem to be fairly seldom.

One of the points I like to press often, when people say that spelling things wrong makes the writer seem stupid, is that readers are more flexible than we seem to expect. One of the hallmarks of a functional literate person is being able to interpret written language through spelling errors, typos, or bad typesetting, or even when sentences have been cut and respelled severely to fit within the constraints of an SMS. Might an inability to interpret variation be just as big a problem as an inability (or lack of desire) to spell "properly"?

To illustrate the point with these words, following are 15 sentences. I've replaced all instances of they're, their and there with ðer, since in my dialect these are all pronounced the same. If I (and others like me) manage to understand each other in speech, we obviously should be able to understand each other in writing with these forms spelled the same.

The sentences are randomly selected from COCA. As there are fifteen, you should find at least 5 of each of the three forms, though some sentences have several. The idea was to collect at least 5 search results containing each of the separate forms. Some are fragments, some are full sentences, and some have been trimmed to remove things that were generally tough to read. The hope with sentence fragments is that they will illustrate even more clearly that we can still succeed in understanding even in compromised situations.
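If you'd like to try this on your own text, the replacement is mechanical. The snippet below is just one way to do it, with a function name of my own choosing; it is not necessarily how the sentences here were prepared.

```python
import re

# Whole-word match for the three spellings, with a straight or curly apostrophe.
THER_PATTERN = re.compile(r"\b(?:they['’]re|their|there)\b", re.IGNORECASE)


def merge_ther(text):
    """Spell they're, their and there identically as 'ðer'."""
    return THER_PATTERN.sub("ðer", text)


example = merge_ther("They're not the only people borrowing. There's a stimulus plan in Europe.")
```

The word boundaries keep words like 'therefore' and 'theirs' untouched, while "there's" comes out as "ðer's" because the apostrophe ends the match.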

Enjoy!

... many people will still order a cheeseburger, soda and fries, ðer going to focus on that as well, Heidi, really important ...

... the professional services and social interaction that come with cubicle life, ðer crying out for support, not to mention a little chitchat.

... ... which could make them feel ashamed of ðer bodies. But I'm not sure what would be helpful to say or do ...

Real kids don't dress like bankers and fly around in ðer daddy's private jet.

ðer not the only people borrowing. ðer's a stimulus plan in Europe. ðer's a stimulus plan in China.

O'REILLY: Right. ðer going to take from the Social Security fund- ...

(Voiceover) Pruitt Rainey never saw the home gym his dad built for him just below ðer poker room. But his dad kept the promise he made to his son in ...

... a fashionable spa, and maybe even the occasional trip to Niagara, but ðer would certainly have been no Atlantic crossing and most certainly none of the sophisticated social ...

... and he would ask them about school and do his best to stay interested in ðer long-winded, half-baked descriptions. They were lucky when it came to ðer kids,

... although the last two days have been a couple good days. ðer's a sense out ðer that...

... Brooklyn, Detroit and London don't have to share the microphone here with ðer male counterparts. Nor does Peled upstage her subjects. She only adds a few ...

... my God! ðer going to show my hips, ðer going to...' And we brought three examples along to show. The one ...

... moderate jolt. Then ðer's a bit of a sway once you're hanging ðer. When you're actually on the water, ðer's a firmness that did ...

... ain't true. Remember I told you that. ðer only stories. ðer was more: fruits that we don't have no more because they came from

... who is coming for her very first reunion, just as Dorothy is. Though ðer the similarity ends, thank you very much.


Finnish Grammar Exercises — 22 Feb 2011

As a test to see how easy it would be to implement Oahpa for a new language, I decided to do so with already existing morphological and syntactic analysis tools and data for Finnish. I hope to spend a little time improving it in the next few weeks, 'cause it'll certainly benefit some Finnish learners out there. :)

Try it here: http://finoahpa.donchaknow.com/oahpa/

Oahpa is a collection of language learning games, which range from inflecting words to vocabulary building, to using word inflections in the context of sentences. For languages like Finnish this is particularly important, because words are inflected in specific ways for certain types of sentences, which takes learners a while to grasp. For instance, the case of the object of a verb may vary depending on what the verb is: pitää plus partitive means 'hold', while pitää plus elative means 'like'. Similarly, if you say "I feel happy", the form that 'happy' takes differs from the one used in a sentence like "I am happy."

The original Oahpa games are available in Northern Sámi and (coming soon) Southern Sámi, and they, like the Finnish test version, are based on morphological analysis tools which can generate all word forms given a specific word; as well as syntactic analysis tools which (generally) will mark words as subjects or objects, but also provide much more detailed information such as what verbs agree with. These analysis tools are then used to analyze sentences that learners type to give feedback on common issues, such as verb agreement, case usage and so on.

Although the Finnish Oahpa has only two exercises, I'll post some updates if I carry over more games from Northern Sámi Oahpa. In the meantime, happy inflecting!


Translation to Qglic with Finite-State Technology — 4 Aug 2010

Qglic (pronounced Anglish) is a near-phonemic alternative writing system for English. Being near-phonemic, the goal is to have as close to a one-to-one correspondence as possible between sounds in English and the letters used to represent them. One of the benefits of Qglic is that it attempts to do this using only the letters A through Z. You can see a small sample of it below, which is this paragraph written in Qglic.

Qglic iz ey funymik qltrnutiv ruyti'g sistum for I'glic. Byi'g funymik (or nirly so), xu gol iz tu hav ez klos tw ey wun-tu-wun koruspqnduns bitwyn saondz in I'glic and xu letrz ywzd tu reprizent xu saondz. Wun uv xu benufits ti Qglic iz xat it utemps ti dw xis ywzi'g only xu letrz A xrw Z. Yw kan sy u smol sampul uv it fqloi'g, witc iz xis perugraf but dcist ritun in Qglic.

I discovered Qglic a year or so ago, but recently remembered it and became all excited about it again. Using my newly acquired skills in various language technological applications, I spent some time putting together a simple finite-state machine based on the phonemic rules of Qglic and the CMU Pronouncing Dictionary, which is vast and contains a huge number of words (approximately 133,000). The CMU Pronouncing Dictionary contains pronunciation guides written with Arpabet, which means it's fairly easy to translate it into IPA or, in this case, Qglic.

ABSCOND  AE0 B S K AA1 N D
ABSCONDED  AE0 B S K AA1 N D AH0 D
ABSCONDING  AE0 B S K AA1 N D IH0 NG 
ABSCONDS  AE0 B S K AA1 N D Z
ABSECON  AE1 B S AH0 K AO0 N
ABSENCE  AE1 B S AH0 N S
ABSENCES  AE1 B S AH0 N S IH0 Z
ABSENT  AE1 B S AH0 N T
ABSENTEE  AE2 B S AH0 N T IY1
ABSENTEEISM  AE2 B S AH0 N T IY1 IH0 Z AH0 M
ABSENTEES  AE2 B S AH0 N T IY1 Z

Taking this data, I wrote a short Python script (I'll upload it somewhere at some point soon) to translate the pronunciation guides into Qglic, and then write them out in a source format that can be compiled with the Helsinki Finite State Transducer Technology (HFST):

    abscond:abskqnd         ennd ;
    absconded:abskqndud             ennd ;
    absconding:abskqndi'g               ennd ;
    absconds:abskqndz               ennd ;
    absecon:absukon         ennd ;
    absence:absuns          ennd ;
    absences:absunsiz               ennd ;

It's a very simple finite-state machine, as far as the amount of effort put into producing it goes. It consists of just a huge list of words in the format english:qglic, which represents the beginning and the end of a path in the machine. The result is very fast: a 385-word article from CNN on Naomi Campbell testifying before a war-crimes tribunal is converted to Qglic in just 0.143 seconds, and the whole of The Importance of Being Earnest translates in about 1.3 seconds.
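Until the real script is uploaded, here is a rough sketch of the conversion step. It is not that script: the phoneme table is partial and contains only the correspondences that can be read off the sample entries above, the function name is made up, and the continuation class is copied from the listing as-is.

```python
import re

# Partial Arpabet-to-Qglic table, inferred only from the sample entries above.
ARPABET_TO_QGLIC = {
    "AE": "a", "AA": "q", "AH": "u", "AO": "o", "IH": "i",
    "B": "b", "D": "d", "K": "k", "N": "n", "S": "s", "Z": "z",
    "NG": "'g",
}


def to_lexc_line(cmu_line, continuation="ennd"):
    """Turn 'ABSCOND  AE0 B S K AA1 N D' into 'abscond:abskqnd<TAB>ennd ;'.

    Stress digits are stripped before lookup. A phoneme missing from the
    partial table raises KeyError instead of being guessed.
    """
    word, *phones = cmu_line.split()
    qglic = "".join(ARPABET_TO_QGLIC[re.sub(r"\d", "", p)] for p in phones)
    return f"{word.lower()}:{qglic}\t{continuation} ;"


lines = [to_lexc_line(entry) for entry in (
    "ABSCOND  AE0 B S K AA1 N D",
    "ABSCONDING  AE0 B S K AA1 N D IH0 NG",
    "ABSENCES  AE1 B S AH0 N S IH0 Z",
)]
```

Stress marks are simply thrown away in this sketch, which is one reason pairs that differ only in stress end up sharing a spelling.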

There are still some issues to work out, such as how I tokenize text: punctuation handling isn't perfect, which results in more words not being translated... However, since I'm using the CMU database, there are very few words that don't make it through, and if they don't, it's most likely the result of a tokenization error.

Another problem is that words which are spelled the same but pronounced differently are not handled ideally yet (the first entry is always used), which results in funny spellings when a word is both a noun and a verb ('The farmers prodúce próduce') and one is used in place of the other ('*The farmers próduce prodúce.'). Problems like these could be solved with a few more hours of work implementing already existing technologies to disambiguate between the two words based on sentence-sized contexts. If I get a little more time to work on this, maybe I'll iron those problems out and put some of the larger "translated" texts up online.

In the meantime, enjoy a couple of paragraphs of Naomi Campbell's court case, which has been cleaned up for the punctuation issues that I still need to fix. Looking through it otherwise, I see there is at least one other issue. See if you can spot it, or find more! ;)

Neyomy Kambul wil testufuy in wor kruymz truyl xrzdey

(cnn) -- Ey dcudc in xu wor kruymz truyl uv formr Luybiryun prezidunt Tcqrlz Teylr haz disuydid xat swprmqdul Neyomy Kambulz testumony in xu keys wil go uhed xrzdey.

Xu specul kort uv Syeru Lyon kunfrmd ti syenen wenzdey xat kambul wil teyk xu stand at xu trubywnul, dispuyt an imrdcunsy mocun xu difens fuyld mundey ti diley hr testumony.

Prqsikywtrz sey Teylr geyv Kambul ey duymund dri'g xu wor in Syeru Lyon, kqntrudikti'g Teylrz testumony xat hy nevr handuld xu precus stonz xat fywuld xu kunflikt.


Some refudiations and thoughts about English and language prejudice — 19 Jul 2010

A lot of noise has been made recently over a seemingly innocuous tweet from Sarah Palin, in which she defends her neologisms by comparing herself to Shakespeare, one of the English language's best-known coiners of words. Responses have varied, but may be summed up as follows:

  • LOL#1: Sarah is comparing herself to the bard!
  • LOL#2: 'Refute' is a misused word, thus the formation of 'refudiate' is wrong.
  • LOL#3: (care of Roger Ebert, and others): Shakespeare's coinings are okay, whereas Palin's coinings are not okay.

Roger Ebert later used Sarah's own neologism in a tongue-in-cheek tweet to tease her. The humor in this for me is that, in criticizing Palin, Ebert unwittingly admitted that her neologism is acceptable: after all, he managed to use it in a sentence where the word could be understood.

Either way, these criticisms all seem to stem either from people's low regard for Palin's intelligence or from the idea that language change is bad or wrong. English is a living language, and although it makes me feel funny that I am defending Palin on any issue, this is something she is completely right about, and something I would gladly stake my linguistics degree (and degree in progress) on. Language changes in numerous ways. Words can:

  • ... Gain new meanings: gay used to mean just 'happy', and now it means, well, 'gay'; tweet used to be a noise that only birds could make, and now it is something humans do as well.
  • ... Be used in new contexts (contact used to only be a noun, but now may be used as a verb: I'll contact you tomorrow.)
  • ... Be derived from other words through a variety of phenomena. One of these (portmanteauization) is thoroughly accepted by those who consider themselves educated in English: it turns bro(ther) and (ro)mance into bromance.
  • ... Change in pronunciation: the modern English ask used to be acsian in Old English (yes, that's right, we used to say aks).
  • ... Change in social usage: ain't has its roots in 1600s England, but is now found in many regions of the U.S., in both rural and urban settings (yet the usage of ain't is often attributed to a lack of education, not to dialect).

As a brief aside before getting on to the point: I find it useful to cite the historical roots of things like ask and ain't, because there are people out there who will strongly refudiate the usage of any word if it doesn't seem to be old enough. As if history means everything... But, back to the point.

One of the major problems with the backlash is this: if people will happily adopt new words (such as tweet, to use an apt example), why won't they accept refudiate? If the claim is that refudiate is derived from the wrong usage of refute, then why is it okay to take the word tweet, meaning roughly 'a noise birds make', and use it to describe the sending of a short 140-character message on twitter.com, and further to derive new words from it, such as tweeps 'people who use the aforementioned site'? That said, I know there are people who reject even these neologisms, but the main thing I find fault with is that those who accept that Palin made this message on Twitter do not accept the words she is using, despite the fact that the two cases are essentially the same.

What this all boils down to is a language prejudice, or for short: a prejudice. In this case, it is Palin's perceived intelligence which guides decisions on whether the new words are right or wrong. If people accept neologisms in one place and do not accept them in others, there must be a reason for this, and right now this is the only one I can come up with that seems valid to me as a linguist. Since I know that words change in meaning over time, I see no problem with 'refute' taking on a new meaning. Since I know that new words are derived from existing words with the use of suffixes and prefixes (or even parts of other words, as with facestalker), I do not see 'misrefute' as a problem either: after all, it has a clear meaning which is separate from whichever usage of 'refute' you adopt.

The problem with prejudices like these is that they are often used to fuel even worse situations, such as institutionalized racism. As I have observed in the U.S. (and many other places), people often treat those with different dialects (regional, social) or speech defects differently merely on the basis of language. This is because we use language not only to communicate ideas, but also to communicate who we are; sometimes we have a choice in the latter, and sometimes we simply can't help it. What we can help, though, is how we react to people based on their use of language. On paper, every language and dialect should be worth the same as every other; in reality, the differences in worth between dialects and languages are all in our heads. We see people use language and dialect not only to communicate and self-identify, but to discriminate.

The situation with Sarah Palin illustrates this discrimination better than any indeterminate truths about whether particular words are 'correct' ever could. Perhaps one of the great ironies of the whole situation is that the people loudest in criticising Sarah over this recent language issue are both liberal (and supposedly forward-thinking on issues of equality) and language elitists. Ah, did I just implicate a 'liberal elite'? I guess Sarah can be credited with yet one more neologism...


An imperative puzzle — 4 Jul 2010

Some of this is adapted from a comment left elsewhere on the internets for someone asking about imperatives in languages. While musing over the data in Finnish and Northern Sámi, I noticed what appears to be an interesting puzzle: 2nd person imperatives differ from the imperatives formed for all other persons, in that non-2nd person imperatives all appear to be descended from an optative mood, while 2nd person imperatives are morphologically distinct. Perhaps this is analogous to the English imperative strategy, in which the 2nd person imperative is a bare verb stem: Go!, Sleep!; while other persons are formed periphrastically: May he go, Let him sleep.

Finnish and Northern Sámi

In Finnish and closely related languages, the second person singular imperative is formed with a bare verb stem, while other persons and numbers have additional morphemes, most of which include -k- (said by some to be a historical present tense marker).

(1)     mennä 'go'; mene-n 'I go'

                sg.         pl.
        1.      --          menkäämme
        2.      mene        menkää
        3.      menköön     menkööt

The negative imperative is formed with the help of an auxiliary negative verb, älä (2), which has similar morphology.

(2)             sg.         pl.
        1.      --          älkäämme
        2.      älä         älkää
        3.      älköön      älkööt

According to Maija Länsimäki, these ko/kö morphemes are originally from the optative. While this doesn't directly say anything about the plural 1st and 2nd persons, it seems like there's a chance that they are either related by way of optative, or connected to the present marker theory (2nd person imperative of tulla was originally *tulek).

Just as interesting is what happens when the negative verb occurs with other verbs, e.g., don't go:

(3)             sg.             pl.
        1.      --              älkäämme menkö
        2.      älä mene        älkää menkö
        3.      älköön menkö    älkööt menkö

The same -ko/-kö appears on the verb. Is this a form of optative agreement, or something else? If these forms are connected, is the -ko/-kö marker found in questions (Nauroiko Mikko? 'Did Mikko laugh?') also related, or is this just a coincidence brought on by the small phoneme inventory in Finnish?

A similar pattern is found in Northern Sámi as well, but slightly extended, because NS distinguishes singular, dual and plural number (4). The paradigm is exactly the same for the negative auxiliary (5). However, NS does not have anything similar to the -ko/-kö which occurs on the main verb in negative imperatives (the main verb has a single form for all persons and numbers).

(4)     mannat 'to go'
                sg.         du.          pl.
        1.      mann-on     mann-u       mann-ot
        2.      mana        mann-i       mann-et
        3.      mann-os     mann-os-ka   mann-os-et

 
(5)     ale 'Neg'
                sg.         du.         pl.
        1.      allon       allu        allot
        2.      ale         alli        allet
        3.      allos       alloska     alloset

Here we see that the 2nd person singular offers a bare stem, and that all non-2nd person imperatives have a round vowel (o/u often alternate in NS, and in precisely this situation) which is specific to these forms only. The availability of the dual in the paradigm allows us to see that there is something about the 2nd person here that separates it from the other persons, and perhaps this is a difference of mood.
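
The observation about the round vowel can be checked mechanically against table (4). The sketch below is only a sanity check on the forms listed above: the stem segmentation follows the hyphens in the table, and the helper name is invented for the example.

```python
# Imperative forms of mannat 'to go', copied from table (4).
# Keys are (person, number).
MANNAT = {
    (1, "sg"): "mannon", (1, "du"): "mannu",    (1, "pl"): "mannot",
    (2, "sg"): "mana",   (2, "du"): "manni",    (2, "pl"): "mannet",
    (3, "sg"): "mannos", (3, "du"): "mannoska", (3, "pl"): "mannoset",
}

STEM = "mann"  # assumed segmentation, following the hyphens in table (4)

def has_round_vowel_after_stem(form, stem=STEM):
    """True if the segment right after the stem is o or u."""
    return form.startswith(stem) and form[len(stem):len(stem) + 1] in ("o", "u")

for (person, number), form in sorted(MANNAT.items()):
    print(person, number, form, has_round_vowel_after_stem(form))
```

For these nine forms, the check comes out true for every 1st and 3rd person cell and false for every 2nd person cell, which is the split described above.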

Estonian, as best I can find, also has a similar pattern in the negative imperative auxiliary, but I can't find out how the main verbs go for non-2nd person imperatives. Anyone...?

(6)     minema 'go'
                sg.        pl.
        1       --         --
        2       mine       minge
        3       --         --

 
(7)     ära 'Neg'
                sg.        pl.
        1       --         ärgem
        2       ära        ärge
        3       ärgu       ärgu

Generalities?

This at least establishes that the pattern is similar in Finnish, Northern Sámi and Estonian (and apparently English), but what does it mean? One could assume from all of this that 'true' imperatives are restricted to the 2nd person, and that other persons are expressed with other moods for semantic reasons: 2nd person imperatives are addressed directly from speaker to listener and are commands, while 1st and 3rd person imperatives may refer to someone outside of the conversation, and as such speakers can only wish for what these persons might do.

Since I haven't Googled around yet, these are only my musings. May someone reading this come forward with more knowledge!

