School's out! Woohoo! Now it's time to get working.
Since finishing exams, I've spent the last week or so working with Constraint Grammar as part of my Google Summer of Code project on machine translation from Finnish to Northern Sámi. It's enlightening and interesting and there's much to learn, and it gives me precisely the kind of puzzles that I like to solve. Constraint Grammar is a syntactic formalism developed by Fred Karlsson (the author of the first Finnish grammar book I studied, which quite possibly changed my life). Its essential goal is to disambiguate word forms that are homonymous: forms that look identical but have separate morphological analyses or separate meanings.
An example:
minä lu-i-n kaksi kirja-a
1pSg.Nom READ-Prt-Sg1 TWO BOOK-Part
'I read two books.'
This all makes perfect sense to us, because we know which words are meant; however, luin could mean "I read" or "with/by bones". Since the latter meaning is obviously not the one that we want for the sentence, Constraint Grammar provides a rule-based formalism for selecting the intended meaning based on the surrounding context. This isn't easy, of course, because one needs quite a few rules to produce a fully disambiguated sentence, and natural sentences aren't always as simple as the one given above. Following is the full analysis of each word:
"<minä>"
"minä" Pron Pers Sg Nom
"mikä" Pron Interr Sg Ess
"<luin>"
"lukea" V Act Ind Prt Sg1
"luu" N Pl Ins
"<kaksi>"
"kaksi" Num Card Sg Nom
"<kirjaa>"
"kirja" N Sg Par
"kirjata" V Act Ind Prs Sg3
"kirjata" V Ind Prs ConNeg
"kirjata" V Act Imprt Sg2
As we can see, quite a few readings need to be removed. The rules that do this are listed in CG formalism below: minä has its personal pronoun reading selected because it precedes a verb with 1st person singular marking (line 1237); luin gets its verbal reading selected (as opposed to the 'bone' reading) because it follows the pronoun (line 1645); and finally kirjaa 'book+Part' is selected because it follows a numeral (line 1187).
1187: SELECT (Par) (-1C Num) (-1 Nom)
1237: SELECT (Pron "minä") (*1 Sg1 LINK NOT *-1 CLB?) (NOT 1 CLB?)
1645: SELECT (Sg1) (-1C MINA) (-1 Nom)
2094: MAP (@SUBJ>) TARGET Nom (0 WORD LINK *1 (Act))
2109: MAP (@<OBJ) TARGET Par IF (0 WORD LINK *-1 V BARRIER S-BOUNDARY2) ;
2115: MAP (@+FMAINV) TARGET VFIN IF (NEGATE *0 VERB BARRIER S-BOUNDARY2 OR CC) ;
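To make the mechanics of a SELECT rule concrete, here is a minimal Python sketch of what rule 1187 does to the cohorts above. It is only a toy model, not how a real CG compiler works: the data layout and the helper names (`some_reading_has`, `all_readings_have`, `select_par_after_numeral`) are made up for illustration, and I'm assuming the usual reading of `-1C` as "every reading of the previous word matches" and plain `-1` as "at least one reading matches".

```python
# Each cohort is (wordform, readings); a reading is (lemma, set of tags).
# The data is copied from the analysis above.
cohorts = [
    ("minä", [("minä", {"Pron", "Pers", "Sg", "Nom"}),
              ("mikä", {"Pron", "Interr", "Sg", "Ess"})]),
    ("luin", [("lukea", {"V", "Act", "Ind", "Prt", "Sg1"}),
              ("luu", {"N", "Pl", "Ins"})]),
    ("kaksi", [("kaksi", {"Num", "Card", "Sg", "Nom"})]),
    ("kirjaa", [("kirja", {"N", "Sg", "Par"}),
                ("kirjata", {"V", "Act", "Ind", "Prs", "Sg3"}),
                ("kirjata", {"V", "Ind", "Prs", "ConNeg"}),
                ("kirjata", {"V", "Act", "Imprt", "Sg2"})]),
]


def some_reading_has(cohort, tag):
    """Plain context test: at least one reading carries the tag."""
    return any(tag in tags for _, tags in cohort[1])


def all_readings_have(cohort, tag):
    """Careful (C) context test: every reading carries the tag."""
    return all(tag in tags for _, tags in cohort[1])


def select_par_after_numeral(cohorts):
    """Toy version of: SELECT (Par) (-1C Num) (-1 Nom)"""
    result = []
    for i, (form, readings) in enumerate(cohorts):
        has_target = any("Par" in tags for _, tags in readings)
        context_ok = (
            i > 0
            and all_readings_have(cohorts[i - 1], "Num")
            and some_reading_has(cohorts[i - 1], "Nom")
        )
        if has_target and context_ok:
            # SELECT keeps the matching readings and drops the rest.
            readings = [r for r in readings if "Par" in r[1]]
        result.append((form, readings))
    return result


result = select_par_after_numeral(cohorts)
print([lemma for lemma, _ in result[3][1]])  # ['kirja']
```

Only kirjaa is touched: it is the one word with a Par reading, and the word before it is unambiguously a numeral in the nominative.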
Following this disambiguation, several tags are added for later convenience. @SUBJ> tells us that the word is the subject of the sentence, preceding the verb; @+FMAINV tells us that the word is the finite main verb; @X tells us there is more work to be done yet; and @<OBJ says that the word is an object following its verb. The tags are shortcuts for passing information along to the generation part of the translation, in which words are produced based on the analysis. The full disambiguation is next, but note that the tags and analysis may not be correct yet; I'm just pulling this from the project as-is. Lines beginning with a semicolon (;) are readings that have been dropped from the analysis.
"<minä>"
"minä" Pron Pers Sg Nom @SUBJ> SELECT:1237 MAP:2094
; "mikä" Pron Interr Sg Ess SELECT:1237
"<luin>"
"lukea" V Act Ind Prt Sg1 @+FMAINV SELECT:1645 MAP:2115
; "luu" N Pl Ins SELECT:1645
"<kaksi>"
"kaksi" Num Card Sg Nom @X MAP:2348
"<kirjaa>"
"kirja" N Sg Par @<OBJ SELECT:1187 MAP:2109
; "kirjata" V Act Ind Prs Sg3 SELECT:1187
; "kirjata" V Ind Prs ConNeg SELECT:1187
; "kirjata" V Act Imprt Sg2 SELECT:1187
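If you want to pick the surviving readings out of a trace like this, a few lines of Python will do. This is just a sketch under my own assumptions: the function name `surviving_readings` is made up, and I'm treating anything shaped like `SELECT:1237` or `MAP:2094` as a rule trace to be stripped.

```python
import re

TRACE = '''\
"<minä>"
"minä" Pron Pers Sg Nom @SUBJ> SELECT:1237 MAP:2094
; "mikä" Pron Interr Sg Ess SELECT:1237
"<luin>"
"lukea" V Act Ind Prt Sg1 @+FMAINV SELECT:1645 MAP:2115
; "luu" N Pl Ins SELECT:1645
"<kaksi>"
"kaksi" Num Card Sg Nom @X MAP:2348
"<kirjaa>"
"kirja" N Sg Par @<OBJ SELECT:1187 MAP:2109
; "kirjata" V Act Ind Prs Sg3 SELECT:1187
; "kirjata" V Ind Prs ConNeg SELECT:1187
; "kirjata" V Act Imprt Sg2 SELECT:1187
'''

# Assumed shape of a rule trace: RULETYPE:linenumber
RULE_TRACE = re.compile(r"[A-Z]+:\d+")


def surviving_readings(trace):
    """Map each wordform to its kept readings as (lemma, tags) pairs."""
    result = {}
    current = None
    for line in trace.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith('"<') and line.endswith('>"'):
            # A new cohort: "<wordform>"
            current = line[2:-2]
            result[current] = []
        elif line.startswith(";"):
            # A reading that was removed by some rule.
            continue
        elif current is not None:
            tokens = line.split()
            lemma = tokens[0].strip('"')
            tags = [t for t in tokens[1:] if not RULE_TRACE.fullmatch(t)]
            result[current].append((lemma, tags))
    return result


kept = surviving_readings(TRACE)
for form, readings in kept.items():
    print(form, readings)
```

Every word in this sentence ends up with exactly one reading, which is what "fully disambiguated" means here.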
So, there's more work to be done. As I dig further in, I may post a few recipes if tricky problems arise. I'll be running some newspaper sentences through the grammar to see what else needs to be worked out; the rules work fine for short sentences, but they may not hold up when applied to much more complex ones. As you can see in the rules listed above, there are BARRIERs involved, which limit how far a rule may search its surroundings. More of these will likely pop up as weirder sentences are tested.
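The idea behind a BARRIER can be sketched in a few lines of Python. This is a simplification under my own assumptions: `scan` is a made-up helper, the barrier is a single hypothetical tag (`CLB`) rather than a set like S-BOUNDARY2, whose definition isn't shown here, and the wanted tag is tested before the barrier at each word.

```python
def scan(cohorts, start, step, wanted, barrier=None):
    """Scan from `start` in direction `step` (+1 right, -1 left).

    Returns the index of the nearest cohort with a reading tagged
    `wanted`, or None if a barrier cohort or the sentence edge
    comes first.
    """
    i = start + step
    while 0 <= i < len(cohorts):
        tags_here = set().union(*cohorts[i])
        if wanted in tags_here:
            return i
        if barrier is not None and barrier in tags_here:
            return None
        i += step
    return None


# The disambiguated sentence: one set of tags per remaining reading.
sentence = [
    [{"Pron", "Pers", "Sg", "Nom"}],        # minä
    [{"V", "Act", "Ind", "Prt", "Sg1"}],    # luin
    [{"Num", "Card", "Sg", "Nom"}],         # kaksi
    [{"N", "Sg", "Par"}],                   # kirjaa
]

# Roughly the "*-1 V" part of rule 2109: look left from kirjaa for a verb.
print(scan(sentence, 3, -1, "V", barrier="CLB"))  # 1

# A made-up sentence with a clause boundary between the verb and the object.
blocked = [
    [{"V", "Act", "Ind", "Prt", "Sg1"}],
    [{"CLB"}],
    [{"Num", "Card", "Sg", "Nom"}],
    [{"N", "Sg", "Par"}],
]
print(scan(blocked, 3, -1, "V", barrier="CLB"))  # None
```

Without the barrier, the second search would happily reach across the clause boundary and find a verb that belongs to a different clause, which is exactly the kind of mistake longer sentences invite.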
As it turns out, though, the above analysis for this sentence is actually enough to produce a good translation. Once all the words are disambiguated, they're sent off to a generator, which produces the following (with slashes representing dialectal variation):
$ echo "minä luin kaksi kirjaa" | fin-sme
mun/mon lohken guokte girjji/girjje
The sentence also shows the connection between the two languages that the project concerns: if you squint, you can see their relatedness.
#1: Corcaighist (08:06) - 7 Jun 2010
I wish I could understand stuff like this!
#2: Kevin Unhammer (09:06) - 7 Jun 2010
Corcaighist: the Finnish or the CG? (The latter is easy, just read
http://kevindonnelly.org.uk/2010/05/constraint-grammar-tutorial/ ; Finnish however…)
#3: Corcaighist (05:06) - 8 Jun 2010
The Finnish I understood. It was the CG that was troubling. Thanks for the link!