Difference between revisions of "Uni"
From GJ
(+diplomarbeit) |
|||
(2 intermediate revisions by the same user not shown) | |||
Line 1: | Line 1: | ||
Eine Auswahl von Hausarbeiten und Vortragsfolien aus der [http://de.wikipedia.org/wiki/Computerlinguistik Computerlinguistik]. | Eine Auswahl von Hausarbeiten und Vortragsfolien aus der [http://de.wikipedia.org/wiki/Computerlinguistik Computerlinguistik]. | ||
+ | |||
+ | == Diplomarbeit == | ||
+ | |||
+ | [http://jaehnig.org/wiki/images/0/02/Jaehnig_-_Unsupervised_Morpheme_Segmentation_using_Formal_Analogies.pdf Unsupervised Morpheme Segmentation using Formal Analogies], supervisors: Dr. Gerlof Bouma, Dr. Sebastian Varges, July 2011, Introduction: | ||
+ | |||
+ | :''Morpheme Segmentation'' is a well known problem in computational linguistics concerning the task of segmenting words into meaningful units (morphemes). It is often one of the first steps in natural languages processing, right after sentence splitting and word tokenization. Splitting a word into morphemes simplifies further steps like Part-of-Speech tagging, since the number of morphemes in a language is usually smaller than the number of word forms (which often may also be infinite). Thus, morpheme segmentation allows further processing steps to use a smaller lexicon or makes them even possible at all. | ||
+ | |||
+ | :''Unsupervised'' morpheme segmentation (in contrast to ''supervised'') refers to approaches that do not take into account knowledge about the specific language they are segmenting, i.e. these approaches are language-independent. Because of this, such approaches are suitable for morphologically not well-studied languages. And such languages then usually suffer from a low number of speakers, and thus also from a low availability of (digital) data. This makes it hard for quantitatively oriented segmenting approaches to build a model. But also in data of well-studied languages previously unseen words can occur which a segmenter needs to deal with. For such cases, even supervised segmenters may find aspects from unsupervised ones useful. In the annual competition ''[http://www.cis.hut.fi/morphochallenge2010/ Morpho Challenge]'', new approaches of unsupervised morpheme segmentation are regularly competing. | ||
+ | |||
+ | :''Formal analogies'', the other part of this thesis topic, is a notion based on the concept of analogy, having its roots in ancient philosophy. An analogy describes a relation between four elements as "A is to B as C is to D", for example: "an electron is to the nucleus asa planet is to the sun"; (Lepage 1998). Formal analogies can be seen as a subset of analogies where the analogy is based on the ''form''. In case of words, this form is described by their characters, e.g. "''lay : lays :: say :: says''" (writing ":" for "is to" and "::" for "as"). The big advantage of formal analogies is that they can be validated automatically by a computer without any knowledge about the language of the words but only by looking at the set and order of their characters. This checking algorithm divides the words into factors and these factors do often resemble morphemes, an observation leading to the idea that finding and checking formal analogies within a given wordlist may produce good morpheme candidates. | ||
+ | |||
+ | :In my diploma thesis (Diplomarbeit) I want to study the potential of Unsupervised Morpheme Segmentation using formal analogies. I will examine previous approaches and implement a segmenter based on formal analogies. | ||
+ | |||
+ | :The overall structure is as follows: In chapter 2, I will give on overview on unsupervised morpheme segmentation, in chapter 3 I will do the same for formal analogies. Chapter 4 provides a outline on particular segmentation approaches. In chapter 5, I present in detail my approach which is going to be evaluated in chapter 6. The thesis ends with a Conclusion in chapter 7. | ||
== Analogien == | == Analogien == | ||
Line 9: | Line 23: | ||
== Endliche Automaten und Semiringe == | == Endliche Automaten und Semiringe == | ||
=== Hidden-Markov-Modelle === | === Hidden-Markov-Modelle === | ||
+ | * [http://jaehnig.org/wiki/images/3/3d/Creating_Shuffle_Automata_using_Composition_and_Intersection.pdf Creating Shuffle Automata using Composition and Intersection], Short Instruction, a byproduct of my diploma thesis, January 2011, Introduction: | ||
+ | |||
+ | :Creating an automaton that accepts the shuffle language of two given automata is usually done by the shuffle operation defined on the states of the automata. This short instruction shows how a shuffle automaton can be created using standard algebra operations like composition and intersection. It might be useful when working with a FSM framework that has no shuffle operation implemented. | ||
+ | |||
* [http://jaehnig.org/wiki/images/e/e2/Hidden-Markov-Modelle_als_Gewichtete_Endliche_Automaten.pdf Hidden-Markov-Modelle als Gewichtete Endliche Automaten], Hausarbeit in Co-Produktion mit Alexander Becker zum Hauptseminar '''Theorie semiringgewichteter Automaten''' bei Dr. Thomas Hanneforth, SS 2007, Zusammenfassung: | * [http://jaehnig.org/wiki/images/e/e2/Hidden-Markov-Modelle_als_Gewichtete_Endliche_Automaten.pdf Hidden-Markov-Modelle als Gewichtete Endliche Automaten], Hausarbeit in Co-Produktion mit Alexander Becker zum Hauptseminar '''Theorie semiringgewichteter Automaten''' bei Dr. Thomas Hanneforth, SS 2007, Zusammenfassung: | ||
Line 21: | Line 39: | ||
* [http://jaehnig.org/wiki/images/2/2b/Longest_Match_in_Directed_Replacement.pdf Longest Match in Directed Replacement.pdf], Präsentations-Folien dazu. | * [http://jaehnig.org/wiki/images/2/2b/Longest_Match_in_Directed_Replacement.pdf Longest Match in Directed Replacement.pdf], Präsentations-Folien dazu. | ||
+ | |||
+ | === Operationen und Semiringe === | ||
+ | Kurze Übersichten / Small reference sheets: | ||
+ | * [[:Image:Operations on Semiring-Weighted Finite State Machines.pdf|Operations on Semiring-Weighted Finite State Machines]] | ||
+ | * [[:Image:Semiring properties.pdf|Semiring properties]] | ||
== Verschiedenes == | == Verschiedenes == | ||
* [http://jaehnig.org/wiki/images/a/a9/Intonation_im_Berlinischen.pdf Intonation im Berlinischen], Präsentations-Folien zum Proseminar '''Regionale Variation in der Intonation''' bei Dr. Frank Kügler, SS 2005 | * [http://jaehnig.org/wiki/images/a/a9/Intonation_im_Berlinischen.pdf Intonation im Berlinischen], Präsentations-Folien zum Proseminar '''Regionale Variation in der Intonation''' bei Dr. Frank Kügler, SS 2005 | ||
* [http://jaehnig.org/wiki/images/1/1e/Left-Corner-Parser.pdf Left-Corner-Parser], Präsentations-Folien zum Proseminar '''CL-Techniken in Prolog''' bei Dr. Thomas Hanneforth, SS 2004 | * [http://jaehnig.org/wiki/images/1/1e/Left-Corner-Parser.pdf Left-Corner-Parser], Präsentations-Folien zum Proseminar '''CL-Techniken in Prolog''' bei Dr. Thomas Hanneforth, SS 2004 |
Latest revision as of 23:21, 2 August 2011
Eine Auswahl von Hausarbeiten und Vortragsfolien aus der Computerlinguistik.
Contents
Diplomarbeit
Unsupervised Morpheme Segmentation using Formal Analogies, supervisors: Dr. Gerlof Bouma, Dr. Sebastian Varges, July 2011, Introduction:
- Morpheme Segmentation is a well known problem in computational linguistics concerning the task of segmenting words into meaningful units (morphemes). It is often one of the first steps in natural languages processing, right after sentence splitting and word tokenization. Splitting a word into morphemes simplifies further steps like Part-of-Speech tagging, since the number of morphemes in a language is usually smaller than the number of word forms (which often may also be infinite). Thus, morpheme segmentation allows further processing steps to use a smaller lexicon or makes them even possible at all.
- Unsupervised morpheme segmentation (in contrast to supervised) refers to approaches that do not take into account knowledge about the specific language they are segmenting, i.e. these approaches are language-independent. Because of this, such approaches are suitable for morphologically not well-studied languages. And such languages then usually suffer from a low number of speakers, and thus also from a low availability of (digital) data. This makes it hard for quantitatively oriented segmenting approaches to build a model. But also in data of well-studied languages previously unseen words can occur which a segmenter needs to deal with. For such cases, even supervised segmenters may find aspects from unsupervised ones useful. In the annual competition Morpho Challenge, new approaches of unsupervised morpheme segmentation are regularly competing.
- Formal analogies, the other part of this thesis topic, is a notion based on the concept of analogy, having its roots in ancient philosophy. An analogy describes a relation between four elements as "A is to B as C is to D", for example: "an electron is to the nucleus asa planet is to the sun"; (Lepage 1998). Formal analogies can be seen as a subset of analogies where the analogy is based on the form. In case of words, this form is described by their characters, e.g. "lay : lays :: say :: says" (writing ":" for "is to" and "::" for "as"). The big advantage of formal analogies is that they can be validated automatically by a computer without any knowledge about the language of the words but only by looking at the set and order of their characters. This checking algorithm divides the words into factors and these factors do often resemble morphemes, an observation leading to the idea that finding and checking formal analogies within a given wordlist may produce good morpheme candidates.
- In my diploma thesis (Diplomarbeit) I want to study the potential of Unsupervised Morpheme Segmentation using formal analogies. I will examine previous approaches and implement a segmenter based on formal analogies.
- The overall structure is as follows: In chapter 2, I will give on overview on unsupervised morpheme segmentation, in chapter 3 I will do the same for formal analogies. Chapter 4 provides a outline on particular segmentation approaches. In chapter 5, I present in detail my approach which is going to be evaluated in chapter 6. The thesis ends with a Conclusion in chapter 7.
Analogien
- Textgenerierung mit Analogien, Hausarbeit zum Hauptseminar Textgenerierung bei Prof. Manfred Stede, WS 2007/08 Zusammenfassung:
- hacken verhält sich zu hackte wie backen zu backte - solche sprachlichen Analogien werde ich in dieser Arbeit beschreiben und formal definieren. Darauf aufbauend habe ich eine vorliegenden Textgenerator implementiert, der aus den gegebenen drei Phrasen eine vierte generiert, so dass eine Analogie entsteht. Diesen heuristischen Textgenerator werde ich ausführlich beschreiben sowie meine Verbesserungen vorstellen. Schließlich zeige ich mit einigen generierten Wörtern und Sätzen, welche Art von Analogien im Deutschen existieren und was der Generator produzieren kann.
- Analogien - eine kleine Einführung, Vortrag auf der 18. TaCoS.
Endliche Automaten und Semiringe
Hidden-Markov-Modelle
- Creating Shuffle Automata using Composition and Intersection, Short Instruction, a byproduct of my diploma thesis, January 2011, Introduction:
- Creating an automaton that accepts the shuffle language of two given automata is usually done by the shuffle operation defined on the states of the automata. This short instruction shows how a shuffle automaton can be created using standard algebra operations like composition and intersection. It might be useful when working with a FSM framework that has no shuffle operation implemented.
- Hidden-Markov-Modelle als Gewichtete Endliche Automaten, Hausarbeit in Co-Produktion mit Alexander Becker zum Hauptseminar Theorie semiringgewichteter Automaten bei Dr. Thomas Hanneforth, SS 2007, Zusammenfassung:
- Hidden-Markov-Modelle sind verbreitete stochastische Modelle, die Zufallsprozesse beschreiben. Sie verwenden dazu eigene Algorithmen wie den Forward- und den Viterbi-Algorithmus. Wir dokumentieren, wie Hidden-Markov-Modelle in Gewichtete Endliche Automaten und Transduktoren umgewandelt und wie Forward- und Viterbi-Algorithmus als Semiring dargestellt werden können. Abschließend betrachten wir einzelne nützliche Eigenschaften Gewichteter Endlicher Automaten: deren Determinisierbarkeit, die Epsilon-Entfernung und das Produkt zweier Automaten.
- Hidden-Markov-Modelle als Gewichtete Endliche Automaten, Präsentations-Folien dazu.
Longest Match
- Longest Match mit Gewichteten und Ungewichteten Endlichen Automaten, Hausarbeit in zum Hauptseminar Theorie und Anwendungen endlicher Automaten und Transduktoren bei Dr. Thomas Hanneforth, WS 2007/08, Zusammenfassung:
- Mehrdeutige Ersetzungsregeln können zu mehrdeutigen Ausgaben führen. Nicht immer ist das gewollt. In dieser Arbeit stelle ich 2 Ansätze von (Karttunen 1996) und (Hanneforth 2005) vor, die zunächst die Mehrdeutigkeiten einordnen und mehrere Operatoren zur Unterscheidung einführen und danach Lösungen mit Endlichen Automaten vorstellen, die eindeutige Ausgaben dieser mehrdeutigen Ersetzungsregeln erzwingen. Beide Algorithmen beschreibe ich ausführlich mit Beispielen und stelle auch deren Unterschiede heraus, die sich sehr auf Komplexität und damit Anwendbarkeit auswirken. Einen der Ansätze vereinfache ich leicht.
- Longest Match in Directed Replacement.pdf, Präsentations-Folien dazu.
Operationen und Semiringe
Kurze Übersichten / Small reference sheets:
Verschiedenes
- Intonation im Berlinischen, Präsentations-Folien zum Proseminar Regionale Variation in der Intonation bei Dr. Frank Kügler, SS 2005
- Left-Corner-Parser, Präsentations-Folien zum Proseminar CL-Techniken in Prolog bei Dr. Thomas Hanneforth, SS 2004