# Language Identification (Alan Du, 9/16/13)

We talked about grammatical inference — how you would identify a language (and, by extension, how none of us have ever actually learned a language). Slides are available here.

NOTES:

# Basic Concepts

We define an alphabet ∑, which we define as just some set of symbols. Some really common ones are the binary alphabet {0, 1}, and the Roman alphabet {A-Z, a-z}. For our purposes, it’ll probably be convenient to make our alphabet a set of words, like {He, she, ran, move, can, …}.

A string is a ordered sequence of symbols from the alphabet. Think programming. So, “010101” is a string w/ the binary alphabet, “hello” is a string w/ the Roman alphabet, and “He can move” is a string from our word alphabet.

A language is just some set of strings. Usually, we’ll talk about infinite languages, languages which have an infinite number of strings.

grammar is a set of rules that determine whether a string fits in a given language.

# The Gold Standard

In 1967, EM Gold published his Language Identification to the Limit, which was the standard that all grammatical inference aspired to.

In language identification, we start with some family of languages L*. We choose some language L from L*, which is the target language. We then generate some text T (a text is just a sequence of strings) from L. The goal is to design some algorithm A to identify L from L* based on T.

Let’s give an example to clear things up. Suppose we want to have some machine that will tell you what language a book is written in. Then, our family of languages, L*, is just the set of natural (human) languages. T is just the text of the book, and A is just the machine that actually does the identifying. Our target language L, is the language that the book is actually written in.

A language is identifiable in the limit if after some finite number of strings in T, our algorithm will converge and guess the correct language. In other words:

$\lim_{n\rightarrow\infty} A(T_n) = L$, where $T_n$ is the text with n strings.

So, what kind of languages are identifiable in the limit? Well, it turns out only finite languages (languages with only a finite number of strings) are learnable. As long as there is a single infinite language (e.g. a superfinite language class), then you can’t learn.

Why? Well, the proof isn’t terribly difficult. Suppose that we’re trying to identify some infinite language L (e.g. English). We can then construct some text T so that no algorithm can guess L. We first present all strings in L that only have two words (eg. {“I am”, “he is”, “they are”, …}). Algorithm A will guess that T is from English restricted to only two words. Then, we present all strings in L with only three words. Algorithm A will then have to change its guess. We then present L with only four words, then 5 words, ad infinitum.  Algorithm A will never converge onto L; it’ll only converge onto successive approximations of L.

Let’s pause here and consider the implications. Essentially, we’re saying that NO interesting language is learnable.  This is obviously problematic. After all, humans learn language everyday. If it’s impossible, then how are we doing it?

# Relaxing the Constraints

One theory is that the Gold standard is way too strict. After all, we can never be completely sure that we’ve learned a language correctly. But we can become very close to completely sure. In other words, although we’re not 100% certain, we have an extremely high probability of being right.

There are two ways of formalizing this notion. One of them is called measure-1 learnability, while the other is called probably approximately correct (PAC).

Let’s talk about measure-1 learnability first. Measure-1 learnability is essentially the same thing as identifiable to the limit. The only difference is that instead of requiring that A converge on the correct language, we require the probability that A is correct to approach 1.

The other is PAC. For PAC, we observe that some languages are closer than others. For example, if I had some book in my hands that was written in Italian, I would be much closer if I guessed French than if I guessed Mandarin. Or, if it was written in Arabic, I’d be much closer to guess Farsi than if I guessed Cherokee. So we define some distance function D that quantifies the distance between two languages.

Then for PAC, we require that as we see more and more text, the distance that we observe gets smaller and smaller. More formally, $\lim_{n\rightarrow \infty} Pr[ D(A(T_n), L) > \epsilon ] = 0$.

Let’s stop and consider that monster for a second. We’re looking at the probability that our guess is some distance $\epsilon$ away from the actual answer. In other words, the probability that we are approximately correct. And we know the limit states that as we see more and more text, then the probability that we are approximately correct approaches 1. In other words, we are probably approximately correct.

Just a small note: I believe that PAC and measure-1 learnability are the same thing, but I’m not sure. Just in case, I’ve made sure to use the right terminology below.

Horning and PCFGs:

In 1969, James Jay “Jim” Horning published a thesis about language identification. He looked at a way to actually compute probabilities of guessing correctly, so we could actually study which languages are PAC learnable and measure-1 learnable.

He did this by using Probabilistic Context Free Grammars (PCFGs). A PCFG lets you calculate the probability of a sentence by using the rewrite rules that we discussed earlier. In a PCFG, you first assign each rewrite rule some probability. To get the probability of a sentence, you multiply all the probabilities of the individual rewrite rules together, assuming that each rewrite rule is independent of each other.

Horning proved, through some clever Bayesian inference-type stuff, that if you assume languages are generated by PCFGs, then ALL languages are PAC learnable. Why? Well, because in a PCFG, rewrite rules are assumed to be independent. So by increasing the number of rewrite rules used in a sentence, you exponentially decrease the probability of that sentence. At some depth, the sentences have such low probability that you can ignore them. So, in effect, we’ve found a way to transform all languages into finite languages, which we already knew were learnable.

There’s a problem here though. We assumed that languages were generated by PCFGs, which is a VERY large assumption, and also one that is definitely untrue for humans (although it’s a neat simplifying assumption). But if we make NO assumptions about how languages are generated (ie they are distribution free), then Dana Alguin proved that any language that is PAC learnable is also Gold learnable. In other words, switching to probabilities didn’t actually help us if we make no additional assumptions.

# Computational Intractability

One problem that we’ve been completely ignoring is how to efficiently identify languages. Hornning’s used an enumerative approach; essentially, he just guessed almost all possible languages. But this is impossible for any large grammars: Horning proved that the number of possible grammars with N words is $\Theta(2^{n^3})$. For those who don’t know their big theta notation, that is really freaking huge. If we only have 7 words, then we have somewhere around 1800 googol possible grammars. Way more grammars than we can brute force.

Unfortunately, Gold proved that there will always be some cases where brute force works the best. And more generally, Bayesian inference, which would get us an approximate answer, is usually NP-hard. NP-hard problems can’t be solved for in any reasonable amount of time. So here, we have another problem: humans learn most of their grammar in just three years. How can they do it so quickly?

# The Solution

How do linguists account for these problems? Well, linguists (or at least theorists like Dr Hornstein) believe that this is just more evidence for a built-in faculty of language. Remember, language identification is only impossible if we don’t know what process generates the language. But if the rules are built in, then Horning should that grammars should at least be PAC learnable.

These rules must also significantly constrain the set of languages we can learn. After all, if there are too many possible languages to differentiate in any reasonable amount of time. That means a priori, humans must be restricted to a very small subset of languages.

On a side note, this means that communication with aliens may be impossible. If aliens don’t use the kinds of grammars that humans can learn, then we’ve shown that it’d take too long to learn and is probably impossible (this is a gross simplification, of course).

Another solution is to argue that everything we’ve done here doesn’t apply to humans. After all, we’ve only looked at text-based identification. But we know that babies don’t just learn English by listening to it. They also try to speak it, and they get feedback on their sentences. We’ve totally ignored how feedback changes the picture (for the curious: adding feedback makes most languages Gold learnable. I believe it’s still NP-hard, but I’m not sure).

# Summary (TL;DR):

Most languages aren’t exactly identifiable. In fact, you can’t even be probably approximately correct. And even if you could be sure, it would take a horrendous amount of time to inspect every possible grammar that fits the sentences you observed. That means that humans can’t possibly consider every possible language when they’re learning. Instead, they only even consider a very small (relative to infinity) subset of languages.