Linguistics of Word Research Paper

In thinking about words, the first question is how to divide an utterance up into them. The simple answer is that words are demarcated by spaces, just as they are on this page. But this simple answer depends on the existence of writing. In speech, we do not normally leave a space or pause between words. Most languages throughout history have not been written down. Surely we do not want to say that only written languages have words. Even with written languages, spacing does not provide an entirely satisfying answer. For example, English compounds can be spelled in three ways: open, closed, or hyphenated, and some items can be spelled in any of these three ways without being affected in any detectable way: birdhouse, bird-house, bird house. We do not want to say that the first spelling is one word, the last spelling two words, and the middle one neither one word nor two, which we would have to do if we accepted spaces as criterial. The better conclusion is that spelling conventions are not a completely reliable clue to whether something is a word or not.

Some linguists avoid the problem by claiming that the whole notion ‘word’ is theoretically invalid, just an artifact of spelling. In their favor is the fact, which people are always surprised to learn, that not all languages have a word for ‘word.’ The classical languages, biblical Hebrew, classical Greek, and classical Latin, for example, all have terms that are systematically ambiguous among ‘speech,’ ‘word,’ and ‘utterance,’ but none has terms that distinguish clearly among these notions and certainly none has a special term that means just ‘word.’ Even in the opening verse of the Gospel of John, ‘In the beginning was the Word,’ it is still not clear just what is the meaning of the Greek word λόγος [logos] that we conventionally translate as ‘word.’ Many scholars believe that the best translation is ‘thought’ or ‘reason.’

The classical languages are not alone. The anthropologist Bronislaw Malinowski declared that the distinction between word and utterance was not an obvious one to most peoples, that ‘isolated words are in fact only linguistic figments, the products of an advanced linguistic analysis’ (Malinowski 1935, p. 11), suggesting that there was no reason for most languages to distinguish between words and utterances. Nonetheless, most modern linguists believe that all languages do have words, whether their speakers are aware of the units or not. The question then becomes how to figure out what a word is and how to identify words in a way that is valid for all languages, written or spoken. Since language is first and foremost spoken, our answer must not depend on writing.

The earliest explicit discussion that we have of the notion ‘word’ and of words in the speech stream is in the work of Aristotle. Aristotle made a distinction between an utterance or sentence, for which he used the term λόγος [logos] (confusingly, the term that we usually think of as meaning ‘word’), and a word, which he called μέρος λόγου [meros logou], literally ‘piece of an utterance.’ Aristotle’s term is still with us, since it was translated literally into the Latin term pars orationis, from which in turn we derive the English term ‘part of speech’ by a similar literal translation. He defined this new entity as a component of a sentence, having a meaning of its own and not further divisible into meaningful units (De interpretatione 2–3). In other words, Aristotle saw the speech stream as being divided up into atomic pieces, each being what we now call a word. The problem of defining words and identifying them was thus one and the same for him. Aristotle, incidentally, identified only two types of words, or parts of speech, which correspond roughly to our own nouns and verbs.

It is difficult for us to appreciate the importance of Aristotle’s realization that the speech stream could be divided into pieces, because we are aware of words. We have been taught to think of language in terms of words, and if we think about speech, it is not as a stream, but as words lined up one after the other, like pearls separated by knots on a string. But in a culture without writing or in a culture such as that of ancient Greece, in which writing played an extremely small part in the life of even the most educated people, this conception is not at all obvious. After all, the philosophical school of Aristotle was called peripatetic because his method of teaching consisted of discussion, conducted while walking about in the Lyceum of Athens. This intellectual style was purely oral, not written, and for both Plato and Aristotle, language was spoken, not written. Indeed, Aristotle’s teacher Plato had a deep distrust of writing, which he called inhuman. This role of spoken language in classical Greece must be underscored because spoken language, unlike modern written English, is not a string of pearls. We do not speak in single words with spaces between, and so the discovery that speech could be broken down into smaller component words was indeed remarkable.

Was it an accident that Aristotle made his discovery about words in one of the first societies with writing? Many people would say that written language, even an alphabet, is a prerequisite for linguistic awareness: we cannot begin to analyze the stream of language until we see it recorded in writing. This claim faces a number of obstacles, most prominently the fact that the most advanced system of linguistic analysis known prior to the advent of modern linguistics in the nineteenth century was that of the Sanskrit grammarians, the best known of whom was Panini. Panini flourished around 500 BCE. His grammar was embedded within an oral tradition of grammatical analysis of the Sanskrit sacred texts that must have begun long before him. Strikingly, we have no evidence at all of writing in Sanskrit until at least several hundred years after Panini. This leads to the conclusion that the Paninian tradition of linguistic analysis existed before the advent of writing in India. That tradition recognized not only words but also the internal parts of words, and it used the notion of ‘zero’ long before the Hindu mathematicians who introduced it to the rest of the world.

So writing is not a prerequisite to recognizing that the stream of language can be broken down into words. Even so, having language written down as an object must have assisted in the recognition of the words. The evidence from early writing is tantalizing, but mixed. On the one hand, the earliest writing systems, notably Sumerian cuneiform and Egyptian hieroglyphics, had no consistent means of dividing the text into words. Similarly, ancient Semitic and Greek texts were usually written continuously, with no spaces between words, and the practice extended into medieval times for Latin. Even among modern languages, Chinese, to this day, is always written without any spaces between words. On the other hand, later cuneiform scripts did mark the boundaries of words, and the Etruscans, whose alphabet was based on the Greek alphabet, marked these from quite early on by means of a centered dot, a practice continued by the Romans. When one looks at the inscriptions on such monuments of Imperial Rome as Trajan’s column, inscribed in the same Roman forms that we use most commonly today on our word processors and which are consequently so strikingly readable, one sees these same dots separating the words. In the Semitic scripts that descend from the first alphabet there are special forms of certain letters that appear only at the ends of words. The Greek letter sigma has two forms, one of which is used only at the ends of words. Even when there were no spaces between words, these languages could not be written without knowing where a word ended. Finally, the Aramaic scripts (notably Syriac, in which many important documents of the early Christian church were written, and which is still important for Christian scholars today) were the first to develop a cursive form, where all the letters of a word are joined together into a single unit, again impossible without knowing where a word begins and ends.

So there is evidence from early in the history of writing that, although a writing system can record language without breaking it down into words, it is equally possible to represent the breaks between words. What permits people to do this reliably? How do we know where one word ends and the next one begins in spoken language?

For some languages, the answer lies in sound, specifically word stress or word accent. In the simplest case, every word in the language is stressed on the same syllable, so that one can quite easily recognize the words directly from the stresses. Hungarian and Czech, for example, have inexorable word-initial stress, while Persian has word-final stress and Polish has stress on the second-to-last syllable. In all of these cases, even if a person does not speak the language, he or she can usually tell directly from the stress where the word breaks are. Even a computer, properly programmed, would be able to break up the speech stream of these languages into words without knowing what any of these words mean.
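The fixed-stress procedure can be sketched as a short program. What follows is a minimal illustration in Python, not a description of any real speech system. It assumes that the stream has already been divided into syllables, that each syllable is marked as stressed or unstressed, and that every word carries exactly one main stress; for second-to-last stress it further assumes that every word has at least two syllables. The function name segment_by_fixed_stress and the placeholder syllables are invented for the example.

```python
from typing import List, Tuple

# A syllable is its text plus a flag saying whether it carries main stress.
Syllable = Tuple[str, bool]


def segment_by_fixed_stress(syllables: List[Syllable], position: str) -> List[List[str]]:
    """Group syllables into words, given where main stress always falls.

    position is "initial", "final", or "penultimate" (second-to-last).
    Assumes exactly one main stress per word (an idealization).
    """
    if position not in ("initial", "final", "penultimate"):
        raise ValueError("unknown stress position: " + position)

    words: List[List[str]] = []
    current: List[str] = []

    if position == "initial":
        # A stressed syllable opens a new word.
        for text, stressed in syllables:
            if stressed and current:
                words.append(current)
                current = []
            current.append(text)
    else:
        # The word ends a fixed number of syllables after the stressed one:
        # 0 for final stress, 1 for second-to-last stress.
        trailing = 0 if position == "final" else 1
        remaining = None  # syllables still expected after the stress
        for text, stressed in syllables:
            current.append(text)
            if stressed:
                remaining = trailing
            elif remaining is not None:
                remaining -= 1
            if remaining == 0:
                words.append(current)
                current = []
                remaining = None

    if current:
        words.append(current)  # leftover syllables form a last group
    return words


# Placeholder syllables, not words of any real language.
stream = [("ba", True), ("di", False), ("ku", True), ("lo", False), ("me", False)]
print(segment_by_fixed_stress(stream, "initial"))
```

The program never consults a dictionary: it finds the breaks from the stress marks alone, which is the point of the paragraph above. It would fail on languages where the position of stress is not predictable.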

Of course, not all languages are quite so simple. In English and Russian, for example, although every word has a main stress, so one can determine the number of words in an utterance from the number of main stresses, the stress patterns of these words are not easily predictable; some words are stressed on the first syllable (e.g., ‘sympathize’), while others are stressed on the last (e.g., ‘kangaroo’) and others on the second-to-last (‘remember’) or even the third-to-last (‘anomaly’). A fluent speaker just knows for every word where the stress lies, and speakers seldom make mistakes. No one shifts the stress to the wrong syllable of ‘sympathize,’ ‘kangaroo,’ or ‘anomaly.’ So an English speaker has no trouble telling where the word breaks are, but the task is quite overwhelming for a learner or for a simple computer program. In the most problematic type of languages for a stress-based method of word division, as with French or many of the languages of India, there is no word stress, so that phonology cannot be used to detect word division in a simple way.

But French still has words, so there must be some nonphonological way to isolate words. One clever test that linguists use for word divisions involves interruption. Take a sentence like TheformerleaderremainedinsidetheWhiteHouse, which is here written without spaces in order to emphasize the difficult nature of the task. Where can this sentence be interrupted? We may understand this question in two ways. One involves pauses: where is it possible to pause naturally? The other involves insertion: where can we insert elements in the string without affecting its structure? Curiously, the two give us slightly different answers. The natural-pause criterion identifies words as follows: Theformer leader remained inside theWhiteHouse. Insertion gives us the following breaks (with inserted elements in parentheses): The (recalcitrant) former (American) leader (stubbornly) remained (ensconced) inside (precisely) the (stately) White House. The resulting sentence may not be elegant, but it is English. We do not normally pause after the definite article ‘the’ in English, because the article is what linguists call a clitic, an element that cannot be pronounced as a word all by itself, though other criteria such as insertion may identify it as a word. We cannot either pause or insert an element between the two elements of ‘White House’ because, despite the space in the standard orthography, ‘White House’ is a single compound word. We may then use the twin criteria of possible natural pause and insertion in order to identify the breaks between words in a spoken utterance. These criteria, though seemingly simple, depend on the possibility of pause and insertion. A learner thus has to compare numerous utterances before being able to break an utterance up reliably into words.

Where to pause or insert may seem obvious (though again our familiarity with written language may give us false confidence), but there are places in our sentence where, although linguistic analysis tells us there are breaks between elements, these breaks are below the word level, and where, though a machine might be tempted, both pause and insertion are impossible for speakers of the language. For example, ‘leader,’ ‘remained,’ and ‘inside’ can easily be divided into ‘lead-er,’ ‘remain-ed,’ and ‘in-side,’ yet no English speaker would insert an element or pause at these divisions. Here we are dealing with the meaningful parts of words, which linguists call morphemes, the divisions between which are not easily recognized by speakers.

Aristotle identified two types of words, nouns and verbs. School grammars permit more: adjectives, adverbs, pronouns, prepositions, and conjunctions. Linguists call these different types lexical categories and distinguish between open and closed categories. Open categories are those to which new members may be added easily, either by word formation or by borrowing from other languages. In English, the open categories are noun, verb, and adjective. We add new nouns at a tremendous rate, verbs and adjectives more slowly. The other categories are closed: for example, no new pronoun has been added to English since ‘them’ was borrowed from Old Norse over a millennium ago.

The lexical categories differ from language to language. The class of adjectives is much smaller in many languages, and in some, like the Dravidian languages spoken widely in South India, it has only a dozen or so members. Even the class of verbs may be very small in some languages (e.g., Bengali). English also has more prepositions than most languages; some Austronesian languages of Indonesia have only one. Languages may even lack certain lexical categories. English is unusual in having a category of adverbs separate from adjectives, and many linguists believe that the distinction between the two is not even valid for English. It has also been claimed that some languages do not distinguish nouns from verbs.

How to define the categories is another question. The traditional definitions are based on meaning: a noun denotes a person, place, or thing; a verb denotes an action or state; an adjective denotes an attribute. But these definitions are problematic. Nouns like ‘love’ or ‘confusion’ denote states, not things, as do adjectives like ‘solid’ and ‘angry,’ while relational nouns or verbs like ‘mother,’ ‘son,’ and ‘include’ fall under none of the standard meaning-based definitions. Modern linguists prefer instead to define the categories grammatically, in terms of how they function in a sentence. A noun is thus the head or main word of a subject or object phrase, an adjective or adverb a modifier, and a verb the main word of a predicate phrase.

Over the last century, the major research preoccupation of linguists concerned with words has been with analyzing their internal structure or morphology. Almost all languages have internally complex words, though a few, notably Vietnamese, have very simple word structure, limited largely to compounds, words formed by combining words, like English ‘shoelace’ or ‘boxcar.’ In most languages, the majority of complex words are formed with suffixes like the ‘-er’ and ‘-ed’ in ‘leader’ and ‘remained’ above. Somewhat less frequent across languages are prefixes like ‘un-’ in ‘undo,’ ‘unpack,’ and ‘unroll.’ In English and many other languages, prefixes and suffixes may combine, resulting in words like ‘re-institut-ion-al-iz-ation’ or the infamous ‘anti-dis-establish-ment-ari-an-ism.’ Less common means of forming complex words may involve internal sound changes like ‘mouse/mice’ or ‘goose/geese’ or even changes in stress like that between the verb ‘reject’ (stressed on the second syllable) and the noun ‘reject’ (stressed on the first). Another mechanism is to repeat part of a word: in ancient Greek, the past tense of a verb is produced by repeating the first consonant; the stem ‘graph’ (meaning ‘write’) becomes ‘gegraph’ in the past tense, and similarly for other verbs.

Linguists distinguish two grammatical functions of complex words, inflection and derivation. The word inflection descends from the Latin verb meaning ‘bend.’ The idea is that a word bends or adjusts its shape to fit its context in a sentence. In English, the vast majority of verbs inflect, or adjust their shape, for person and tense by suffixation in a simple fashion: the past tense or past participle form of most verbs is formed by adding the suffix ‘-ed’ as in ‘remain-ed,’ the third person singular present adds ‘-s’ (‘remains’), and the present participle adds ‘-ing’ (‘remaining’). Elsewhere, the verb is morphologically simple (‘remain’). We may say that the four verb forms (‘remained,’ ‘remains,’ ‘remaining,’ and the uninflected ‘remain’) constitute a set, with the grammar deciding the context in which each member of the set will appear. We call this set of forms a lexeme and we use the uninflected form (here ‘remain’) to name the set. For most English nouns, there are two inflected forms in the set of each lexeme, the singular and the plural.
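The regular English verb pattern just described is simple enough to state as a rule. The Python sketch below builds the four-member set of forms from the uninflected form by plain suffixation. It is only an illustration: the function name and the labels for the forms are invented here, and the rule deliberately ignores spelling adjustments (as in ‘bake,’ ‘baked’) and irregular verbs.

```python
from typing import Dict


def regular_verb_forms(base: str) -> Dict[str, str]:
    """Return the four forms of a regular English verb lexeme.

    Naive suffixation only: no spelling adjustments, no irregular verbs.
    The dictionary keys are labels chosen for this sketch.
    """
    return {
        "uninflected": base,
        "third_singular_present": base + "s",
        "past_or_past_participle": base + "ed",
        "present_participle": base + "ing",
    }


forms = regular_verb_forms("remain")
print(sorted(forms.values()))
```

The uninflected form serves both as one member of the set and as the name of the whole set, which mirrors the convention of naming a lexeme by its uninflected form.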

English inflection is very simple, but other languages are much more complex. In many of the Indo-European relatives of English, nouns and adjectives inflect differently depending on their function in a sentence, the way English pronouns do (‘I, me, my, mine’). A Russian noun or adjective will usually have ten inflectional forms. Verbs can be much more complex. Some languages have noun classes or genders, a dozen or more in the Bantu languages that dominate much of sub-Saharan Africa, and in these languages many verbs have a distinct inflectional form for every gender. Verbs also often inflect for tense and for the person and number of the subject, with the result that verb inflection may become baroque, with over two hundred forms for every verb in ancient Greek, a thousand or more in Navajo, and a theoretically infinite number for each verb in Turkish. Experimental studies have shown that the inflected forms of a single lexeme are very closely tied to one another in the mind.

The other mechanism for forming complex words is derivation, which results in new lexemes, rather than contextually limited forms of the same lexeme. Thus, ‘bakes,’ ‘baked,’ and ‘baking’ are all inflected forms of a single verb lexeme, ‘bake,’ but ‘baker’ and ‘bakery’ are different words, nouns, distinct lexemes from ‘bake,’ though formed by the same general method (suffixation) as the inflected forms. And each of these nouns in turn may be inflected: ‘bakers, bakeries.’ Two different lexemes like ‘bake’ and ‘baker’ will normally not be as tightly connected in a person’s mind as are the forms of a single lexeme.
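The difference between inflection and derivation can be pictured as a difference in how forms are grouped. The Python sketch below is one hypothetical way of recording it, using only the examples from the text: inflected forms are stored together under a single lexeme, while a derived word is a separate lexeme that merely points back to its source. The class Lexeme, its field names, and the helper same_lexeme are invented for the illustration and make no claim about how words are actually stored in the mind.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Lexeme:
    name: str                           # uninflected form that names the set
    category: str                       # lexical category, e.g. "verb"
    forms: Tuple[str, ...]              # inflected forms belonging to the set
    derived_from: Optional[str] = None  # source lexeme, if formed by derivation


# A toy lexicon built from the examples in the text.
LEXEMES = [
    Lexeme("bake", "verb", ("bake", "bakes", "baked", "baking")),
    Lexeme("baker", "noun", ("baker", "bakers"), derived_from="bake"),
    Lexeme("bakery", "noun", ("bakery", "bakeries"), derived_from="bake"),
]

# Index every inflected form by the lexeme it belongs to.
FORM_TO_LEXEME: Dict[str, Lexeme] = {
    form: lexeme for lexeme in LEXEMES for form in lexeme.forms
}


def same_lexeme(a: str, b: str) -> bool:
    """True if two word forms are inflected forms of one lexeme."""
    return FORM_TO_LEXEME[a].name == FORM_TO_LEXEME[b].name


print(same_lexeme("bakes", "baking"))   # forms of the verb lexeme 'bake'
print(same_lexeme("baked", "bakers"))   # 'bakers' belongs to the noun 'baker'
```

In this layout ‘bakers’ is found under ‘baker,’ not under ‘bake’; the link to ‘bake’ survives only as the derived_from field, a looser connection than membership in the same set of forms.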

This brings us to the general question of how words are stored in the human mind, how we retrieve them when we speak or write, and how we recognize them when we understand language. This is the frontier of linguistic research on words, and it involves cooperation among linguists, psychologists, and neuroscientists. The results of this line of research are very exciting, though not well enough established to allow for firm conclusions. What we can say is that the question of what is a word will continue to be fundamental, even as the methods of research on language become more and more sophisticated.


