
#2804: Creole Language Computer .... from Jeff Allen (fwd)


Regarding #2324 Creole Language Computer Software Development Update, Joel 
Dreyfuss wrote:

>What would be very useful for those of us who are 
>semi-literate in Kreyol is a Microsoft Word dictionary 
>and spell-checker in the language. Any possibility 
>that this could happen?

Marilyn Mason replied:

Subject : #2391: Creole Language Computer Software Development Update: from 
Marilyn Mason
Date : 2/18/00 2:14:05

>Yes, there is no technical reason why there could not 
>be a user-friendly, fully functional software dictionary 
>/spellchecker for Haitian Creole.
>But such an undertaking would require (not necessarily 
>in their order of importance): time, personnel, money, 
>leadership, coordination, and an enabling environment.

And Fequiere Vilsaint replied:

Subject: #2405: Creole Language Computer .... from Fequiere (fwd)
Date : 2/18/00 11:01:26 

>There is a Haitian Creole Spell Checker in the market 
>at least for 7 years. It is updated every time MS Word 
>or WP comes out with a new version. The MS Word version 
>works seamlessly with your MS Word word Processor, as 
>supplemental dictionary. Ditto for Corel Word Perfect. 
>For a quick description, visit: www.educavision.com and 
>click on Computer.

Jeff Allen now comments:

This is a bit lengthy, but it is packed with details.
I'm sending the message from my home e-mail address.

As has already been stated on the Haiti mailing list, a couple of Kreyol 
spell-checkers (and similar types of applications as I explain below) have 
already been created, but they (including one that I helped develop as the 
main linguist of the project) are not currently sufficient for the task of 
robust spell-checking. All of these projects and products have been important 
steps and stages in the process of developing better products, but there is 
still a lot of work to do to make a Kreyol spell-checker as reliable as one 
done by a Product Management team at MS Word.

Let me explain below.

Chris Hogan and I at Carnegie Mellon University created a spell-checker for 
Haitian Kreyol in 1997 and 1998 as part of our research on the development of 
several writing-, translation-, and speech-related tools. One of the 
by-products of our research included the prototype of an optical character 
recognition (OCR) output spell-checker. I say "by-product" because we did not 
intend to create a spell-checker at the start, but since all of the 
groundwork had been laid, it was possible to create one rapidly and to see 
the effects of the spell-checker on the implementation of our overall system.

We obtained a copy of the EducaVision spellchecker (catalog ref. # S101) in 
late 1996/ beginning 1997 and discovered that it was more or less a word 
glossary list adapted to the appropriate WordPerfect and MS Word dictionary 
formats. This is not to say that a glossary list is not important. It is in 
fact a fundamental and very important step in building a spell-checker. The 
starting point is obtaining a large database of electronic data. This is a 
very difficult step in working with electronically sparse-data languages of 
which Haitian Kreyol is one. Electronically sparse-data languages are those 
for which electronic data in the written language is either hard to find or 
hard to obtain, and considerably smaller in volume than for other 
international languages. I've been working on several sparse-data languages 
for a few years in order to create boot-strapping methods and strategies for 
quickly developing systems under sparse-data circumstances.

As the EducaVision spell-checker is most likely based on a large portion, if 
not all, of the texts available in the EducaVision archives, the EducaVision 
team has a very good data resource to start with. In many cases of minority 
language projects that I am familiar with (for example, EMILLE minority 
language project in Europe, BOAS project at New Mexico State University, 
DIPLOMAT at Carnegie Mellon, etc), some teams have to actually type in all of 
the data by hand from the start (with the risk of typing errors, of course). 
EducaVision had an advantage on this issue because all of their data 
was already electronic.

At Carnegie Mellon University, with a UNIX-based computer system and a huge 
database (1.2 million word Kreyol language database that I collected from 15 
independent sources, including 3,071 sentences that were donated to the 
research project by EducaVision), my team created a very similar "derivative 
research" baseline glossary in a couple of hours, and even made it available 
on the Internet as announced on the Haiti Mailing list in the past (mid 1998 
from what I recall).

Granted, as some Haiti Mailing Listers have previously pointed out, the CMU 
glossary list contained not only correctly spelled words, but also "noisy" 
material, including misspelled words. We openly and fully acknowledged this 
point up front by stating that our glossary was strictly a term extraction 
list of every word form (whether good or not) that occurred 2 or more times 
in the entire database. Our glossary reflected the "written" Kreyol language 
as-is with significant samples from many existing resources.
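The term extraction step described above (keep every word form that occurs 2 or more times) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the CMU code: the tokenizer pattern, the function name extract_glossary, and the three sample sentences are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical tokenizer: runs of letters (accented letters included),
# optionally joined by an apostrophe or a hyphen.
WORD_RE = re.compile(r"[^\W\d_]+(?:['-][^\W\d_]+)*")

def extract_glossary(texts, min_count=2):
    """Return every word form seen at least min_count times, with its count.

    No spelling judgment is made: misspelled forms that recur are kept,
    which is exactly why such a list is "noisy".
    """
    counts = Counter()
    for text in texts:
        counts.update(w.lower() for w in WORD_RE.findall(text))
    return {form: n for form, n in counts.items() if n >= min_count}

# Made-up sample sentences, for illustration only.
sample = ["Mwen renmen liv sa a.", "Liv sa a bèl anpil.", "Mwen renmen li."]
glossary = extract_glossary(sample)
for form in sorted(glossary):
    print(form, glossary[form])
```

Forms that occur only once in the sample (here "bèl", "anpil", "li") are dropped, while everything that recurs is kept whether it is well spelled or not.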

[Important note about that data: if anyone is using the CMU released 
electronic Kreyol word list that has been provided to the public, it is 
essential that you cite this contribution in your research papers and product 
development descriptions. The US Fair Use law does not cover redistribution 
efforts of data use. If you are using that Kreyol data that took two years of 
development time to produce in the framework of a government-funded research 
project for CMU, do make sure you are including the full bibliographic entry 
in your reference list as it was found at the website where you located the 
data. The use of the data is free, but free does not mean without citing the 
origin of the data. Use of that data in part, in whole, or reproducible 
derivatives, without giving proper reference, is infringement upon 
international copyright laws. I'm the invited speaker at the TDCNet2000 
Terminology conference this coming Monday with a talk entitled "Legal and 
ethical issues regarding the diffusion and distribution of language resource 
and terminology databases". My current job includes providing legal counsel 
on electronic database laws at national and international levels.]

I also conducted quite a bit of background research in 1998 on how the 
British National Corpus, Brown Corpus, and other large-scale corpus projects 
have compiled their materials and provided representative, balanced, 
cross-sectional samples. My co-authored paper at LREC98 with Chris Hogan on 
Lexical coverage demonstrated our analysis of how we developed automated 
methods for improving balance and coverage for monolingual and bilingual 
(parallel) corpora and terminology lists. Also, several of my conference 
talks in 1998 and 1999 included analyses of the Kreyol glossary that was 
created on-site at CMU.

I know that not everyone necessarily agrees with my hypotheses about why 
there is such a high level of lexical variation in written Kreyol (the 
disagreement with "why" does not bother me), because this is a set of 
sociolinguistic and psycholinguistic issues that have not yet been measured 
in reliable and quantifiable ways. The important fact that cannot be denied, 
however, is the end result: there is a very high level of lexical form 
variation occurring not only between texts (i.e. "intertextual lexical 
variation"), but also within texts by the same groups (i.e. "intratextual 
lexical variation"). My SPCL98, AMTA98, EtudesCreoles99 conference talks all 
provide clear evidence of this with frequency word counts per word form and 
on each set of texts analyzed.

One of the main points that I have often mentioned from these several years 
of work is that there is a great need for a spell-checking mechanism. 
However, there are a number of important points that I have honestly 
struggled with over the years and that we should be aware of:

1) Yes, written language shows a great need for standardization if it is to 
be used in coherent and consistent ways in combination with a universal 
Kreyol spell-checker, but

2) Will Kreyol standardization research and development work end up just 
sitting on another university library shelf or on a CD-ROM as another 
educational research prototype project?

3) Will the results of such standardization work ever reach the level of 
grassroots literacy projects and/or is it needed for such projects?

4) Who will benefit from this work (Haitian diaspora in the USA, Canada, 
France and elsewhere; high school and college students in Haiti; 
Haitian elite)?

5) Will there ever be Haitian government funding for such "language" work 
when there is still so much to do to just repair the roads in Port-au-Prince 
and to provide the basic necessities (food, water, clothing, shelter) for the 
Haitian people who for the most part are non-literate?

6) Who else will participate in a financial way in such "language tool" 
development work?

These are the kinds of questions I ask myself a lot because I have spent 
thousands of hours over the past 10 years working on them.

Now on to the issues of dictionary clean-up work, which is a very 
time-consuming task, as I will demonstrate below. Lexical glossary and 
dictionary clean-up work is a long, tedious task, but it is imperative that 
it be done, and done correctly, if any form (paperback, electronic) of a 
dictionary or glossary is to be used by a human being or by an electronic 

Let me give you some concrete examples.

* When I worked on the multilingual documentation project at Caterpillar Inc 
in 1995 and 1996, we had to first start with a technical-term dictionary 
clean-up task in English, before any work could be done on the word/term 
equivalents in the 11 other languages that we were translating documentation 
into. It took one person a period of one full year to read through a 1 
million word technical term glossary and reduce this glossary down to 50,000 
technical terms and common words. That was a project that had a lot of money 
and was conducting the terminology clean-up work on a multimillion dollar 
project for a definite return on investment for technical writing and 
translation purposes.

* I am currently in contact with several major translation and localization 
projects that all emphasize as well that terminological and dictionary 
compilation work is the core issue. I have been developing strategies to help 
them conduct this clean-up within the framework of the critical path for 
rapid, efficient and cost-effective results.

* On the Haitian Creole project at CMU, it took our team approximately 4-6 
person months of work to manually filter all of the unnecessary and noisy 
data (misspelled words, wrong words, etc) out of the glossary mentioned 
further above. The resulting cleaned dictionary was much more efficient for 
the translation system that we were developing.

As for adapting products to MS Word and other commercial software 
programs, this can be done in a very short period of time if one knows the 
relevant APIs and standards to be followed.

However, a robust and fully integrative spell-checker must go beyond the 
first step of being a reformatted glossary file. In order to create a top 
quality spell-checking application, it is necessary to ramp up the lexical 
coverage of the dictionary. Each time I hear of a new Creole dictionary that 
comes out with claims that it has 30, 40, even 50 thousand entries, I simply 
shake my head knowing that this is not true. This might refer to total 
entries, but not separate entries. From the 1.2 million word Kreyol database 
we developed at CMU based on 15 independent sources of Kreyol texts from 
different fields and domains, including some specialized areas, we only ended 
up with 33,564 different word types (that means differently spelled words) 
out of a total of 1,200,613 word tokens (that means the total number of 
individual words in the data, including repetitions). I write journal review 
articles on many Creole language dictionaries and know that no such reference 
can realistically claim more than a total of 10 thousand different word type 
entries. This is because of the manpower that is needed to do the cleanup 
work and to provide sample entries (for dictionaries), combined with the 
overall purpose of the dictionary, the domains represented, and the depth of 
work performed. And if it is a bilingual dictionary, then the equivalent 
entries need to be discounted because then it is considered as 
double-counting the same headwords. Simply adding the entries of one half of 
a dictionary to those in the second half, without discounting exact 
equivalents (where there are not cases of synonymy indicated), is cheating in 
a way from a lexical point of view for the sake of giving an impression of a 
higher number of entries than there really is.
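The distinction between word types and word tokens used above can be made concrete with a short Python sketch. The function name and the sample token list are hypothetical; the two large figures are the ones quoted above for the CMU database.

```python
def type_token_counts(tokens):
    """Return (number of distinct word types, total number of word tokens)."""
    tokens = list(tokens)
    return len(set(tokens)), len(tokens)

# Made-up token sequence, for illustration only.
tokens = "li te di li pa t di sa".split()
types, total = type_token_counts(tokens)
print(types, total)

# Figures quoted above for the CMU Kreyol database.
cmu_types, cmu_tokens = 33564, 1200613
print(f"{cmu_types / cmu_tokens:.1%} of the tokens are distinct types")
```

In the sample, 8 tokens yield 6 types because "li" and "di" each occur twice. With the CMU figures the ratio comes to roughly 2.8 percent, which gives a sense of how slowly new word types accumulate even in a database of over a million tokens.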

Lexical coverage is a key issue. Due to the lack of lexical coverage found in 
all texts and applications done for Haitian Kreyol previous to Jan 1998 when 
Chris and I presented our first paper on the results of our research (ALLEN, 
J. and C. HOGAN. Evaluating Haitian Creole orthographies from a 
non-literacy-based perspective. Presented at the Society for Pidgin and Creole 
Linguistics annual meeting, 10 January 1998, New York), we created an 
algorithm called TEXT-R that was specifically designed to increase lexical 
coverage in texts to be translated by our on-site native speaking translation 
team members. The very successful results of our research were published a 
few months later (ALLEN, J. and C. HOGAN. Expanding lexical coverage of 
parallel corpora for the Example-based Machine Translation approach. In 
proceedings of the 1st international Language Resources and Evaluation 
Conference (LREC98) presented in Granada, Spain, 28-30 May 1998). From that 
research, we showed how it was possible to ramp up and thus significantly 
increase lexical coverage in dictionary and translation compilation tasks. In 
a period of 5 months (Oct 97 - March 98), we were able to increase the 
lexical coverage of one of our translated databases by the same amount of 
(new) items as had been compiled during the previous 1.5 years of the 
project. We then applied this new program to our multiple translation 
projects and found that it was very effective for increasing the coverage of 
the dictionaries/glossaries, and making the translators' work easier, more 
productive, and less repetitive/boring.

Yet, terminology extraction and even lexical coverage are not enough. 
Mistakes in the words and other various issues indicate the need for 
additional tasks and tools. Chris H. and I then went on to develop a Kreyol 
accent mark re-insertion program for processing electronic Kreyol texts of 
Haiti Progres from which all accent marks had been removed for the 
electronically diffused version of the newspaper. We also created the 
prototype of an optical character recognition (OCR) output spell-checker to 
work on other types of image-based texts. There are several published 
articles and conference papers on this research.
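To give an idea of what accent mark re-insertion involves, here is a minimal Python sketch. It is not the program Chris and I wrote; it assumes one simple strategy (for each unaccented form, pick the accented form seen most often in a reference text), and the function names and the tiny word lists are hypothetical.

```python
import unicodedata
from collections import Counter, defaultdict

def strip_accents(word):
    """Remove combining marks, e.g. an e with grave accent becomes plain e."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def build_restoration_table(accented_words):
    """Map each accent-stripped form to its most frequent accented form."""
    candidates = defaultdict(Counter)
    for word in accented_words:
        candidates[strip_accents(word)][word] += 1
    return {bare: forms.most_common(1)[0][0]
            for bare, forms in candidates.items()}

def restore_accents(words, table):
    """Replace known unaccented forms; leave unknown words untouched."""
    return [table.get(w, w) for w in words]

# Made-up reference text and input, for illustration only.
reference = "li bèl anpil epi li pè chen".split()
table = build_restoration_table(reference)
restored = restore_accents("li bel anpil".split(), table)
print(" ".join(restored))
```

A lookup of this kind ignores context, so any unaccented form that corresponds to more than one valid accented word would always receive the more frequent one. Handling those cases properly is part of what makes the real task harder than it looks.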

From our in-depth experience in this type of work for Haitian Creole and 
several other languages (Croatian, Korean, Spanish, French, Arabic), our 
research confirmed the work of many other computational linguists who develop 
spell-checking applications. There are several crucial points to be 
implemented in the development of an excellent spell-checker. These kinds of 
discussions appear all the time on many of the relevant e-lists.

Even the 5-8 person years of work invested into the entire Haitian Creole 
project at CMU did not lead to a fully-functional and robust enough 
spell-checker. We learned to develop modifications for our own specific needs 
with regard to the translation and related systems that we were developing.

More recently, I have spent the last year and a half working on the project 
funding and product distribution end of the cycle and know that developing a 
marketable product is not a short and simple task. Putting into place a 
Kreyol spell-checker that 1) can work on multiple platforms and processors, 
2) can identify non-words and propose valid words, 3) can identify 
misspellings and propose their appropriate equivalents, 4) can identify words 
and make an automatic decision that is based on grammatical context, and use 
other strategies of "fuzzy matching", etc, well folks this is going to take a 
lot of time and resource investment. It simply cannot be created overnight. 
Who is going to pay for that?
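Point 2 in the list above (identify non-words and propose valid words) is the easiest part to sketch. The Python below is a bare-bones illustration that uses the standard library's difflib.get_close_matches, which ranks candidates by a string similarity ratio rather than by a true edit distance. The function name check_word, the cutoff value, and the five-word lexicon are hypothetical.

```python
import difflib

def check_word(word, lexicon, max_suggestions=3, cutoff=0.7):
    """Return [] if the word is in the lexicon, else the closest valid words."""
    candidate = word.lower()
    if candidate in lexicon:
        return []
    return difflib.get_close_matches(
        candidate, sorted(lexicon), n=max_suggestions, cutoff=cutoff
    )

# Made-up miniature lexicon, for illustration only.
lexicon = {"mwen", "renmen", "liv", "anpil", "bèl"}
suggestions = check_word("renmenn", lexicon)
print(suggestions)
```

Here the non-word "renmenn" is flagged and "renmen" is proposed. Nothing in this sketch addresses points 3 and 4: a misspelling that happens to be another valid word passes unnoticed, and no grammatical context is consulted. The quality of the suggestions also depends entirely on the coverage and cleanliness of the lexicon.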

I know who provides the funding for the development of spell-checking modules 
of the major languages, and I know who wins the bids for such projects. Who 
will be the 1) paying and 2) development players for the next generation of 
Kreyol natural language processing (NLP) applications?

When new language processing tools appear, I investigate them very quickly, 
especially if they supposedly present novel ideas. I've seen a recent 
"attempt" or two for the development of Creole language processing tools, and 
they are unfortunately going to be mentioned in a couple of upcoming articles 
and presentations as poor-quality systems. Why? Because the developers 1) 
tried to do it too fast, 2) did not know the language, and/or 3) 
underestimated the linguistic work involved, etc. Forsaking quality for 
innovation is not the answer. Trying to develop tools too quickly often 
results in poor products, and this is bad publicity for the language 
technology field and for the language in question.

There are a lot of international standards, as well as inter-company tool 
processing standards, that are currently being established and defined for 
the upcoming generation of products. Underestimating the time, person, and 
cost investment to simply put an application on the market is not wise. It is 
important to count the costs and develop something correctly. There are 
hundreds of articles and manuals that show how this can be done in an 
appropriate and cost-effective way.

Basically, the next generation of language processing tools had better be 
more robust, or they will do more damage to the language business than do 
good to and for it.

Jeff Allen
(home): postediting@aol.com