#2804: Creole Language Computer .... from Jeff Allen (fwd)
From: Postediting@aol.com
Regarding #2324 Creole Language Computer Software Development Update, Joel
Dreyfuss wrote:
>What would be very useful for those of us who are
>semi-literate in Kreyol is a Microsoft Word dictionary
>and spell-checker in the language. Any possibility
>that this could happen?
Marilyn Mason replied:
Subject: #2391: Creole Language Computer Software Development Update: from
Marilyn Mason
Date: 2/18/00 2:14:05
From: MariLinc@aol.com
>Yes, there is no technical reason why there could not
>be a user-friendly, fully functional software dictionary
>/spellchecker for Haitian Creole.
>But such an undertaking would require (not necessarily
>in their order of importance): time, personnel, money,
>leadership, coordination, and an enabling environment.
And Fequiere Vilsaint replied:
Subject: #2405: Creole Language Computer .... from Fequiere (fwd)
Date: 2/18/00 11:01:26
From: EDUCA@aol.com
>There is a Haitian Creole Spell Checker in the market
>at least for 7 years. It is updated every time MS Word
>or WP comes out with a new version. The MS Word version
>works seamlessly with your MS Word word Processor, as
>supplemental dictionary. Ditto for Corel Word Perfect.
>For a quick description, visit: www.educavision.com and
>click on Computer.
Jeff Allen now comments:
This is a bit lengthy, but it is packed with details.
I'm sending the message from my home e-mail address
(postediting@aol.com).
As has already been stated on the Haiti mailing list, a couple of Kreyol
spell-checkers (and similar types of applications, as I explain below) have
already been created, but they (including one that I helped develop as the
main linguist of the project) are not yet sufficient for the task of robust
spell-checking. All of these projects and products have been important steps
and stages in the process of developing better products, but there is still a
lot of work to do before a Kreyol spell-checker is as reliable as one built by
a product management team for MS Word.
Let me explain below.
Chris Hogan and I at Carnegie Mellon University created a spell-checker for
Haitian Kreyol in 1997 and 1998 as part of our research on the development of
several writing-, translation-, and speech-related tools. One of the
by-products of our research was the prototype of an optical character
recognition (OCR) output spell-checker. I say "by-product" because we did not
intend to create a spell-checker at the start, but since all of the
groundwork had been laid, it was possible to create one rapidly and to see
the effects of the spell-checker on the implementation of our overall system.
We obtained a copy of the EducaVision spellchecker (catalog ref. # S101) in
late 1996/early 1997 and discovered that it was more or less a word glossary
list adapted to the appropriate WordPerfect and MS Word dictionary formats.
This is not to say that a glossary list is not important. It is in fact a
fundamental and very important step in building a spell-checker. The starting
point is obtaining a large database of electronic data. This is a very
difficult step when working with electronically sparse-data languages, of
which Haitian Kreyol is one. Electronically sparse-data languages are those in
which electronic data of the written language is either hard to find or hard
to obtain, and considerably smaller in volume when compared with other
international languages. I've been working on several sparse-data languages
for a few years in order to create bootstrapping methods and strategies for
quickly developing systems under sparse-data circumstances.
As the EducaVision spell-checker is most likely based on a large portion, if
not all, of the texts available in the EducaVision archives, the EducaVision
team has a very good data resource to start with. In many cases of minority
language projects that I am familiar with (for example, the EMILLE minority
language project in Europe, the BOAS project at New Mexico State University,
DIPLOMAT at Carnegie Mellon, etc.), some teams have to actually type in all of
the data by hand from the start (with the risk of typing errors, of course).
EducaVision had an advantage on this issue because all of their data was
already electronic.
At Carnegie Mellon University, with a UNIX-based computer system and a huge
database (1.2 million word Kreyol language database that I collected from 15
independent sources, including 3,071 sentences that were donated to the
research project by EducaVision), my team created a very similar "derivative
research" baseline glossary in a couple of hours, and even made it available
on the Internet as announced on the Haiti Mailing list in the past (mid 1998
from what I recall).
Granted, as some Haiti Mailing Listers have previously pointed out, the CMU
glossary list contained not only correctly spelled words, but also "noisy"
material, including misspelled words. We openly and fully acknowledged this
point up front by stating that our glossary was strictly a term extraction
list of every word form (whether good or not) that occurred 2 or more times
in the entire database. Our glossary reflected the "written" Kreyol language
as-is with significant samples from many existing resources.
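For readers curious what such a raw term extraction step looks like in
practice, here is a minimal sketch (not the actual CMU code, whose details are
not given here) that pulls every word form occurring two or more times out of
a directory of plain-text files. The corpus path, the rough tokenization rule,
and the threshold of 2 are illustrative assumptions.

# Minimal sketch of frequency-based term extraction (illustrative only).
# Assumption: the corpus is a set of plain-text UTF-8 files in ./corpus/.
import glob
import re
from collections import Counter

def extract_word_forms(pattern="corpus/*.txt", min_count=2):
    counts = Counter()
    for path in glob.glob(pattern):
        with open(path, encoding="utf-8") as f:
            # Very rough tokenization: lowercased words, keeping
            # word-internal apostrophes and hyphens.
            counts.update(re.findall(r"\w+(?:['\-]\w+)*", f.read().lower()))
    # Keep every form seen at least min_count times, "noisy" forms included.
    return sorted(w for w, c in counts.items() if c >= min_count)

if __name__ == "__main__":
    for word in extract_word_forms():
        print(word)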
[Important note about that data: if anyone is using the CMU released
electronic Kreyol word list that has been provided to the public, it is
essential that you cite this contribution in your research papers and product
development descriptions. US fair use law does not cover redistribution of
the data. If you are using that Kreyol data, which took two years of
development time to produce in the framework of a government-funded research
project at CMU, do make sure you include the full bibliographic entry in your
reference list as it was found at the website where you located the data. The
use of the data is free, but free does not mean without citing the origin of
the data. Use of that data in part, in whole, or in reproducible derivatives,
without giving proper reference, is an infringement of international copyright
law. I'm the invited speaker at the TDCNet2000 Terminology conference this
coming Monday with a talk entitled "Legal and ethical issues regarding the
diffusion and distribution of language resource and terminology databases". My
current job includes providing legal counsel on electronic database laws at
national and international levels.]
I also conducted quite a bit of background research in 1998 on how the
British National Corpus, Brown Corpus, and other large-scale corpus projects
have compiled their materials and provided representative, balanced,
cross-sectional samples. My LREC98 paper on lexical coverage, co-authored
with Chris Hogan, presented our analysis of the automated methods we developed
for improving balance and coverage in monolingual and bilingual (parallel)
corpora and terminology lists. Also, several of my conference
talks in 1998 and 1999 included analyses of the Kreyol glossary that was
created on-site at CMU.
I know that not everyone necessarily agrees with my hypotheses about why
there is such a high level of lexical variation in written Kreyol (the
disagreement with "why" does not bother me), because this is a set of
sociolinguistic and psycholinguistic issues that have not yet been measured
in reliable and quantifiable ways. The important fact that cannot be denied,
however, is the end result: there is a very high level of lexical form
variation occurring not only between texts (i.e., "intertextual lexical
variation"), but also within texts produced by the same groups (i.e.,
"intratextual lexical variation"). My SPCL98, AMTA98, and EtudesCreoles99
conference talks all provide clear evidence of this, with frequency counts per
word form for each set of texts analyzed.
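To make the distinction concrete, here is a minimal sketch, with made-up file
names and a hypothetical pair of competing spellings, of the kind of
per-word-form, per-text frequency count that underlies those talks: competing
spellings counted within one text show intratextual variation, and differing
preferences across texts show intertextual variation.

# Illustrative sketch of counting competing spellings per text (intratextual)
# and across texts (intertextual). File names and the variant set are
# hypothetical assumptions.
import re
from collections import Counter

VARIANTS = {"lekol", "lekòl"}   # hypothetical competing spellings of one word

def form_counts(path):
    with open(path, encoding="utf-8") as f:
        tokens = re.findall(r"\w+(?:['\-]\w+)*", f.read().lower())
    return Counter(t for t in tokens if t in VARIANTS)

texts = ["textA.txt", "textB.txt"]        # hypothetical corpus files
per_text = {t: form_counts(t) for t in texts}

for t, counts in per_text.items():
    # Intratextual variation: more than one variant inside the same text.
    print(t, dict(counts), "intratextual variation:", len(counts) > 1)

# Intertextual variation: different texts prefer different variants.
preferred = {t: c.most_common(1)[0][0] for t, c in per_text.items() if c}
print("intertextual variation:", len(set(preferred.values())) > 1)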
One of the main points that I have often mentioned from these several years
of work is that there is a great need for a spell-checking mechanism.
However, there are a number of important points that I have honestly
struggled with over the years and that we should be aware of:
1) Yes, written language shows a great need for standardization if it is to
be used in coherent and consistent ways in combination with a universal
Kreyol spell-checker, but
2) Will Kreyol standardization research and development work end up just
sitting on another university library shelf or on a CD-ROM as another
educational research prototype project?
3) Will the results of such standardization work ever reach the level of
grassroots literacy projects and/or is it needed for such projects?
4) Who will benefit from this work (Haitian diaspora in the USA, Canada,
France and elsewhere; high school and college students in Haiti;
Haitian elite)?
5) Will there ever be Haitian government funding for such "language" work
when there is still so much to do to just repair the roads in Port-au-Prince
and to provide the basic necessities (food, water, clothing, shelter) for the
Haitian people who for the most part are non-literate?
6) Who else will participate in a financial way in such "language tool"
development work?
These are the kinds of questions I ask myself a lot because I have spent
thousands of hours over the past 10 years working on them.
Now on to the issue of dictionary clean-up work, which, as I will demonstrate
below, is very time-consuming. Lexical glossary and dictionary clean-up is a
long, tedious task, but it is imperative that
it be done, and done correctly, if any form (paperback, electronic) of a
dictionary or glossary is to be used by a human being or by an electronic
system.
Let me give you some concrete examples.
* When I worked on the multilingual documentation project at Caterpillar Inc
in 1995 and 1996, we first had to complete a technical-term dictionary
clean-up task in English before any work could be done on the word/term
equivalents in the 11 other languages into which we were translating
documentation. It took one person a full year to read through a 1 million word
technical term glossary and reduce it down to 50,000 technical terms and
common words. That project was well funded, and the terminology clean-up work
was conducted within a multimillion dollar effort for a definite return on
investment for technical writing and translation purposes.
* I am currently in contact with several major translation and localization
projects that all emphasize as well that terminological and dictionary
compilation work is the core issue. I have been developing strategies to help
them conduct this clean-up within the framework of the critical path for
rapid, efficient and cost-effective results.
* On the Haitian Creole project at CMU, it took our team approximately 4-6
person months of work to manually filter all of the unnecessary and noisy
data (misspelled words, wrong words, etc.) out of the glossary mentioned
above. The resulting cleaned dictionary was much more efficient for
the translation system that we were developing.
As for adapting products to MS Word and other commercial software programs,
this can be done in a very short period of time if one knows the relevant APIs
and standards to follow.
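To illustrate how small that reformatting step is once a clean word list
exists, here is a minimal sketch that writes a cleaned glossary out as a plain
word-per-line file, which is roughly the shape that supplemental/custom
dictionary files take in common word processors. The file names and encoding
are assumptions, and the exact format and extension vary by product and
version.

# Minimal sketch: turn a cleaned glossary into a word-per-line custom
# dictionary file. File names and encoding are assumptions; real products
# have their own format and encoding requirements.
def write_custom_dictionary(words, out_path="kreyol.dic", encoding="utf-8"):
    # One entry per line, sorted, no duplicates.
    unique = sorted(set(w.strip() for w in words if w.strip()))
    with open(out_path, "w", encoding=encoding) as f:
        for w in unique:
            f.write(w + "\n")

if __name__ == "__main__":
    with open("cleaned_glossary.txt", encoding="utf-8") as f:
        write_custom_dictionary(f)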
However, a robust and fully integrated spell-checker must go beyond the
first step of being a reformatted glossary file. In order to create a top
quality spell-checking application, it is necessary to ramp up the lexical
coverage of the dictionary. Each time I hear of a new Creole dictionary that
comes out with claims that it has 30, 40, even 50 thousand entries, I simply
shake my head knowing that this is not true. This might refer to total
entries, but not separate entries. From the 1.2 million word Kreyol database
we developed at CMU based on 15 independent sources of Kreyol texts from
different fields and domains, including some specialized areas, we only ended
up with 33,564 different word types (that means differently spelled words)
out of a total of 1,200,613 word tokens (that means the total number of
individual words in the data, including repetitions). I write journal review
articles on many Creole language dictionaries and know that no such reference
can realistically claim more than a total of 10 thousand different word token
entries. This is because of the manpower that is needed to do the cleanup
work and to provide sample entries (for dictionaries), combined with the
overall purpose of the dictionary, the domains represented, and the depth of
work performed. And if it is a bilingual dictionary, then the equivalent
entries need to be discounted, because otherwise the same headwords are
double-counted. Simply adding the entries of one half of a dictionary to those
of the second half, without discounting exact equivalents (where no cases of
synonymy are indicated), is cheating in a way from a lexical point of view,
for the sake of giving the impression of a higher number of entries than there
really are.
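For anyone who wants to check this kind of claim against their own data, here
is a minimal sketch of the word type vs. word token count described above; the
corpus path and the rough tokenization rule are assumptions, and the exact
figures will depend on the tokenization used.

# Minimal sketch of the word type vs. word token distinction.
# "Tokens" = every running word, repetitions included; "types" = distinct
# word forms. The corpus path and tokenization are assumptions.
import glob
import re
from collections import Counter

def type_token_counts(pattern="corpus/*.txt"):
    counts = Counter()
    for path in glob.glob(pattern):
        with open(path, encoding="utf-8") as f:
            counts.update(re.findall(r"\w+(?:['\-]\w+)*", f.read().lower()))
    tokens = sum(counts.values())          # total running words
    types = len(counts)                    # differently spelled word forms
    return types, tokens

if __name__ == "__main__":
    types, tokens = type_token_counts()
    print(f"{types} word types out of {tokens} word tokens")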
Lexical coverage is a key issue. Due to the lack of lexical coverage in all
texts and applications produced for Haitian Kreyol prior to January 1998, when
Chris and I presented our first paper on the results of our research (ALLEN,
J. and C. HOGAN. Evaluating Haitian Creole orthographies from a
non-literacy-based perspective. Presented at the Society for Pidgin and Creole
Linguistics annual meeting, 10 January 1998, New York), we created an
algorithm called TEXT-R that was specifically designed to increase lexical
coverage in texts to be translated by our on-site native-speaking translation
team members. The very successful results of our research were published a few
months later (ALLEN, J. and C. HOGAN. Expanding lexical coverage of parallel
corpora for the Example-based Machine Translation approach. In Proceedings of
the 1st International Language Resources and Evaluation Conference (LREC98),
Granada, Spain, 28-30 May 1998). From that
research, we showed how it was possible to ramp up and thus significantly
increase lexical coverage in dictionary and translation compilation tasks. In
a period of 5 months (Oct 97 - March 98), we were able to increase the
lexical coverage of one of our translated databases by the same number of
(new) items as had been compiled during the previous 1.5 years of the project.
We then applied this new program in our multiple translation
projects and found that it was very effective for increasing the coverage of
the dictionaries/glossaries, and making the translators' work easier, more
productive, and less repetitive/boring.
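TEXT-R itself is not described in detail here, but as a hedged illustration
of what "lexical coverage" means in this context, the following sketch
measures how much of a new text an existing glossary already covers and lists
the most frequent uncovered forms, which is the kind of gap that
coverage-expansion work targets. The file names and tokenization are
assumptions.

# Illustrative sketch of measuring lexical coverage of a text against a
# glossary (not the TEXT-R algorithm itself). File names are assumptions.
import re
from collections import Counter

def coverage(glossary_path="cleaned_glossary.txt", text_path="new_text.txt"):
    with open(glossary_path, encoding="utf-8") as f:
        glossary = {line.strip().lower() for line in f if line.strip()}
    with open(text_path, encoding="utf-8") as f:
        tokens = re.findall(r"\w+(?:['\-]\w+)*", f.read().lower())
    counts = Counter(tokens)
    covered_tokens = sum(c for w, c in counts.items() if w in glossary)
    covered_types = sum(1 for w in counts if w in glossary)
    print(f"token coverage: {covered_tokens / len(tokens):.1%}")
    print(f"type coverage:  {covered_types / len(counts):.1%}")
    # The most frequent uncovered forms are the best candidates to add next.
    return [(w, c) for w, c in counts.most_common() if w not in glossary][:20]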
Yet, terminology extraction and even lexical coverage are not enough.
Mistakes in the words and various other issues indicate the need for
additional tasks and tools. Chris H. and I then went on to develop a Kreyol
accent mark re-insertion program for processing electronic Kreyol texts of
Haiti Progres from which all accent marks had been removed for the
electronically diffused version of the newspaper. We also created the
prototype of an optical character recognition (OCR) output spell-checker to
work on other types of image-based texts. There are several published
articles and conference papers on this research.
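The accent re-insertion program is not reproduced here, but one common
approach, sketched below purely as an assumption about how such a tool can
work, is to build a frequency table of accented forms from texts that still
carry their accent marks and then map each stripped form back to its most
frequent accented variant.

# Hedged sketch of accent-mark re-insertion: map each de-accented form back
# to its most frequent accented spelling, learned from accented training
# text. One common approach, not necessarily the method used at CMU.
import re
import unicodedata
from collections import Counter, defaultdict

def strip_accents(word):
    # Remove combining accent marks (e.g. "lekòl" -> "lekol").
    return "".join(c for c in unicodedata.normalize("NFD", word)
                   if unicodedata.category(c) != "Mn")

def build_table(accented_text):
    table = defaultdict(Counter)
    for w in re.findall(r"\w+(?:['\-]\w+)*", accented_text.lower()):
        table[strip_accents(w)][w] += 1
    # Keep only the most frequent accented variant for each stripped form.
    return {bare: forms.most_common(1)[0][0] for bare, forms in table.items()}

def reinsert_accents(plain_text, table):
    def fix(match):
        w = match.group(0)
        return table.get(strip_accents(w.lower()), w)
    return re.sub(r"\w+(?:['\-]\w+)*", fix, plain_text)

# Hypothetical example strings:
table = build_table("Timoun yo ale lekòl. Lekòl la fèmen.")
print(reinsert_accents("timoun yo ale lekol", table))   # -> timoun yo ale lekòl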
From our in-depth experience in this type of work for Haitian Creole and
several other languages (Croatian, Korean, Spanish, French, Arabic), our
research confirmed the work of many other computational linguists who develop
spell-checking applications. There are several crucial points to be
implemented in the development of an excellent spell-checker. These kinds of
discussions appear all the time on many of the relevant e-lists.
Even the 5-8 person-years of work invested in the entire Haitian Creole
project at CMU did not lead to a fully functional and sufficiently robust
spell-checker. We learned to develop modifications for our own specific needs
with regard to the translation and related systems that we were developing.
More recently, I have spent the last year and a half working on the project
funding and product distribution end of the cycle and know that developing a
marketable product is not a short and simple task. Putting into place a
Kreyol spell-checker that 1) can work on multiple platforms and processors,
2) can identify non-words and propose valid words, 3) can identify
misspellings and propose their appropriate equivalents, 4) can identify words
and make an automatic decision based on grammatical context, and can use
other strategies of "fuzzy matching", etc.; well, folks, this is going to take
a lot of time and resource investment. It simply cannot be created overnight.
Who is going to pay for that?
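As a rough illustration of what points 2) and 3) involve at the simplest
level, here is a minimal sketch of non-word detection with edit-distance
suggestions. The tiny dictionary is a hypothetical sample standing in for a
cleaned glossary; a production spell-checker adds frequency weighting,
grammatical context, and far better candidate generation.

# Minimal sketch of non-word detection plus edit-distance suggestions.
# The dictionary is a tiny hypothetical sample; a real one would be the
# cleaned glossary. Real spell-checkers also use frequency and context.
def edit_distance(a, b):
    # Standard Levenshtein distance via dynamic programming.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def suggest(word, dictionary, max_distance=2, max_suggestions=5):
    if word in dictionary:
        return []                                       # not a non-word
    scored = sorted((edit_distance(word, w), w) for w in dictionary)
    return [w for d, w in scored if d <= max_distance][:max_suggestions]

dictionary = {"lekòl", "liv", "timoun", "lakay", "manje"}  # hypothetical sample
print(suggest("lekol", dictionary))   # -> ['lekòl']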
I know who provides the funding for the development of spell-checking modules
of the major languages, and I know who wins the bids for such projects. Who
will be the 1) paying and 2) development players for the next generation of
Kreyol natural language processing (NLP) applications?
When new language processing tools appear, I investigate them very quickly,
especially if they supposedly present novel ideas. I've seen a recent
"attempt" or two at developing Creole language processing tools, and
they are unfortunately going to be mentioned in a couple of upcoming articles
and presentations as poor-quality systems. Why? Because the developers 1)
tried to do it too fast, 2) did not know the language, and/or 3)
underestimated the linguistic work involved, etc. Forsaking quality for
innovation is not the answer. Trying to develop tools too quickly often
results in poor products, and this is bad publicity for the language
technology field and for the language in question.
There are a lot of international standards, as well as inter-company tool
processing standards, that are currently being established and defined for
the upcoming generation of products. Underestimating the time, personnel, and
cost investment needed simply to put an application on the market is not wise.
It is
important to count the costs and develop something correctly. There are
hundreds of articles and manuals that show how this can be done in an
appropriate and cost-effective way.
Basically, the next generation of language processing tools had better be
more robust, or they will do more damage to the language business than good.
Jeff Allen
(home): postediting@aol.com