
FILE: dutch.words
VERSION: DEC-SRC-92-Apr-05

EDITOR

    Jorge Stolfi <stolfi@src.dec.com>
    DEC Systems Research Center
  
AUTHORS OF ORIGINAL WORDLISTS

    Henk Smit <henk@cs.vu.nl>
    Jan van Bakel, Nijmegen University, the Netherlands
    Martien Kuunders <m.m.l.kuunders@research.ptt.nl>
    Paul Stravers <stravers@donau.et.tudelft.nl>

DESCRIPTION

    The file dutch.words is a list of over 190,000 Dutch words
    and proper nouns, compiled from several public domain wordlists.

    The file has one word per line, and is sorted with sort(1) in
    plain ASCII collating sequence.

    The file is supposed to include verb forms, declined nouns, words
    derived by standard prefixes and suffixes, and compound words.
    However, the list appears to be highly incomplete and
    inconsistent.

    All words are encoded using only the lowercase letters, hyphen and
    apostrophe.  There are no accents, and proper names are NOT
    capitalized.
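
    The format rules above (one word per line, plain ASCII sort order,
    only lowercase letters, hyphen and apostrophe) can be checked
    mechanically.  The following is a minimal Python sketch, not part
    of the original package; the name check_wordlist is hypothetical,
    and it assumes that "lowercase letters" means a-z only.

```python
import re

# Assumption: a valid word uses only a-z, hyphen, and apostrophe.
WORD_RE = re.compile(r"[a-z'-]+")

def check_wordlist(lines):
    """Return (bad_words, is_sorted) for an iterable of text lines."""
    words = [line.rstrip("\n") for line in lines]
    # Words containing capitals, accents, digits, or that are empty.
    bad = [w for w in words if not WORD_RE.fullmatch(w)]
    # Python compares ASCII strings by code point, which is the
    # plain ASCII collating sequence.
    is_sorted = all(a <= b for a, b in zip(words, words[1:]))
    return bad, is_sorted
```

    For example, check_wordlist(open("dutch.words")) would return the
    offending words, if any, and whether the file is in order.  Note
    that hyphen and apostrophe sort before all letters in ASCII.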

AUXILIARY LISTS

    In the same directory as dutch.words
    you will find the following files:

    dutch.trash

        A list of 1910 words from the original wordlists that
        I decided were either wrong or unsuitable for inclusion
        in the file dutch.words.  The list includes a few typos,
        a couple dozen words with accents and other funny characters,
        some abbreviations and acronyms, and many foreign words
        (English, French, Spanish, and Italian).

    dutch.maybe

        A list of 43059 additional words from the file "words.dutch.Z"
        obtained from T. U. Delft [2].

        These words should be checked and merged into the
        other two lists (dutch.words or dutch.trash, as appropriate).
        Unfortunately I cannot do it (I don't know a word
        of Dutch).  Use at your own risk...
        
ORIGINAL LISTS 

    The files dutch.* were compiled from the following original wordlists,
    which I obtained by anonymous FTP on 92-Feb-10.

    [1] file: words.dutch.Z
        size: 779056 bytes (1998881 bytes uncompressed)
        contact: Henk Smit <henk@cs.vu.nl>
        from: relay.cs.toronto.edu: /pub/doc/Dictionaries

           * From the README file at ftp.cs.vu.nl:

                  This list is made out of some smaller lists, 
                    het Groene Boekje (available at donau.et.tudelft.nl)
                    TeX dutch wordlist (available at archive.cs.ruu.nl)
                    local additions at de Vrije Universiteit (cs.vu.nl)

    [2] from: donau.et.tudelft.nl: /pub/words/
        file: words.dutch
        size: 2391803 bytes
        authors: Jan van Bakel, Nijmegen University, the Netherlands
                 Paul Stravers <stravers@donau.et.tudelft.nl>

           * From the README file at svin01.win.tue.nl:

                  words.dutch.Z: 
                  ftp.cs.vu.nl:/pub/dictionairies/words.dutch.Z
                    Is an extension of platte_lijst (Groene boekje)
                    with some 36000 words.
      
    [3] file: WoordenLijst.Z
        size: 195513 bytes (458909 bytes uncompressed)
        contact: M Kuunders <M.M.L.Kuunders@research.ptt.nl>
        from: svin01.win.tue.nl: /pub/textproc/dictionaries/dutch

    [4] file: words.dutch.Z
        size: 777860 bytes (1997262 bytes uncompressed)
        from: phloem.uoregon.edu: /pub/src/security/dictionaries

    [5] file: words.dutch.Z
        size: 664311 bytes (1675313 bytes uncompressed)
        from: ftp.hawaii.edu: /pub/editors/LEXICAL/word-lists

    [6] file: platte_lijst.Z
        size: 639758 bytes (1607623 bytes uncompressed)
        from: svin01.win.tue.nl: /pub/textproc/dictionaries/dutch

           * From the README file in svin01.win.tue.nl:

                  platte_lijst.Z: 
                  donau.et.tudelft.nl:/pub/woordenlijst/platte_lijst
                    From the Groene boekje.
        
    COMMENTS: The three significant lists are toronto:words.dutch [1],
    tudelft:words.dutch [2], and tue:WoordenLijst [3].  Any
    two of them differ by several thousand words, in each direction.
    For instance, [2] lacks 21,226 words from [1] but includes 47,950
    words that are not in [1].

    The list uoregon:words.dutch [4] is the same as
    toronto:words.dutch, minus 120 words.

    The lists hawaii:words.dutch [5] and tue:platte_lijst [6] are
    practically subsets of both toronto:words.dutch and
    tudelft:words.dutch; [5] lacks 8603 words of their
    intersection, whereas [6] lacks 15046 words.
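
    Comparisons like the ones above amount to set differences.  The
    sketch below is only an illustration in Python (the function name
    compare_wordlists is hypothetical, and this is not necessarily how
    the counts in this file were obtained).

```python
def compare_wordlists(a, b):
    """Return counts (only_in_a, only_in_b, in_both) for two word collections."""
    set_a, set_b = set(a), set(b)
    return len(set_a - set_b), len(set_b - set_a), len(set_a & set_b)

# Toy data, not taken from the real lists.
print(compare_wordlists(["aap", "noot", "mies"],
                        ["noot", "mies", "wim", "zus"]))  # prints (1, 2, 2)

# Consistency check on the figures quoted for [1] and [2], using their
# lowercase word counts from the table in this file (178429 and 205153):
# both subtractions must give the size of the intersection.
assert 178429 - 21226 == 205153 - 47950 == 157203
```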

COMPILATION PROCESS    

    The file dutch.words is basically the union of the files 
    "WoordenLijst" [3] and "words.dutch" [4]. However, I deleted
    some 1200 words that appeared to be typos, foreign words, and
    computer slang.  (I tried to be very conservative, but since I
    don't know the language, it is almost certain that I removed a few
    legitimate words. Sorry...)
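
    In outline, the compilation is a union of the source lists minus a
    set of rejected words, sorted in ASCII order.  The Python sketch
    below illustrates that idea only; it is not the procedure actually
    used, and the name build_wordlist is hypothetical.

```python
def build_wordlist(sources, rejected):
    """Union the source lists, drop rejected words, sort in ASCII order.

    sources  -- iterable of iterables of words (or text lines)
    rejected -- iterable of words (or text lines) to leave out
    """
    merged = set()
    for words in sources:
        merged.update(w.strip() for w in words)
    merged.discard("")  # ignore blank lines
    drop = {w.strip() for w in rejected}
    # sorted() orders ASCII strings by code point, like plain ASCII sort.
    return sorted(merged - drop)
```

    With real files, each source could be an open file object, since
    the function strips the trailing newline from every line.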

    The table below gives the number of lowercase words in each
    original list ("lcase"), and how many of those words were included
    ("accept") and not included ("reject") in the final file
    dutch.words:

        ref  site: file                lcase   accept  reject
        ---  ----------------------  -------  -------  ------
        [1]  toronto: words.dutch     178429   177662     767
        [2]  tudelft: words.dutch     205153   160548   44605
        [3]  tue: WoordenLijst         43574    42978     596 
        [4]  uoregon: words.dutch     178309   177553     756 
        [5]  hawaii: words.dutch      148601   147878     723
        [6]  tue: platte_lijst        142283   141570     713 
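
    Each row of the table should satisfy accept + reject = lcase.  A
    short Python sketch of that check, with the figures copied from
    the table (the names ROWS and inconsistent_rows are hypothetical):

```python
# (ref, lcase, accept, reject), copied from the table above.
ROWS = [
    ("[1]", 178429, 177662, 767),
    ("[2]", 205153, 160548, 44605),
    ("[3]", 43574, 42978, 596),
    ("[4]", 178309, 177553, 756),
    ("[5]", 148601, 147878, 723),
    ("[6]", 142283, 141570, 713),
]

def inconsistent_rows(rows):
    """Return the refs of rows where accept + reject differs from lcase."""
    return [ref for ref, lcase, accept, reject in rows
            if accept + reject != lcase]

print(inconsistent_rows(ROWS))  # prints []
```

    The lcase counts of [1] and [4] also differ by exactly 120, which
    agrees with the remark that [4] is [1] minus 120 words.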

(NON-)COPYRIGHT STATUS

  To the best of my knowledge, all the files I used to build these
  wordlists were available for public distribution and use, at least
  for non-commercial purposes.  I have confirmed this assumption with
  the authors of the lists, whenever they were known.
  
  Therefore, it is safe to assume that the wordlists in this package
  can also be freely copied, distributed, modified, and used for
  personal, educational, and research purposes.  (Use of these files in
  commercial products may require written permission from DEC and/or
  the authors of the original lists.)
  
  Whenever you distribute any of these wordlists, please also
  distribute the accompanying README file.  If you distribute a modified
  copy of one of these wordlists, please include the original README
  file with a note explaining your modifications.  Your users will
  surely appreciate that.

(NO-)WARRANTY DISCLAIMER

  These files, like the original wordlists on which they are based,
  are still very incomplete, uneven, and inconsistent, and probably
  contain many errors.  They are offered "as is" without any warranty
  of correctness or fitness for any particular purpose.  Neither I nor
  my employer can be held responsible for any losses or damages that
  may result from their use.

