29 March 2002
dictd - a dictionary database server
dictd is a server for the Dictionary Server Protocol (DICT), a TCP transaction based query/response protocol that allows a client to access dictionary definitions from a set of natural language dictionary databases.
For security reasons, dictd drops root permissions after startup. If user dictd exists on the system, the daemon will run as that user, group dictd; otherwise it will run as user nobody, group nobody or nogroup (depending on the operating system distribution).
Since startup time is significant, the server is designed to run continuously, and should not be run from inetd(8). (However, with a fast processor, it is feasible to do so.)
Databases are distributed separately from the server.
By default, dictd assumes that the index files are sorted alphabetically, and only alphanumeric characters from the 7-bit ASCII character set are used for search. This default may be overridden by a header in the data file. The only such features implemented at this time are the headers "00-database-allchars" which tells dictd that non-alphanumeric characters may also be used for search, the header "00-database-utf8" which indicates that the database uses utf8 encoding, and the "00-database-8bit-new" which indicates that the database is encoded and sorted according to a locale that uses an 8-bit encoding.
For many years, the Internet community has relied on the "webster" protocol for access to natural language definitions. The webster protocol supports access to a single dictionary and (optionally) to a single thesaurus. In recent years, the number of publicly available webster servers on the Internet has dramatically decreased.
Fortunately, several freely-distributable dictionaries and lexicons have recently become available on the Internet. However, these freely-distributable databases are not accessible via a uniform interface, and are not accessible from a single site. They are often small and incomplete individually, but would collectively provide an interesting and useful database of English words. Examples include the Jargon file, the WordNet database, MICRA’s version of the 1913 Webster’s Revised Unabridged Dictionary, and the Free Online Dictionary of Computing. (See the DICT protocol specification (RFC) for references.) Translating and non-English dictionaries are also becoming available (for example, the FOLDOC dictionary is being translated into Spanish).
The webster protocol is not suitable for providing access to a large number of separate dictionary databases, and extensions to the current webster protocol were not felt to be a clean solution to the dictionary database problem.
The DICT protocol is designed to provide access to multiple databases. Word definitions can be requested, the word index can be searched (using an easily extended set of algorithms), information about the server can be provided (e.g., which index search strategies are supported, or which databases are available), and information about a database can be provided (e.g., copyright, citation, or distribution information). Further, the DICT protocol has hooks that can be used to restrict access to some or all of the databases.
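As a rough illustration of the query/response exchange described above, the following Python sketch formats DEFINE and MATCH command lines and splits a response status line into its numeric code and text. The command names are taken from the DICT protocol specification; the helper names (quote_word, define_command, match_command, parse_status) are hypothetical and are not part of dictd or of any client library.

```python
def quote_word(word):
    """Wrap a word in double quotes, escaping backslashes and quotes."""
    escaped = word.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def define_command(database, word):
    """Build a DEFINE command line ("*" asks for all databases)."""
    return "DEFINE {} {}\r\n".format(database, quote_word(word))


def match_command(database, strategy, word):
    """Build a MATCH command line for one search strategy."""
    return "MATCH {} {} {}\r\n".format(database, strategy, quote_word(word))


def parse_status(line):
    """Split a response status line into (numeric code, remaining text)."""
    code, _, text = line.rstrip("\r\n").partition(" ")
    return int(code), text
```

A client would write these lines to a TCP connection to the server and read status lines back; the networking part is omitted here to keep the sketch self-contained.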
dictd(8) is a server that implements the DICT protocol. Bret Martin implemented another server, and several people (including Bret and myself) have implemented clients in a variety of languages.
|-V or --version|
|Display version information.|
|Display copyright and license information.|
|-h or --help|
|Display help information.|
|-v or --verbose or -dverbose|
|-c file or --config file|
|Specify configuration file. The default is /etc/dictd.conf, but may be changed in the defs.h file at compile time (DICTD_CONFIG_FILE).|
|-p port or --port port|
|Overrides the keyword port in Global Settings Specification section of configuration file.|
|-i or --inetd|
|Communicate on standard input/output, suitable for use from inetd. Although, due to its rather large startup time, this daemon was not intended to run from inetd, with a fast processor it is feasible to do so. This option also implies --fast-start.|
|Sets a preprocessor for the configuration file, such as m4 or cpp. See the examples/dictd_complex.conf file from the distribution. By default, the configuration file is parsed without a preprocessor.|
|Overrides the keyword depth in Global Settings Specification section of configuration file.|
|Overrides the keyword delay in Global Settings Specification section of configuration file.|
|The same as syslog_facility keyword in Global Settings Specification of configuration files.|
|-f or --force|
|Force the daemon to start even if an instance of the daemon is already running. (This is of little value unless a non-default port is specified with -p, since, if one instance is bound to a port, the second one fails when it can not bind to the port.)|
|Overrides the keyword limit in Global Settings Specification section of configuration file.|
|Overrides the keyword listen_to in Global Settings Specification section of configuration file.|
|Overrides the keyword locale in Global Settings Specification section of configuration file.|
|-s||The same as syslog keyword in Global Settings Specification of configuration files.|
|-L file or --logfile file|
|The same as log_file keyword in Global Settings Specification of configuration files.|
|The same as pid_file keyword in Global Settings Specification of configuration files.|
|-m minutes or --mark minutes|
|Overrides the keyword timestamp in Global Settings Specification section of configuration file.|
|Overrides the keyword default_strategy in Global Settings Specification section of configuration file.|
|The same as without_strategy keyword in Global Settings Specification of configuration files.|
|The same as add_strategy keyword in Global Settings Specification of configuration files.|
|The same as fast_start keyword in Global Settings Specification of configuration files.|
|The same as without_mmap keyword in Global Settings Specification of configuration files.|
|When applied with --inetd, each command obtained from stdin is output to stdout. This option is useful for debugging.|
|-l option or --log option|
|The same as log_option keyword in Global Settings Specification of configuration files.|
|The same as debug_option keyword in Global Settings Specification of configuration files.|
The configuration file defaults to /etc/dictd.conf but can be specified on the command line with the -c option (see above).
The configuration file is read into memory at startup, and is not referenced again by dictd unless a signal 1 (SIGHUP) is received, which will cause dictd to reread the configuration file.
The file is divided into sections. The Access Section should come first, followed by the Database Section, and the User Section. The Database Section is required; the others are optional, but they must be in the order listed here.
|Syntax||The following keywords are valid in a configuration file: access, allow, deny, group, database, data, index, filter, prefilter, postfilter, name, include, user, authonly, site. Keywords are case sensitive. String arguments that contain spaces should be surrounded by double quotes. Without quoting, strings may contain alphanumeric characters and _, -, ., and *, but not spaces. Strings can be continued between lines. \", \\, \n, \<NL> are treated as double quote, backslash, new line and no symbol respectively. Comments start with # and extend to the end of the line.|
|Global Settings Section|
|Global Settings Specification|
|This section describes the following parameters:
|The database specification describes the database:
|Virtual Database Specification|
|The virtual database specification describes the virtual database:
|The text of the file "string" (usually a database specification) will be read as if it appeared at this location in the configuration file. Nested includes are not permitted.|
DETERMINATION OF ACCESS LEVEL
When a client connects, the global access specification is scanned, in order, until a specification matches. If no access specification exists, all access is allowed (e.g., the action is the same as if "allow *" was the only item in the specification). For each item, both the hostname and IP are checked. For example, consider the following access specification:
    allow 10.42.*
    authonly *.edu
    deny *
With this specification, all clients in the 10.42 network will be allowed access to unrestricted databases; all clients from *.edu sites will be allowed to authenticate, but will be denied access to all databases, even those which are otherwise unrestricted; and all other clients will have their connection terminated immediately. The 10.42 network clients can send an AUTH command and gain access to restricted databases. The *.edu clients must send an AUTH command to gain access to any databases, restricted or unrestricted.
When the AUTH command is sent, the access list for each database is scanned, in order, just as the global access list is scanned. However, after authentication, the client has an associated username. For example, consider the following access specification:
    user u1
    deny *.com
    user u2
    allow *
If the client authenticated as u1, then the client will have access to this database, even if the client comes from a *.com site. In contrast, if the client authenticated as u2, the client will only have access if it does not come from a *.com site. In this case, the "user u2" is redundant, since that client would also match "allow *".
Warning: Checks are performed for domain names and for IP addresses. However, if reverse DNS for a specific site is not working, it is possible that a domain name may not be available for checking. Make sure that all denials use IP addresses. (And consider a future enhancement: if a domain name is not available, should denials that depend on a domain name match anything? This is the more conservative viewpoint, but it is not currently implemented.)
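The first-match scan described in this section can be sketched in Python as follows. This is only an illustration of the idea, not dictd's implementation: the function name access_level, the rule representation, and the use of shell-style wildcard matching from fnmatch are all assumptions, as is the choice to refuse a client that matches no rule. The example rule list mirrors the behaviour described above for the 10.42 network, *.edu sites, and all other clients.

```python
from fnmatch import fnmatchcase

# Hypothetical rule list: (action, pattern) pairs, scanned in order.
EXAMPLE_RULES = [
    ("allow", "10.42.*"),
    ("authonly", "*.edu"),
    ("deny", "*"),
]


def access_level(rules, hostname, ip):
    """Return the action of the first rule matching the hostname or IP.

    hostname may be None when reverse DNS gives no name for the client.
    An empty rule list behaves like a single "allow *" rule.
    """
    if not rules:
        return "allow"
    candidates = [ip]
    if hostname:
        candidates.append(hostname.lower())
    for action, pattern in rules:
        if any(fnmatchcase(c, pattern.lower()) for c in candidates):
            return action
    # Assumption: a client that matches no rule is refused.
    return "deny"
```

Note how a client with no resolvable hostname can only ever match patterns written against its IP address, which is why the warning above recommends that denials use IP addresses.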
The DICT standard specifies a few search algorithms that must be implemented, and permits others to be supported on a server-dependent basis. The following search strategies are supported by this server. Note that all strategies are case insensitive. Most ignore non-alphanumeric, non-whitespace characters.
|exact||An exact match. This algorithm uses a binary search and is one of the fastest search algorithms available.|
|lev||The Levenshtein algorithm (string edit distance of one). This algorithm searches for all words which are within an edit distance of one from the target word. An "edit" means an insertion, deletion, or transposition. This is a rapid algorithm for correcting spelling errors, since many spelling errors are within a Levenshtein distance of one from the original word.|
|prefix||Prefix match. This algorithm also uses a binary search and is very fast.|
|nprefix||Like prefix, but returns only the specified range of matches. For example, if the prefix strategy would return 1000 matches, you can retrieve just 100 of them, skipping the first 800. The limits are given in the query itself, as in 800#100#app, where 800 is the number of matches to skip, 100 is the number of matches to return, and "app" is the word to search for. This strategy makes it possible to implement a DICT client with fast autocompletion (although doing so is not trivial), as many standalone dictionary programs do.
NOTE: If you access the dictionary "*" (or virtual one) with nprefix strategy, the same range is set for each database in it, but globally for all matches found in all databases.
NOTE: If you access a non-English dictionary, the returned matches may be (and mostly will be) NOT in alphabetical order.
|re||POSIX 1003.2 (modern) regular expression search. Modern regular expressions are the ones used by egrep(1). These regular expressions allow predefined character classes (e.g., [[:alnum:]], [[:alpha:]], [[:digit:]], and [[:xdigit:]] are useful for this application); use * to match a sequence of 0 or more matches of the previous atom; use + to match a sequence of 1 or more matches of the previous atom; use ? to match a sequence of 0 or 1 matches of the previous atom; use ^ to match the beginning of a word and $ to match the end of a word; and allow nested subexpressions and alternation with () and |. For example, "(foo|bar)" matches all words that contain either "foo" or "bar". To match these special characters, they must be quoted with two backslashes (due to the quoting characteristics of the server). Warning: Regular expression matches can take 10 to 300 times longer than substring matches. On a busy server, with many databases, this can require more than 5 minutes of waiting time, depending on the complexity of the regular expression.|
|regexp||Old (basic) regular expressions. These regular expressions don’t support |, +, or ?. Groups use escaped parentheses. While modern regular expressions are generally easier to use, basic regular expressions have a back reference feature. This can be used to match a second occurrence of something that was already matched. For example, the following expression finds all words that begin and end with the same three letters:
Note the use of the double backslashes to escape the special characters. This is required by the DICT protocol string specification (a single backslash quotes the next character -- we use two to get a single backslash through to the regular expression engine). Warning: Note that the use of backtracking is even slower than the use of general regular expressions.
|soundex||The Soundex algorithm, a classic algorithm for finding words that sound similar to each other. The algorithm encodes each word using the first letter of the word and up to three digits. Since the first letter is known, this search is relatively fast, and it is sometimes good for correcting spelling errors when the Levenshtein algorithm doesn’t help.|
|substring||Match a substring anywhere in the headword. This search strategy uses a modified Boyer-Moore-Horspool algorithm. Since it must search the whole index file, it is not as fast as the exact and prefix matches.|
|suffix||Suffix match. This search strategy also uses a modified Boyer-Moore-Horspool algorithm, and is as fast as the substring search. If the optional index_suffix string file is listed in the configuration file this search is much faster.|
|word||Match any single word, even if part of a multi-word entry. If the optional index_word string file is listed in the configuration file this search strategy works much faster.|
|first||Match the first word that begins a multi-word entry.|
|last||Match the last word that ends a multi-word entry. If the optional index_suffix string file is listed in the configuration file this search strategy works much faster.|
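The skip#count#word query form described for the nprefix strategy can be illustrated with a short Python sketch. It is a simplification under stated assumptions: the index is modelled as a plain sorted list of lowercase headwords, both numbers are assumed to be present in the query, and the function names are hypothetical. dictd's own comparison and ordering rules are not reproduced here.

```python
from bisect import bisect_left


def prefix_matches(index, prefix):
    """Return all headwords in the sorted list index that start with prefix."""
    # Binary search finds the first candidate; matches are contiguous after it.
    start = bisect_left(index, prefix)
    end = start
    while end < len(index) and index[end].startswith(prefix):
        end += 1
    return index[start:end]


def nprefix_matches(index, query):
    """Handle a query of the form skip#count#word, e.g. "800#100#app"."""
    skip, count, word = query.split("#", 2)
    matches = prefix_matches(index, word)
    return matches[int(skip):int(skip) + int(count)]
```

For example, with the index ["app", "apple", "application", "apply", "apt", "banana"], the query "1#2#app" skips "app" and returns the next two matches, "apple" and "application".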
Databases for dictd are distributed separately. A database consists of two files. One is a flat text file, the other is the index.
The flat text file contains dictionary entries (or any other suitable data), and the index contains tab-delimited tuples consisting of the headword, the byte offset at which this entry begins in the flat text file, and the length of the entry in bytes. The offset and length are encoded using base 64 encoding using the 64-character subset of International Alphabet IA5 discussed in RFC 1421 (printable encoding) and RFC 1522 (base64 MIME). Encoding the offsets in base 64 saves considerable space when compared with the usual base 10 encoding, while still permitting tab characters (ASCII 9) to be used for delimiting fields in a record. Each record ends with a newline (ASCII 10), so the index file is human readable.
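The index record layout described above can be decoded with a few lines of Python. This is a sketch rather than dictd's own code: it assumes that each number is written as a sequence of base-64 digits with the most significant digit first and without padding, which is not spelled out in the text above, and the function names are hypothetical.

```python
# The 64-character alphabet used for base 64 encoding, in digit-value order.
B64_DIGITS = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
              "abcdefghijklmnopqrstuvwxyz"
              "0123456789+/")


def b64_to_int(text):
    """Decode a number written as base-64 digits, most significant first."""
    value = 0
    for ch in text:
        value = value * 64 + B64_DIGITS.index(ch)
    return value


def parse_index_line(line):
    """Split one tab-delimited index record into (headword, offset, length)."""
    fields = line.rstrip("\n").split("\t")
    headword, offset, length = fields[:3]
    return headword, b64_to_int(offset), b64_to_int(length)
```

Under these assumptions, the record "apple<TAB>BA<TAB>Dk" describes an entry for "apple" that starts at byte 64 of the flat text file and is 228 bytes long.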
Some headwords have a special meaning for dictd:
00-database-info Contains the information about the database that is returned by the SHOW INFO command, unless it is specified in the configuration file.
00-database-short Contains the short name of the database that is returned by the SHOW DB command, unless it is specified in the configuration file. See dictfmt -s.
00-database-url URL from which the original dictionary sources were obtained. See dictfmt -u. This headword is not used by dictd.
00-database-utf8 Present if the dictionary is encoded using UTF-8. See dictfmt --utf8.
00-database-8bit-new Present if the dictionary is encoded using an 8-bit character set (neither ASCII nor UTF-8). See dictfmt --locale.
The flat text file may be compressed using gzip(1) (not recommended) or dictzip(1) (highly recommended). Optimal speed will be obtained using an uncompressed file. However, the gzip compression algorithm works very well on plain text, and can result in space savings typically between 60 and 80%. Using a file compressed with gzip(1) is not recommended, however, because random access on the file can only be accomplished by serially decompressing the whole file, a process which is prohibitively slow. dictzip(1) uses the same compression algorithm and file format as does gzip(1), but provides a table that can be used to randomly access compressed blocks in the file. The use of 50-64kB blocks for compression typically degrades compression by less than 10%, while maintaining acceptable random access capabilities for all data in the file. As an added benefit, files compressed with dictzip(1) can be decompressed with gzip(1) or zcat(1). (Note: recompressing a dictzip’d file using, for example, znew(1) will destroy the random access characteristics of the file. Always compress data files using dictzip(1).)
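The idea behind block-wise compression with a lookup table can be sketched in Python using zlib. This sketch does not read or write the dictzip file format; it only shows why compressing fixed-size chunks independently allows a byte range to be read without decompressing the whole file. The function names and the default chunk size of 60000 bytes are hypothetical.

```python
import zlib


def compress_chunks(data, chunk_size=60000):
    """Compress data in independent fixed-size chunks; return them as a list."""
    return [zlib.compress(data[i:i + chunk_size])
            for i in range(0, len(data), chunk_size)]


def read_range(chunks, offset, length, chunk_size=60000):
    """Return data[offset:offset + length], decompressing only needed chunks."""
    if length <= 0:
        return b""
    first = offset // chunk_size                 # chunk holding the first byte
    last = (offset + length - 1) // chunk_size   # chunk holding the last byte
    block = b"".join(zlib.decompress(c) for c in chunks[first:last + 1])
    start = offset - first * chunk_size          # position inside the block
    return block[start:start + length]
```

A dictionary entry located through the index (offset and length) usually falls inside one or two chunks, so only a small part of the file has to be decompressed per lookup.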
SIGHUP causes dictd to reread the configuration file and reinitialize the databases.
SIGUSR1 causes dictd to unload the databases. dictd then returns a 420 status (instead of 220). To load the databases again, send the SIGHUP signal. Because database files are mapped with mmap(2), it is impossible to update them while dictd is running. So, if you need to update database files and reread the configuration file, first send the SIGUSR1 signal to dictd to unload the databases, then update the files, and finally send the SIGHUP signal to load them again.
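The update procedure above could be scripted along the following lines. This is a minimal sketch for a Unix-like system: the function name update_databases and its parameters are hypothetical, the caller is expected to supply the dictd process ID and a function that installs the new files, and no attempt is made to wait until dictd has actually finished unloading.

```python
import os
import signal


def update_databases(pid, replace_files, send=os.kill):
    """Unload the databases, replace the files, then reload them.

    pid is the dictd process ID, replace_files is a caller-supplied
    function that installs the new files, and send defaults to os.kill.
    """
    send(pid, signal.SIGUSR1)   # ask dictd to unload its databases
    replace_files()             # update the database files
    send(pid, signal.SIGHUP)    # reread the configuration, reload databases
```

Since signals are delivered asynchronously, a real script would probably pause or check the server's status between the first signal and the file update.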
The main source files for the dictd server and the dictzip compression program were written by Rik Faith (faith) and are distributed under the terms of the GNU General Public License. If you need to distribute under other terms, write to the author.
The main libraries used by these programs (zlib, regex, libmaa) are distributed under different terms, so you may be able to use the libraries for applications which are incompatible with the GPL -- please see the copyright notices and license information that come with the libraries for more information, and consult with your attorney to resolve these issues.
The regular expression searches do not ignore non-whitespace, non-alphanumeric characters as do the other searches. In practice, this isn’t much of a problem.
Conformance of the regular expressions (used by the ’re’ and ’regexp’ search strategies) to ERE and BRE depends on the library dictd is built with. Whether the ’re’ and ’regexp’ strategies support UTF-8 also depends on that library.
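The back reference feature mentioned for the regexp strategy can be tried out in Python, whose re module also supports back references. The pattern below is written in Python's own syntax, with plain parentheses and a single backslash, so it is only an analogue of what would be sent to dictd, where groups use escaped parentheses and backslashes are doubled. The function name is hypothetical.

```python
import re

# \1 refers back to whatever the first parenthesised group matched, so this
# pattern accepts words whose first three characters recur at the very end.
SAME_ENDS = re.compile(r"^(...).*\1$")


def begins_and_ends_alike(words):
    """Return the words that begin and end with the same three letters."""
    return [w for w in words if SAME_ENDS.search(w)]
```

For example, "underground" and "hotshot" are accepted, while "apple" and "banana" are not.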
|dictd configuration file|
|dictd daemon itself|
|File for storing pid of dictd daemon|
|The default directory for dictd databases (.index and .dict[.dz] files)|