From 0bdce0843a42d6e4957d4b37b0a0644fcdd765fc Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Michal=20=C4=8Ciha=C5=99?=
Date: Wed, 8 Nov 2017 13:46:27 +0100
Subject: New upstream version 0.5.2

---
 doc/DICTFILE_FORMAT | 352 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 doc/sdcv.1          | 104 ++++++++++++++++
 doc/uk/sdcv.1       |  84 +++++++++++++
 3 files changed, 540 insertions(+)
 create mode 100644 doc/DICTFILE_FORMAT
 create mode 100644 doc/sdcv.1
 create mode 100644 doc/uk/sdcv.1

(limited to 'doc')

diff --git a/doc/DICTFILE_FORMAT b/doc/DICTFILE_FORMAT
new file mode 100644
index 0000000..d1b1d9d
--- /dev/null
+++ b/doc/DICTFILE_FORMAT
@@ -0,0 +1,352 @@
+Format for StarDict dictionary files
+------------------------------------
+
+StarDict homepage: http://stardict.sourceforge.net
+
+{0}. Number and Byte-order Conventions
+When you record the numbers that identify sizes, offsets, etc., you
+should use 32-bit numbers, such as you might represent with a guint32.
+
+In order to make StarDict work on different platforms, these numbers
+must be in network byte order. You can ensure the correct byte order
+by using the g_htonl() function when creating dictionary files.
+Conversely, you should use g_ntohl() when reading dictionary files.
+
+Strings should be encoded in UTF-8.
+
+
+{1}. Files
+Every dictionary consists of three files:
+(1). somedict.ifo
+(2). somedict.idx or somedict.idx.gz
+(3). somedict.dict or somedict.dict.dz
+
+You can use gzip -9 to compress the .idx file. An uncompressed .idx file
+loads quickly and uses less memory at run time; a compressed one is loaded
+entirely into memory, which makes querying fast.
+
+You can use dictzip to compress the .dict file.
+"dictzip" uses the same compression algorithm and file format as does gzip,
+but provides a table that can be used to randomly access compressed blocks
+in the file.
+The use of 50-64kB blocks for compression typically degrades
+compression by less than 10%, while maintaining acceptable random access
+capabilities for all data in the file. As an added benefit, files
+compressed with dictzip can be decompressed with gunzip.
+For more information about dictzip, refer to the DICT project:
+http://www.dict.org
+
+StarDict will search for the .ifo file, then open the .idx or
+.idx.gz file and the .dict.dz or .dict file which are in the same directory
+and have the same base name.
+
+
+
+{2}. The ".ifo" file's format.
+The .ifo file has the following format:
+
+StarDict's dict ifo file
+version=2.4.2
+[options]
+
+Note that the current "version" string must be "2.4.2". If it's not,
+then StarDict will refuse to read the file.
+
+[options]
+---------
+In the example above, [options] expands to any of the following lines
+specifying information about the dictionary. Each option is a keyword
+followed by an equal sign, then the value of that option, then a
+newline. The options may appear in any order.
+
+Note that the dictionary must have at least a bookname, a wordcount and an
+idxfilesize, or the load will fail. All other information is optional. All
+strings should be encoded in UTF-8.
+
+Available options:
+
+bookname=      // required
+wordcount=     // required
+idxfilesize=   // required
+author=
+email=
+website=
+description=
+date=
+sametypesequence= // very important.
+
+
+wordcount is the number of word entries in the .idx file; it must be correct.
+
+idxfilesize is the size (in bytes) of the .idx file. Even if the .idx is
+compressed to a .idx.gz file, this entry must record the original .idx file's
+size, and it must be correct too. Knowing the original size in advance speeds
+up the extraction to memory, as you don't need to call realloc() many
+times.
+
+
+The "sametypesequence" option is described in further detail below.
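
The .ifo rules above can be sketched in a few lines of Python. This is only
an illustration under stated assumptions, not StarDict's own loader: the
names parse_ifo, MAGIC_LINE and REQUIRED_KEYS are hypothetical, the first
line is assumed to be exactly the header shown above, and "version" is
treated as an ordinary key=value line.

```python
# Hypothetical helper for illustration only; not part of StarDict or sdcv.
MAGIC_LINE = "StarDict's dict ifo file"
REQUIRED_KEYS = ("bookname", "wordcount", "idxfilesize")


def parse_ifo(text):
    """Parse the text of a .ifo file into a dict of options."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != MAGIC_LINE:
        raise ValueError("first line is not the .ifo header")

    options = {}
    for line in lines[1:]:
        if "=" not in line:
            continue  # skip blank or malformed lines
        key, _, value = line.partition("=")
        options[key.strip()] = value.strip()

    # The spec says the version string must be exactly "2.4.2".
    if options.get("version") != "2.4.2":
        raise ValueError("version must be 2.4.2")

    # bookname, wordcount and idxfilesize are mandatory.
    missing = [key for key in REQUIRED_KEYS if key not in options]
    if missing:
        raise ValueError("missing required options: " + ", ".join(missing))

    # The two counters are assumed to be written as decimal integers.
    options["wordcount"] = int(options["wordcount"])
    options["idxfilesize"] = int(options["idxfilesize"])
    return options
```

A caller would read the .ifo file as UTF-8 text and pass its contents to
parse_ifo(); every optional key, such as sametypesequence, is simply absent
from the result when the file does not set it.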
+
+***
+sametypesequence
+
+You should first familiarize yourself with the .dict file format
+described in the next section so that you can understand what effect
+this option has on the .dict file.
+
+If the sametypesequence option is set, it tells StarDict that each
+word's data in the .dict file will have the same sequence of datatypes.
+In this case, we expect a .dict file that's been optimized in two
+ways: the type identifiers should be omitted, and the size marker for
+the last data entry of each word should be omitted.
+
+Let's consider some concrete examples of the sametypesequence option.
+
+Suppose that a dictionary records many .wav files, and so sets:
+        sametypesequence=W
+In this case, each word's entry in the .dict file consists solely of a
+wav file. In the .dict file, you would leave out the 'W' character
+before each entry, and you would also omit the 32-bit integer at the
+front of each .wav entry that would normally give the entry's length.
+You can do this since the length is known from the information in the
+.idx file.
+
+As another example, suppose a dictionary contains phonetic information
+and a meaning for each word. The sametypesequence option for this
+dictionary would be:
+        sametypesequence=tm
+Once again, you can omit the 't' and 'm' characters before each data
+entry in the .dict file. In addition, you should omit the terminating
+'\0' for the 'm' entry for each word in the .dict file, as the length
+of the meaning string can be inferred from the length of the phonetic
+string (still indicated by a terminating '\0') and the length of the
+entire word entry (listed in the .idx file).
+
+So for cases where the last data entry for each word normally requires
+a terminating '\0' character, you should omit this character in the
+.dict file.
+And for cases where the last data entry for each word
+normally requires an initial 32-bit number giving the length of the
+field (such as WAV and PNG entries), you must omit this number in the
+dictionary.
+
+Every dictionary should try to use the sametypesequence feature to
+save disk space.
+***
+
+
+{3}. The ".idx" file's format.
+The .idx file is just a word list.
+
+The word list is a sorted list of word entries.
+
+Each entry in the word list contains three fields, one after the other:
+     word_str;  // a utf-8 string terminated by '\0'.
+     word_data_offset;  // word data's offset in .dict file
+     word_data_size;  // word data's total size in .dict file
+
+word_str gives the string representing this word. It's the string
+that is "looked up" by StarDict.
+
+word_data_offset and word_data_size should both be 32-bit numbers in
+network byte order.
+
+No two entries should have the same "word_str". In other words,
+(strcmp(s1, s2) != 0).
+
+The length of "word_str" should be less than 256. In other words,
+(strlen(word) < 256).
+
+The word list must be sorted by calling stardict_strcmp() on the "word_str"
+fields. If the word list order is wrong, StarDict will fail to function
+correctly!
+
+============
+#include <string.h>
+#include <glib.h>
+
+/* Compare case-insensitively (ASCII only) first; break ties with strcmp(). */
+gint stardict_strcmp(const gchar *s1, const gchar *s2)
+{
+       gint a;
+       a = g_ascii_strcasecmp(s1, s2);
+       if (a == 0)
+               return strcmp(s1, s2);
+       else
+               return a;
+}
+============
+g_ascii_strcasecmp() is a glib function:
+Unlike the BSD strcasecmp() function, this only recognizes standard
+ASCII letters and ignores the locale, treating all non-ASCII characters
+as if they are not letters.
+
+stardict_strcmp() works fine with English characters, but it sorts
+characters from other locales poorly. There should be a _strcmp
+function which handles the utf-8 string sorting better. If you know
+one, email me :)
+
+g_utf8_collate()? This is a locale-dependent function. So if you look
+up Chinese characters while in the Chinese locale, it works fine.
+But if you are in some other locale then the lookup will fail, as the
+order is not the same as in the Chinese locale (which was used when
+creating the dictionary).
+
+Convert with g_utf8_to_ucs4() and then compare? This sounds like a good
+solution, but..
+
+The complete solution can be found in "Unicode Technical Standard #10: Unicode
+Collation Algorithm", http://www.unicode.org/reports/tr10/
+
+I hope glib will provide a locale-independent g_utf8_collate() soon.
+http://bugzilla.gnome.org/show_bug.cgi?id=112798
+
+
+
+{4}. The ".dict" file's format.
+The .dict file is a pure data sequence, as the offset and size of each
+word are recorded in the corresponding .idx file.
+
+If the "sametypesequence" option is not used in the .ifo file, then
+the .dict file has fields in the following order:
+==============
+word_1_data_1_type; // a single char identifying the data type
+word_1_data_1_data; // the data
+word_1_data_2_type;
+word_1_data_2_data;
+...... // the number of data entries for each word is determined by
+       // word_data_size in .idx file
+word_2_data_1_type;
+word_2_data_1_data;
+......
+==============
+It's important to note that each field in each word indicates its
+own length, as described below. The number of possible fields per
+word is also not fixed, and is determined by simply reading data until
+you've read word_data_size bytes for that word.
+
+
+Suppose the "sametypesequence" option is used in the .ifo file, and
+the option is set like this:
+sametypesequence=tm
+Then the .dict file will look like this:
+==============
+word_1_data_1_data
+word_1_data_2_data
+word_2_data_1_data
+word_2_data_2_data
+......
+==============
+The first data entry for each word will have a terminating '\0', but
+the second entry will not have a terminating '\0'. The omissions of
+the type chars and of the last field's size information are the
+optimizations required by the "sametypesequence" option described
+above.
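
As a rough sketch of how the .idx and .dict layouts fit together, the Python
below walks the raw bytes of an uncompressed .idx file and splits one word's
data block into fields, with and without "sametypesequence". It is an
illustration only: read_idx, split_fields and _read_field are hypothetical
names, not StarDict functions, and compressed (.idx.gz, .dict.dz) input is
assumed to have been decompressed already.

```python
import struct


def read_idx(idx_bytes):
    """Yield (word, offset, size) for each entry of an uncompressed .idx file."""
    pos = 0
    while pos < len(idx_bytes):
        end = idx_bytes.index(b"\0", pos)  # word_str is '\0'-terminated
        word = idx_bytes[pos:end].decode("utf-8")
        # Two 32-bit unsigned numbers in network (big-endian) byte order.
        offset, size = struct.unpack_from(">II", idx_bytes, end + 1)
        yield word, offset, size
        pos = end + 1 + 8


def _read_field(data, pos, type_char):
    """Read one self-delimiting field; return (payload, next_pos)."""
    if type_char.isupper():
        # Upper-case types start with a 32-bit length in network byte order.
        (length,) = struct.unpack_from(">I", data, pos)
        start = pos + 4
        return data[start:start + length], start + length
    # Lower-case types are terminated by '\0'.
    end = data.index(b"\0", pos)
    return data[pos:end], end + 1


def split_fields(data, sametypesequence=None):
    """Split one word's data block into (type_char, payload) pairs."""
    fields = []
    pos = 0
    if sametypesequence:
        # Type chars are omitted; the last field also has no size marker.
        for type_char in sametypesequence[:-1]:
            payload, pos = _read_field(data, pos, type_char)
            fields.append((type_char, payload))
        fields.append((sametypesequence[-1], data[pos:]))
    else:
        # Each field is preceded by its type char; read until data runs out.
        while pos < len(data):
            type_char = chr(data[pos])
            payload, pos = _read_field(data, pos + 1, type_char)
            fields.append((type_char, payload))
    return fields
```

For each (word, offset, size) entry, the word's data block is the slice
dict_bytes[offset:offset + size] of the uncompressed .dict contents, and that
slice is what split_fields() expects as its first argument.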
+
+
+Type identifiers
+----------------
+Here are the single-character type identifiers that may be used with
+the "sametypesequence" option in the .ifo file, or may appear in the
+.dict file itself if the "sametypesequence" option is not used.
+
+Lower-case characters signify that a field's size is determined by a
+terminating '\0', while upper-case characters indicate that the data
+begins with a 32-bit integer that gives the length of the data field.
+
+'m'
+Word's pure text meaning.
+The data should be a utf-8 string ending with '\0'.
+
+'l'
+Word's pure text meaning.
+The data is NOT a utf-8 string, but is instead a string in locale
+encoding, ending with '\0'. Sometimes using this type will save disk
+space, but its use is discouraged.
+
+'g'
+A utf-8 string which is marked up with the Pango text markup language.
+For more information about this markup language, see the "Pango
+Reference Manual."
+You might have it installed locally at:
+file:///usr/share/gtk-doc/html/pango/PangoMarkupFormat.html
+
+'t'
+English phonetic string.
+The data should be a utf-8 string ending with '\0'.
+
+Here are some utf-8 phonetic characters:
+θʃŋʧðʒæıʌʊɒɛəɑɜɔˌˈːˑ
+æɑɒʌәєŋvθðʃʒːɡˏˊˋ
+
+'y'
+Chinese YinBiao.
+The data should be a utf-8 string ending with '\0'.
+
+
+'W'
+wav file.
+The data begins with a network byte-ordered 32-bit integer giving the wav
+file's size, immediately followed by the file's content.
+
+'P'
+png file.
+The data begins with a network byte-ordered 32-bit integer giving the png
+file's size, immediately followed by the file's content.
+
+'X'
+This type identifier is reserved for experimental extensions.
+
+
+{5}. Tree Dictionary
+The tree dictionary support is used for information viewing, etc.
+
+A tree dictionary contains three files: sometreedict.ifo, sometreedict.tdx.gz
+and sometreedict.dict.dz.
+
+It is better to compress the .tdx file, as it is always loaded into memory.
+
+The .ifo file has the following format:
+
+StarDict's treedict ifo file
+version=2.4.2
+[options]
+
+Available options:
+
+bookname=      // required
+tdxfilesize=   // required
+wordcount=
+author=
+email=
+website=
+description=
+date=
+sametypesequence=
+
+wordcount is only used for the info view in the dictionary management dialog,
+so it is not important in a tree dictionary.
+
+The .tdx file is just the word list.
+-----------
+The word list is a tree list of word entries.
+
+Each entry in the word list contains four fields, one after the other:
+     word_str;  // a utf-8 string terminated by '\0'.
+     word_data_offset;  // word data's offset in .dict file
+     word_data_size;  // word data's total size in .dict file. it can be 0.
+     word_subentry_count; // how many sub-entries this entry has, 0 means none.
+
+Sub-entries immediately follow their parent entry, so the order matches that
+of a tree view with all of its nodes expanded, read from top to
+bottom.
+
+The .dict file's format is the same as the normal dictionary.
+
+
+
+{6}. More information.
+You can read "src/lib.cpp", "src/dictmanagedlg.cpp" and
+"src/tools/*.cpp" for more information.
+
+If you have any questions, email me. :)
+
+Thanks to Will Robinson for cleaning up this file's
+English.
+
+Hu Zheng
+http://forlinux.yeah.net
+2003.11.11
diff --git a/doc/sdcv.1 b/doc/sdcv.1
new file mode 100644
index 0000000..86351b7
--- /dev/null
+++ b/doc/sdcv.1
@@ -0,0 +1,104 @@
+.TH SDCV 1 "2006-04-24" "sdcv-0.4.2"
+.SH NAME
+sdcv \- console version of StarDict program
+.SH SYNOPSIS
+.B sdcv
+[
+.BI options
+]
+[list of words]
+.SH DESCRIPTION
+.I sdcv
+is a simple, cross-platform text-based utility
+for working with dictionaries in StarDict format.
+Each word from "list of words" may be a string
+with a leading '/' for using a Fuzzy search algorithm,
+with a leading '|' for using full-text search,
+and the string may contain '?' and '*' for regexp search.
+It works in interactive and non-interactive mode.
+To exit from interactive mode press Ctrl+D.
+In interactive mode,
+if sdcv was compiled with readline library support,
+you can use the UP and DOWN keys to cycle through history.
+.SH OPTIONS
+.TP 8
+.B "\-h \-\-help"
+Display help message and exit
+.TP 8
+.B "\-v \-\-version"
+Display version and exit
+.TP 8
+.B "\-l \-\-list\-dicts"
+Display list of available dictionaries and exit
+.TP 8
+.B "\-u \-\-use\-dict bookname"
+Search only the dictionary with this bookname
+.TP 8
+.B "\-n \-\-non\-interactive"
+For use in scripts
+.TP 8
+.B "\-x \-\-only\-data\-dir"
+For use in scripts: only use the dictionaries in data-dir, do not search in user and system directories
+.TP 8
+.B "\-e \-\-exact\-search"
+Do not fuzzy-search for similar words, only return exact matches
+.TP 8
+.B "\-j \-\-json"
+Print the results of list-dicts and searches as json, not as plain text.
+For use in automatically processing the results of a dictionary lookup.
+.TP 8
+.B "\-\-utf8\-output"
+Force sdcv to not convert to locale charset, output in utf8
+.TP 8
+.B "\-\-utf8\-input"
+Force sdcv to not convert from locale charset, assume that
+input is in utf8
+.TP 8
+.B "\-\-data\-dir path/to/directory"
+Use this directory as the path to the stardict data directory. This means that
+sdcv searches for dictionaries in the data-dir/dic directory.
+.TP 8
+.B "\-\-color"
+Use ANSI escape codes for colorizing sdcv output (does not work with json output).
+.SH FILES
+.TP
+/usr/share/stardict/dic
+.TP
+$(HOME)/.stardict/dic
+
+Place where sdcv expects to find dictionaries.
+Instead of /usr/share/stardict/dic you can use any directory
+you want, just set the STARDICT_DATA_DIR environment variable.
+For example, if you have dictionaries in /mnt/data/stardict-dicts/dic,
+set STARDICT_DATA_DIR to /mnt/data/stardict-dicts.
+.TP
+$(HOME)/.sdcv_history
+
+This file contains the last $(SDCV_HISTSIZE) words that you looked up with sdcv.
+sdcv uses this file only if it was compiled with readline library support.
+.TP
+$(HOME)/.sdcv_ordering
+
+This is a text file containing one dictionary bookname per line.
+It specifies in which order the results of a search should be shown.
+.SH ENVIRONMENT
+Environment Variables Used By \fIsdcv\fR:
+.TP 20
+.B STARDICT_DATA_DIR
+If set, sdcv uses this variable as the data directory; this means that sdcv
+searches for dictionaries in $\fBSTARDICT_DATA_DIR\fR/dic
+.TP 20
+.B SDCV_HISTSIZE
+If set, sdcv writes to $(HOME)/.sdcv_history the last $(SDCV_HISTSIZE) words
+that you looked up using sdcv. If it is not set, then the last 2000 words are saved in $(HOME)/.sdcv_history.
+.TP 20
+.B SDCV_PAGER
+If SDCV_PAGER is set, its value is used as the name of the program
+to use to display the dictionary article.
+.SH BUGS
+Email bug reports to dushistov at mail dot ru. Be sure to include the word
+"sdcv" somewhere in the "Subject:" field.
+.SH AUTHORS
+Evgeniy A. Dushistov, Hu Zheng
+.SH SEE ALSO
+stardict(1), http://sdcv.sourceforge.net/, http://stardict.sourceforge.net
diff --git a/doc/uk/sdcv.1 b/doc/uk/sdcv.1
new file mode 100644
index 0000000..ff3b270
--- /dev/null
+++ b/doc/uk/sdcv.1
@@ -0,0 +1,84 @@
+.TH SDCV 1 "2004-12-06" "sdcv-0.4"
+.SH NAME
+sdcv \- console version of StarDict
+.SH SYNOPSIS
+.B sdcv
+[
+.BI options
+]
+[list of words]
+.SH DESCRIPTION
+.I sdcv
+is a simple, cross-platform text utility for working with
+dictionaries in StarDict format.
+A word from the "list of words" may be a string with a leading slash '/'
+to invoke the fuzzy search algorithm; the string may
+contain '?' and '*' to use regular-expression search.
+The utility works in interactive and non-interactive modes.
+To exit interactive mode, press Ctrl+D.
+In interactive mode, if sdcv was compiled with readline
+library support, you can use the UP
+and DOWN keys to work with the history.
+.SH OPTIONS
+.TP 8
+.B "\-h \-\-help"
+Display help message and exit
+.TP 8
+.B "\-v \-\-version"
+Display version and exit
+.TP 8
+.B "\-l \-\-list\-dicts"
+Display list of available dictionaries and exit
+.TP 8
+.B "\-u \-\-use\-dict bookname"
+Search using only the dictionary with this name (bookname)
+.TP 8
+.B "\-n \-\-non\-interactive"
+For use in scripts
+.TP 8
+.B "\-\-utf8\-output"
+Force sdcv to produce output in utf8 rather than in the system locale encoding
+.TP 8
+.B "\-\-utf8\-input"
+Force sdcv not to read input in the system locale encoding, but to assume
+that the input is in utf8
+.TP 8
+.B "\-\-data\-dir path/to/directory"
+Use this directory as the path to the stardict data directory.
+This means that sdcv searches for dictionaries in the data-dir/dic directory.
+.SH FILES
+.TP
+/usr/share/stardict/dic
+.TP
+$(HOME)/.stardict/dic
+
+Place where sdcv expects to find dictionaries.
+Instead of the path /usr/share/stardict/dic you can use any directory
+you want; just set the STARDICT_DATA_DIR environment variable.
+For example, if you have dictionaries in /mnt/data/stardict-dicts/dic,
+set STARDICT_DATA_DIR to /mnt/data/stardict-dicts.
+.TP
+$(HOME)/.sdcv_history
+
+This file contains the last $(SDCV_HISTSIZE) words that you looked up with sdcv.
+sdcv uses this file only if it was compiled
+with readline library support.
+
+.SH ENVIRONMENT
+Environment variables for \fIsdcv\fR:
+.TP 20
+.B STARDICT_DATA_DIR
+If set, sdcv uses this variable as the data directory; this means
+that sdcv searches for dictionaries in $\fBSTARDICT_DATA_DIR\fR/dic
+.TP 20
+.B SDCV_HISTSIZE
+If set, sdcv writes to $(HOME)/.sdcv_history only
+the last $(SDCV_HISTSIZE) words that you looked up with sdcv. If not set,
+the last 2000 words are saved in $(HOME)/.sdcv_history.
+.SH BUGS
+Send bug reports to dushistov at mail dot ru.
+Do not forget to include the word "sdcv" somewhere in the "Subject:" field.
+.SH AUTHORS
+Evgeniy A.
+Dushistov, Hu Zheng
+.SH SEE ALSO
+stardict(1), http://sdcv.sourceforge.net/, http://stardict.sourceforge.net
--
cgit v1.2.3