Add `parsebib-clean-TeX-markup`.

Move the functionality that makes TeX markup suitable for display, formerly part of Ebib, into parsebib. This makes it available to other packages that use parsebib.
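
A minimal usage sketch, assuming parsebib is installed and on the `load-path`. The expected values for the direct calls are copied from the tests added in this commit. The sample BibTeX entry and its key `doe2022` are hypothetical, and no result is shown for that call.

```elisp
(require 'parsebib)

;; Direct calls: expected values are taken from test/parsebib-test.el.
(parsebib-clean-TeX-markup "\\S 5.2")
;; => "§5.2"
(parsebib-clean-TeX-markup "\\'{a}")
;; => "a\N{COMBINING ACUTE ACCENT}"  (base letter, then combining character)
(parsebib-clean-TeX-markup "\\AA{} and")
;; => "\N{LATIN CAPITAL LETTER A WITH RING ABOVE} and"

;; Through the parser: pass the new `:replace-TeX' keyword so that field
;; values are cleaned while the buffer is read.  The entry is made up
;; for illustration.
(with-temp-buffer
  (insert "@book{doe2022,\n  title = {Caf\\'{e} \\& bar}\n}\n")
  (let* ((result (parsebib-parse-bib-buffer :replace-TeX t))
         (entries (car result)))        ; first element: hash table of entries
    (cdr (assoc-string "title" (gethash "doe2022" entries)))))
```

Callers that need different replacements can presumably adjust `parsebib-TeX-markup-replace-alist` before parsing, since `parsebib-clean-TeX-markup` reads that variable on every call.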
author: Joost Kremers <joostkremers@fastmail.fm> 2022-06-16 17:06:19 +0200
committer: Joost Kremers <joostkremers@fastmail.fm> 2022-06-16 17:06:19 +0200
commit: 83a77ea7e51f78093b1a0f1eb51615ec9295829d
tree: fc3b1971c74a3299fcb7900dc379ea9c64c71639
parent: 4c65ec2cd316b8e8a0f5d5ceee09eeb59944b8d8
3 files changed, 433 insertions, 20 deletions
diff --git a/README.md b/README.md
index 9ca7341..ae57401 100644
--- a/README.md
+++ b/README.md
@@ -91,7 +91,7 @@ Collect all `@comments` in the current buffer and return them as a list.
 Find and return the BibTeX dialect for the current buffer. The BibTeX dialect is either `BibTeX` or `biblatex` and can be defined in a local-variable block at the end of the file.
 
 
-#### `parsebib-parse-bib-buffer (&keys entries strings expand-strings inheritance fields)` ####
+#### `parsebib-parse-bib-buffer (&keys entries strings expand-strings inheritance fields replace-TeX)` ####
 
 Collect all BibTeX data in the current buffer. Return a five-element list:
 
@@ -103,6 +103,8 @@ If the arguments `entries` and `strings` are present, they should be hash tables
 
 The argument `expand-strings` functions as the same-name argument in `parsebib-collect-strings`, and the arguments `inheritance` and `fields` function as the same-name arguments in `parsebib-collect-bib-entries`.
 
+If `replace-TeX` is set, (La)TeX markup in field values is replaced with text that is more suitable for display. The variable `parsebib-TeX-markup-replace-alist` determines what exactly is replaced. (Note: its definition in `parsebib.el` is more informative than its actual value; see also the relevant tests in `parsebib-test.el` for examples of its use.)
+
 Note that `parsebib-parse-bib-buffer` only makes one pass through the buffer. It is therefore a bit faster than calling all the `parsebib-collect-*` functions above in a row, since that would require making four passes through the buffer.
 
 
@@ -125,14 +127,14 @@ All functions here take an optional position argument, which is the position in
 Find the first BibTeX item following `pos`, where an item is either a BibTeX entry, or a `@Preamble`, `@String`, or `@Comment`. This function returns the item's type as a string, i.e., either `"preamble"`, `"string"`, or `"comment"`, or the entry type. Note that the `@` is *not* part of the returned string. This function moves point into the correct position to start reading the actual contents of the item, which is done by one of the following functions.
 
 
-#### `parsebib-read-entry (type &optional pos strings keep-fields)` ####
+#### `parsebib-read-entry (type &optional pos strings keep-fields replace-TeX)` ####
 #### `parsebib-read-string (&optional pos strings)` ####
 #### `parsebib-read-preamble (&optional pos)` ####
 #### `parsebib-read-comment (&optional pos)` ####
 
 These functions do what their names suggest: read one single item of the type specified. Each takes the `pos` argument just mentioned. In addition, `parsebib-read-string` and `parsebib-read-entry` take an extra argument, a hash table of `@string` definitions. When provided, abbreviations in the `@string` definitions or in field values are expanded. Note that `parsebib-read-entry` takes the entry type (as returned by `parsebib-find-next-entry`) as argument.
 
-`parsebib-read-entry` also takes an optional argument `keep-fields`. This is a list of names of the fields that should be included in the entries returned. Fields not in this list are ignored (except for `=type=` and `=key=`, which are always included). Note that the field names should be strings; comparison is case-insensitive.
+`parsebib-read-entry` takes two more optional arguments: `keep-fields` and `replace-TeX`. `keep-fields` is a list of names of the fields that should be included in the entries returned. Fields not in this list are ignored (except for `=type=` and `=key=`, which are always included). Note that the field names should be strings; comparison is case-insensitive. `replace-TeX` is a flag indicating whether TeX markup in field values should be replaced with something that's more suitable for display.
 
 The reading functions return the contents of the item they read: `parsebib-read-preamble` and `parsebib-read-comment` return the text as a string. `parsebib-read-string` returns a cons cell of the form `(<abbrev> . <string>)`, and `parsebib-read-entry` returns the entry as an alist of `(<field> . <value>)` pairs. One of these pairs contains the entry type, and one contains the entry key. These have the keys `"=type="` and `"=key="`, respectively.
 
@@ -229,7 +231,7 @@ This is a high-level function meant for retrieving bibliographic entries in such
 
 `parsebib-parse` basically just calls `parsebib-parse-bib-buffer` or `parsebib-parse-json-buffer` as appropriate and passes its arguments on to those functions. The argument `entries` is passed to both, as is `fields`. The field names in `fields` need to be strings, regardless of the file format, though. `parsebib-parse` converts the strings to symbols when it parses a `.json` file. The `strings` argument is only passed to `parsebib-parse-bib-buffer`, since there are obviously no `@String`s in a `.json` file.
 
-The `display` argument controls the way in which the entry data is returned. By default, it returns the data in a way that is suitable for display. For `.bib` files, this means that `@String` abbreviations are expanded and cross-references are resolved. For `.json` files, it means that fields are returned as strings and that month and day parts in date fields are ignored.
+The `display` argument controls the way in which the entry data is returned. By default, it returns the data in a way that is suitable for display. For `.bib` files, this means that `@String` abbreviations are expanded, cross-references are resolved and TeX markup in field values is removed or replaced with Unicode characters. For `.json` files, it means that fields are returned as strings and that month and day parts in date fields are ignored.
 
 See the doc strings of `parsebib-parse`, `parsebib-parse-bib-buffer` and `parsebib-parse-json-buffer` for details on the meaning of `display`.
 
diff --git a/parsebib.el b/parsebib.el
index f382cd2..cb6b3f4 100644
--- a/parsebib.el
+++ b/parsebib.el
@@ -6,7 +6,7 @@
 ;; Author: Joost Kremers <joostkremers@fastmail.fm>
 ;; Maintainer: Joost Kremers <joostkremers@fastmail.fm>
 ;; Created: 2014
-;; Version: 3.0
+;; Version: 4.0
 ;; Keywords: text bibtex
 ;; URL: https://github.com/joostkremers/parsebib
 ;; Package-Requires: ((emacs "25.1"))
@@ -179,6 +179,198 @@ target field is set to the symbol `none'.")
 (defconst parsebib--key-regexp "[^\"@\\#%',={} \t\n\f]+" "Regexp describing a licit key.")
 (defconst parsebib--entry-start "^[ \t]*@" "Regexp describing the start of an entry.")
 
+(defun parsebib--build-TeX-accent-command-regexp (command accent)
+  "Build a regexp-replacement pair for LaTeX diacritics.
+
+COMMAND is the name of a TeX or LaTeX command (without
+backslash), ACCENT is the character (usually a Unicode combining
+character) that COMMAND generates.  Both COMMAND and ACCENT must
+be strings.
+
+The return value is a cons cell that can be included in
+`parsebib-TeX-markup-replace-alist' directly.
+
+The car of this cons cell is a regexp matching the TeX or LaTeX
+COMMAND, capturing exactly one obligatory argument.  The
+cdr is a replacement string, the concatenation of \"\\1\" and
+ACCENT.
+
+Specifically, the car regexp matches a string composed of a
+backslash, followed by COMMAND and a single letter (i.e.
+matching [[:alpha:]]).  The regexp matches if the letter is in
+curly braces (\"\\d{a}\") or if it is separated from COMMAND by
+white space (\"\\d a\").  If COMMAND is a non-letter character,
+the regexp also matches if the letter follows COMMAND
+immediately, without white space or curly braces (\"\\'a\").  In
+all variants, the letter is captured with group number 1."
+  (cons
+   (rx-to-string
+    `(: "\\" ,command
+	(or (: (* blank) "{" (group-n 1 letter) "}")
+	    (: (,(if (string-match "[a-zA-Z]" command) '+ '*) blank)
+	       (group-n 1 letter))))
+    t)
+   (rx-to-string `(: (backref 1) ,accent) t)))
+
+(defun parsebib--build-TeX-command-regexp (command replacement)
+  "Build a regexp-replacement pair for a LaTeX command.
+
+COMMAND is the name of a TeX or LaTeX command (without
+backslash).  Both COMMAND and REPLACEMENT must be strings.
+
+The return value is a cons cell: its car is a regexp matching
+COMMAND, its cdr is REPLACEMENT.  This cons cell can be included
+in `parsebib-TeX-markup-replace-alist' directly.
+
+Specifically, the regexp matches a string composed of a backslash
+followed by COMMAND and terminated by a pair of curly
+braces (`\\COMMAND{}'), a word ending or a space.  Such a
+trailing space will be included in the overall match."
+  (cons
+   (rx-to-string
+    `(: "\\" ,(if (listp command) `(or ,@command) command)
+	;; If a command is terminated by a space, LaTeX includes that
+	;; space in the command itself, so it is not printed (like the
+	;; behaviour for a following {}) Accordingly, if there is one,
+	;; include that space in the replaced string by matching on it
+	;; first.
+	(or (+ blank) word-end "{}"))
+    t)
+   replacement))
+
+(defun parsebib--convert-tex-italics (str)
+  "Return first sub-expression match in STR, in italics."
+  (propertize (match-string 1 str) 'face 'italic))
+
+(defun parsebib--convert-tex-bold (str)
+  "Return first sub-expression match in STR, in bold."
+  (propertize (match-string 1 str) 'face 'bold))
+
+(defun parsebib--convert-tex-small-caps (str)
+  "Return first sub-expression match in STR, capitalised."
+  (upcase (match-string 1 str)))
+
+(defvar parsebib-TeX-markup-replace-alist
+  `(;; Commands defined to work in both math and text mode.  (Dashes are
+    ;; separate because they are not backslash-escaped, unlike everything else.)
+    ("---\\|\\\\textemdash\\(?: +\\|{}\\|\\>\\)" . "\N{EM DASH}")
+    ("--\\|\\\\textendash\\(?: +\\|{}\\|\\>\\)"  . "\N{EN DASH}")
+    ,@(mapcar
+       (apply-partially 'apply 'parsebib--build-TeX-command-regexp)
+       '((("ddag" "textdaggerdbl")        "\N{DOUBLE DAGGER}")
+         (("dag" "textdagger")            "\N{DAGGER}")
+	 ("textpertenthousand"            "\N{PER TEN THOUSAND SIGN}")
+         ("textperthousand"               "\N{PER MILLE SIGN}")
+	 ("textquestiondown"              "\N{INVERTED QUESTION MARK}")
+         ("P"                             "\N{PILCROW SIGN}")
+         (("$" "textdollar")              "$")
+	 ("S"                             "\N{SECTION SIGN}")
+         (("ldots" "dots" "textellipsis") "\N{HORIZONTAL ELLIPSIS}")))
+
+    ;; Text-mode Accents
+    ,@(mapcar
+       (apply-partially 'apply 'parsebib--build-TeX-accent-command-regexp)
+       '(("\"" "\N{COMBINING DIAERESIS}")
+         ("'"  "\N{COMBINING ACUTE ACCENT}")
+         ("."  "\N{COMBINING DOT ABOVE}")
+         ("="  "\N{COMBINING MACRON}")
+	 ("^"  "\N{COMBINING CIRCUMFLEX ACCENT}")
+         ("`"  "\N{COMBINING GRAVE ACCENT}")
+	 ("b"  "\N{COMBINING MACRON BELOW}")
+         ("c"  "\N{COMBINING CEDILLA}")
+         ("d"  "\N{COMBINING DOT BELOW}")
+         ("H"  "\N{COMBINING DOUBLE ACUTE ACCENT}")
+         ("k"  "\N{COMBINING OGONEK}")
+         ("U"  "\N{COMBINING DOUBLE VERTICAL LINE ABOVE}")
+	 ("u"  "\N{COMBINING BREVE}")
+         ("v"  "\N{COMBINING CARON}")
+         ("~"  "\N{COMBINING TILDE}")
+         ("|"  "\N{COMBINING COMMA ABOVE}")
+         ("f"  "\N{COMBINING INVERTED BREVE}")
+         ("G"  "\N{COMBINING DOUBLE GRAVE ACCENT}")
+         ("h"  "\N{COMBINING HOOK ABOVE}")
+         ("C"  "\N{COMBINING DOUBLE GRAVE ACCENT}")
+         ("r"  "\N{COMBINING RING ABOVE}")))
+
+    ;; LaTeX2 Escapable "Special" Characters
+    ("\\\\%" . "%") ("\\\\&" . "&") ("\\\\#" . "#")
+
+    ;; Quotes
+    ("``" . "\N{LEFT DOUBLE QUOTATION MARK}")
+    ("`"  . "\N{LEFT SINGLE QUOTATION MARK}")
+    ("''" . "\N{RIGHT DOUBLE QUOTATION MARK}")
+    ("'"  . "\N{RIGHT SINGLE QUOTATION MARK}")
+
+    ;; Formatting Commands
+    ("\\\\textit{\\(.*?\\)}" . parsebib--convert-tex-italics)
+    ("\\\\emph{\\(.*?\\)}"   . parsebib--convert-tex-italics)
+    ("\\\\textbf{\\(.*?\\)}" . parsebib--convert-tex-bold)
+    ("\\\\textsc{\\(.*?\\)}" . parsebib--convert-tex-small-caps)
+
+    ;; Non-ASCII Letters (Excluding Accented Letters)
+    ,@(mapcar
+       (apply-partially 'apply 'parsebib--build-TeX-command-regexp)
+       '(("AA" "\N{LATIN CAPITAL LETTER A WITH RING ABOVE}")
+         ("AE" "\N{LATIN CAPITAL LETTER AE}")
+         ("DH" "\N{LATIN CAPITAL LETTER ETH}")
+         ("DJ" "\N{LATIN CAPITAL LETTER ETH}")
+         ("L"  "\N{LATIN CAPITAL LETTER L WITH STROKE}")
+	 ("SS" "\N{LATIN CAPITAL LETTER SHARP S}")
+         ("NG" "\N{LATIN CAPITAL LETTER ENG}")
+         ("OE" "\N{LATIN CAPITAL LIGATURE OE}")
+         ("O"  "\N{LATIN CAPITAL LETTER O WITH STROKE}")
+         ("TH" "\N{LATIN CAPITAL LETTER THORN}")
+
+         ("aa" "\N{LATIN SMALL LETTER A WITH RING ABOVE}")
+         ("ae" "\N{LATIN SMALL LETTER AE}")
+         ("dh" "\N{LATIN SMALL LETTER ETH}")
+         ("dj" "\N{LATIN SMALL LETTER ETH}")
+         ("l"  "\N{LATIN SMALL LETTER L WITH STROKE}")
+	 ("ss" "\N{LATIN SMALL LETTER SHARP S}")
+         ("ng" "\N{LATIN SMALL LETTER ENG}")
+         ("oe" "\N{LATIN SMALL LIGATURE OE}")
+         ("o"  "\N{LATIN SMALL LETTER O WITH STROKE}")
+         ("th" "\N{LATIN SMALL LETTER THORN}")
+
+	 ("ij" "ij")
+         ("i"  "\N{LATIN SMALL LETTER DOTLESS I}")
+         ("j"  "\N{LATIN SMALL LETTER DOTLESS J}")))
+
+    ;; Commands with obligatory non-empty argument
+    ("\\\\[a-zA-Z*]+\\(?:\\[.*\\]\\)?{\\(.+?\\)}" . "\\1")
+
+    ;; Commands without arguments, optionally terminated by empty braces
+    ("\\(\\\\[a-zA-Z*]+\\)\\(?:\\[.*\\]\\)?\\(?:{}\\)?" . "\\1")
+
+    ;; Collapse white space
+    ("[[:blank:]]+" . " ")
+
+    ;; Remove all remaining {braces}
+    ("{" . "") ("}" . ""))
+  "Alist of strings and replacements for TeX markup.
+This is used in `parsebib-clean-TeX-markup' to make TeX markup more
+suitable for display.  Each item in the list consists of a regexp
+and its replacement.  The replacement can be a string (which will
+simply replace the match) or a function (the match will be
+replaced by the result of calling the function on the match
+string).  Earlier elements are evaluated before later ones, so if
+one string is a subpattern of another, the second must appear
+later (e.g. \"''\" is before \"'\").")
+
+(defun parsebib-clean-TeX-markup (string)
+  "Return STRING without TeX markup.
+Any substring matching the car of a cell in
+`parsebib-TeX-markup-replace-alist' is replaced with the
+corresponding cdr (if the cdr is a string), or with the result of
+calling the cdr on the match (if it is a function).  This is done
+with `replace-regexp-in-string', which see for details."
+  (let ((case-fold-search nil))
+    (save-match-data
+      (cl-loop for (pattern . replacement) in parsebib-TeX-markup-replace-alist
+	       do (setq string (replace-regexp-in-string
+				pattern replacement string))
+	       finally return string))))
+
 ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
 ;; Matching and parsing stuff ;;
 ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
@@ -228,7 +420,7 @@ if a matching delimiter was found."
     ;; If forward-sexp does not result in an error, we want to return t.
     t))
 
-(defun parsebib--parse-bib-value (limit &optional strings)
+(defun parsebib--parse-bib-value (limit &optional strings replace-TeX)
   "Parse value at point.
 A value is either a field value or a @String expansion.  Return
 the value as a string.  No parsing is done beyond LIMIT, but note
@@ -238,7 +430,11 @@ STRINGS, if non-nil, is a hash table of @String definitions.
 @String abbrevs in the value to be parsed are then replaced with
 their expansions.  Additionally, newlines in field values are
 removed, white space is reduced to a single space and braces or
-double quotes around field values are removed."
+double quotes around field values are removed.
+
+REPLACE-TEX indicates whether TeX markup should be replaced with
+ASCII/Unicode characters.  See the variable
+`parsebib-TeX-markup-replace-alist' for details."
   (let (res)
     (while (and (< (point) limit)
                 (not (looking-at-p ",")))
@@ -253,9 +449,12 @@ double quotes around field values are removed."
        ((looking-at "[[:space:]]*#[[:space:]]*")
         (goto-char (match-end 0)))
        (t (forward-char 1)))) ; So as not to get stuck in an infinite loop.
-    (if strings
-        (string-join (parsebib--expand-strings (nreverse res) strings))
-      (string-join (nreverse res) " # "))))
+    (setq res (if strings
+                  (string-join (parsebib--expand-strings (nreverse res) strings))
+                (string-join (nreverse res) " # ")))
+    (if replace-TeX
+        (parsebib-clean-TeX-markup res)
+      res)))
 
 ;;;;;;;;;;;;;;;;;;;;;
 ;; Expanding stuff ;;
@@ -456,7 +655,7 @@ point."
    into hashid-fields
    finally return (mapconcat #'identity hashid-fields "")))
 
-(defun parsebib-read-entry (type &optional pos strings fields)
+(defun parsebib-read-entry (type &optional pos strings fields replace-TeX)
   "Read a BibTeX entry of type TYPE at the line POS is on.
 TYPE should be a string and should not contain the @
 sign.  The return value is the entry as an alist of (<field> .
@@ -484,7 +683,11 @@ FIELDS is a list of the field names (as strings) to be read and
 included in the result.  Fields not in the list are ignored,
 except \"=key=\" and \"=type=\", which are always included.  Case
 is ignored when comparing fields to the list in FIELDS.  If
-FIELDS is nil, all fields are returned."
+FIELDS is nil, all fields are returned.
+
+REPLACE-TEX indicates whether TeX markup should be replaced with
+ASCII/Unicode characters.  See the variable
+`parsebib-TeX-markup-replace-alist' for details."
   (unless (member-ignore-case type '("comment" "preamble" "string"))
     (when pos (goto-char pos))
     (beginning-of-line)
@@ -501,7 +704,7 @@ FIELDS is nil, all fields are returned."
                     (buffer-substring-no-properties beg (point)))))
         (or key (setq key "")) ; If no key was found, we pretend it's empty and try to read the entry anyway.
         (skip-chars-forward "^," limit) ; Move to the comma after the entry key.
-        (let ((fields (cl-loop for field = (parsebib--parse-bibtex-field limit strings fields)
+        (let ((fields (cl-loop for field = (parsebib--parse-bibtex-field limit strings fields replace-TeX)
                                while field
                                if (consp field) collect field)))
           (push (cons "=type=" type) fields)
@@ -510,7 +713,7 @@ FIELDS is nil, all fields are returned."
               (push (cons "=hashid=" (secure-hash 'sha256 (parsebib--get-hashid-string fields))) fields))
           (nreverse fields))))))
 
-(defun parsebib--parse-bibtex-field (limit &optional strings fields)
+(defun parsebib--parse-bibtex-field (limit &optional strings fields replace-TeX)
   "Parse the field starting at point.
 Do not search beyond LIMIT (a buffer position).  Return a
 cons (FIELD . VALUE), or nil if no field was found.
@@ -522,7 +725,11 @@ FIELDS is a list of the field names (as strings) to be read and
 included in the result.  Fields not in the list are ignored,
 except \"=key=\" and \"=type=\", which are always included.  Case
 is ignored when comparing fields to the list in FIELDS.  If
-FIELDS is nil, all fields are returned."
+FIELDS is nil, all fields are returned.
+
+REPLACE-TEX indicates whether TeX markup should be replaced with
+ASCII/Unicode characters.  See the variable
+`parsebib-TeX-markup-replace-alist' for details."
   (skip-chars-forward "\"#%'(),={} \n\t\f" limit) ; Move to the first char of the field name.
   (unless (>= (point) limit)                      ; If we haven't reached the end of the entry.
     (let ((beg (point)))
@@ -530,7 +737,7 @@ FIELDS is nil, all fields are returned."
           (let ((field-type (buffer-substring-no-properties beg (point))))
             (if (or (not fields)
                     (member-ignore-case field-type fields))
-                (cons field-type (parsebib--parse-bib-value limit strings))
+                (cons field-type (parsebib--parse-bib-value limit strings replace-TeX))
               (parsebib--parse-bib-value limit) ; Skip over the field value.
               :ignore)))))) ; Ignore this field but keep the `cl-loop' in `parsebib-read-entry' going.
 
@@ -650,7 +857,7 @@ file.  Return nil if no dialect is found."
                      (string-match (concat "bibtex-dialect: " (regexp-opt (mapcar #'symbol-name bibtex-dialect-list) t)) comment))
             (intern (match-string 1 comment))))))))
 
-(cl-defun parsebib-parse-bib-buffer (&key entries strings expand-strings inheritance fields)
+(cl-defun parsebib-parse-bib-buffer (&key entries strings expand-strings inheritance fields replace-TeX)
   "Parse the current buffer and return all BibTeX data.
 Return a list of five elements: a hash table with the entries, a
 hash table with the @String definitions, a list of @Preamble
@@ -685,7 +892,11 @@ FIELDS is a list of the field names (as strings) to be read and
 included in the result.  Fields not in the list are ignored,
 except \"=key=\" and \"=type=\", which are always included.  Case
 is ignored when comparing fields to the list in FIELDS.  If
-FIELDS is nil, all fields are returned."
+FIELDS is nil, all fields are returned.
+
+REPLACE-TEX indicates whether TeX markup should be replaced with
+ASCII/Unicode characters.  See the variable
+`parsebib-TeX-markup-replace-alist' for details."
   (save-excursion
     (goto-char (point-min))
     (or (and (hash-table-p entries)
@@ -710,7 +921,7 @@ FIELDS is nil, all fields are returned."
                 ((cl-equalp item "comment")
                  (push (parsebib-read-comment) comments))
                 ((stringp item)
-                 (let ((entry (parsebib-read-entry item nil (if expand-strings strings) fields)))
+                 (let ((entry (parsebib-read-entry item nil (if expand-strings strings) fields replace-TeX)))
                    (when entry
                      (puthash (cdr (assoc-string "=key=" entry)) entry entries))))))
       (when inheritance (parsebib-expand-xrefs entries (if (eq inheritance t) dialect inheritance)))
@@ -1015,7 +1226,8 @@ details.  If FIELDS is nil, all fields are returned."
                                          :strings strings
                                          :expand-strings display
                                          :inheritance display
-                                         :fields fields))
+                                         :fields fields
+                                         :replace-TeX display))
              ((string= (file-name-extension file t) ".json")
               (parsebib-parse-json-buffer :entries entries
                                           :stringify display
diff --git a/test/parsebib-test.el b/test/parsebib-test.el
index fff4c23..f60b233 100644
--- a/test/parsebib-test.el
+++ b/test/parsebib-test.el
@@ -116,4 +116,203 @@
   (should (string= (parsebib-stringify-json-field '(categories . ["fiction" "horror"]))
                    "fiction, horror")))
 
+;;; Tests for `parsebib-clean-TeX-markup'
+
+(ert-deftest parsebib-clean-TeX-markup-dashes ()
+  (should (equal (parsebib-clean-TeX-markup "---") "—"))
+  (should (equal (parsebib-clean-TeX-markup "\\textemdash") "—"))
+  (should (equal (parsebib-clean-TeX-markup "\\textemdash and") "—and"))
+  (should (equal (parsebib-clean-TeX-markup "\\textemdash  and") "—and"))
+  (should (equal (parsebib-clean-TeX-markup "\\textemdash{}") "—"))
+  (should (equal (parsebib-clean-TeX-markup "\\textemdash{}and") "—and"))
+  (should (equal (parsebib-clean-TeX-markup "\\textemdash{} and") "— and"))
+  (should (equal (parsebib-clean-TeX-markup "\\textemdash{}  and") "— and"))
+
+  (should (equal (parsebib-clean-TeX-markup "--") "–"))
+  (should (equal (parsebib-clean-TeX-markup "\\textendash") "–"))
+  (should (equal (parsebib-clean-TeX-markup "\\textendash{}") "–")))
+
+(ert-deftest parsebib-clean-TeX-markup-math-and-text-mode-commands ()
+  (should (equal (parsebib-clean-TeX-markup "\\ddag{} \\textdaggerdbl") "‡ ‡"))
+  (should (equal (parsebib-clean-TeX-markup "10\\textpertenthousand") "10‱"))
+  (should (equal (parsebib-clean-TeX-markup "200\\textperthousand.") "200‰."))
+  (should (equal (parsebib-clean-TeX-markup "\\textquestiondown") "¿"))
+  (should (equal (parsebib-clean-TeX-markup "\\P 3.2") "¶3.2"))
+  (should (equal (parsebib-clean-TeX-markup "\\$ \\textdollar") "$$"))
+  (should (equal (parsebib-clean-TeX-markup "\\S 5.2") "§5.2"))
+  (should (equal (parsebib-clean-TeX-markup "\\ldots{} [\\dots] \\textellipsis and")
+                 "… […] …and")))
+
+(ert-deftest parsebib-clean-TeX-markup-nonletter-diacritics-without-braces ()
+  ;; No space is needed after nonletter diacritic commands.
+  (should (equal (parsebib-clean-TeX-markup "\\\"a") "a\N{COMBINING DIAERESIS}"))
+  (should (equal (parsebib-clean-TeX-markup "\\'a")  "a\N{COMBINING ACUTE ACCENT}"))
+  (should (equal (parsebib-clean-TeX-markup "\\.a")  "a\N{COMBINING DOT ABOVE}"))
+  (should (equal (parsebib-clean-TeX-markup "\\=a")  "a\N{COMBINING MACRON}"))
+  (should (equal (parsebib-clean-TeX-markup "\\^a")  "a\N{COMBINING CIRCUMFLEX ACCENT}"))
+  (should (equal (parsebib-clean-TeX-markup "\\`a")  "a\N{COMBINING GRAVE ACCENT}"))
+  (should (equal (parsebib-clean-TeX-markup "\\~a")  "a\N{COMBINING TILDE}"))
+  (should (equal (parsebib-clean-TeX-markup "\\|a")  "a\N{COMBINING COMMA ABOVE}"))
+  ;; Spaces are possible, though:
+  (should (equal (parsebib-clean-TeX-markup "\\' a")  "a\N{COMBINING ACUTE ACCENT}"))
+  (should (equal (parsebib-clean-TeX-markup "\\'  a")  "a\N{COMBINING ACUTE ACCENT}")))
+
+(ert-deftest parsebib-clean-TeX-markup-letter-diacritics-without-braces ()
+  ;; Diacritic commands that consist of a single letter require a space.
+  (should (equal (parsebib-clean-TeX-markup "\\b a") "a\N{COMBINING MACRON BELOW}"))
+  (should (equal (parsebib-clean-TeX-markup "\\c c") "c\N{COMBINING CEDILLA}"))
+  (should (equal (parsebib-clean-TeX-markup "\\d a") "a\N{COMBINING DOT BELOW}"))
+  (should (equal (parsebib-clean-TeX-markup "\\H a") "a\N{COMBINING DOUBLE ACUTE ACCENT}"))
+  (should (equal (parsebib-clean-TeX-markup "\\k a") "a\N{COMBINING OGONEK}"))
+  (should (equal (parsebib-clean-TeX-markup "\\U a") "a\N{COMBINING DOUBLE VERTICAL LINE ABOVE}"))
+  (should (equal (parsebib-clean-TeX-markup "\\u a") "a\N{COMBINING BREVE}"))
+  (should (equal (parsebib-clean-TeX-markup "\\v a") "a\N{COMBINING CARON}"))
+  (should (equal (parsebib-clean-TeX-markup "\\f a") "a\N{COMBINING INVERTED BREVE}"))
+  (should (equal (parsebib-clean-TeX-markup "\\G a") "a\N{COMBINING DOUBLE GRAVE ACCENT}"))
+  (should (equal (parsebib-clean-TeX-markup "\\h a") "a\N{COMBINING HOOK ABOVE}"))
+  (should (equal (parsebib-clean-TeX-markup "\\C a") "a\N{COMBINING DOUBLE GRAVE ACCENT}"))
+  (should (equal (parsebib-clean-TeX-markup "\\r a") "a\N{COMBINING RING ABOVE}"))
+  ;; More than one space should also work:
+  (should (equal (parsebib-clean-TeX-markup "\\b  a") "a\N{COMBINING MACRON BELOW}"))
+  (should (equal (parsebib-clean-TeX-markup "\\b   a") "a\N{COMBINING MACRON BELOW}"))
+  ;; It shouldn't work without space. Since something like "\ba after" is
+  ;; essentially a command without an (explicit) argument, it should remain
+  ;; unchanged.
+  (should (equal (parsebib-clean-TeX-markup "before \\ba after") "before \\ba after")))
+
+(ert-deftest parsebib-clean-TeX-markup-diacritics-with-braces ()
+  ;; Diacritic commands may use braces to mark the argument.
+  (should (equal (parsebib-clean-TeX-markup "\\\"{a}") "a\N{COMBINING DIAERESIS}"))
+  (should (equal (parsebib-clean-TeX-markup "\\'{a}")  "a\N{COMBINING ACUTE ACCENT}"))
+  (should (equal (parsebib-clean-TeX-markup "\\.{a}")  "a\N{COMBINING DOT ABOVE}"))
+  (should (equal (parsebib-clean-TeX-markup "\\={a}")  "a\N{COMBINING MACRON}"))
+  (should (equal (parsebib-clean-TeX-markup "\\^{a}")  "a\N{COMBINING CIRCUMFLEX ACCENT}"))
+  (should (equal (parsebib-clean-TeX-markup "\\`{a}")  "a\N{COMBINING GRAVE ACCENT}"))
+  (should (equal (parsebib-clean-TeX-markup "\\b{a}")  "a\N{COMBINING MACRON BELOW}"))
+  (should (equal (parsebib-clean-TeX-markup "\\c{c}")  "c\N{COMBINING CEDILLA}"))
+  (should (equal (parsebib-clean-TeX-markup "\\d{a}")  "a\N{COMBINING DOT BELOW}"))
+  (should (equal (parsebib-clean-TeX-markup "\\H{a}")  "a\N{COMBINING DOUBLE ACUTE ACCENT}"))
+  (should (equal (parsebib-clean-TeX-markup "\\k{a}")  "a\N{COMBINING OGONEK}"))
+  (should (equal (parsebib-clean-TeX-markup "\\U{a}")  "a\N{COMBINING DOUBLE VERTICAL LINE ABOVE}"))
+  (should (equal (parsebib-clean-TeX-markup "\\u{a}")  "a\N{COMBINING BREVE}"))
+  (should (equal (parsebib-clean-TeX-markup "\\v{a}")  "a\N{COMBINING CARON}"))
+  (should (equal (parsebib-clean-TeX-markup "\\~{a}")  "a\N{COMBINING TILDE}"))
+  (should (equal (parsebib-clean-TeX-markup "\\|{a}")  "a\N{COMBINING COMMA ABOVE}"))
+  (should (equal (parsebib-clean-TeX-markup "\\f{a}")  "a\N{COMBINING INVERTED BREVE}"))
+  (should (equal (parsebib-clean-TeX-markup "\\G{a}")  "a\N{COMBINING DOUBLE GRAVE ACCENT}"))
+  (should (equal (parsebib-clean-TeX-markup "\\h{a}")  "a\N{COMBINING HOOK ABOVE}"))
+  (should (equal (parsebib-clean-TeX-markup "\\C{a}")  "a\N{COMBINING DOUBLE GRAVE ACCENT}"))
+  (should (equal (parsebib-clean-TeX-markup "\\r{a}")  "a\N{COMBINING RING ABOVE}"))
+  ;; There may be spaces between the command and the argument.
+  (should (equal (parsebib-clean-TeX-markup "\\' {a}")  "a\N{COMBINING ACUTE ACCENT}"))
+  (should (equal (parsebib-clean-TeX-markup "\\'  {a}")  "a\N{COMBINING ACUTE ACCENT}")))
+
+(ert-deftest parsebib-clean-TeX-markup-escapable-characters ()
+  (should (equal (parsebib-clean-TeX-markup "percent: \\%  ampersand: \\&  hash: \\#")
+                 "percent: % ampersand: & hash: #")))
+
+(ert-deftest parsebib-clean-TeX-markup-quotes ()
+  (should (equal (parsebib-clean-TeX-markup "``double'' quotes") "\N{LEFT DOUBLE QUOTATION MARK}double\N{RIGHT DOUBLE QUOTATION MARK} quotes"))
+  (should (equal (parsebib-clean-TeX-markup "`single' quotes") "\N{LEFT SINGLE QUOTATION MARK}single\N{RIGHT SINGLE QUOTATION MARK} quotes")))
+
+(ert-deftest parsebib-clean-TeX-markup-textit ()
+  (should (equal-including-properties
+           (parsebib-clean-TeX-markup "The verb \\textit{krijgen} as an undative verb.")
+           #("The verb krijgen as an undative verb." 9 16
+             (face italic)))))
+
+(ert-deftest parsebib-clean-TeX-markup-emph ()
+  (should (equal-including-properties
+           (parsebib-clean-TeX-markup "The verb \\emph{krijgen} as an undative verb.")
+           #("The verb krijgen as an undative verb." 9 16
+             (face italic)))))
+
+(ert-deftest parsebib-clean-TeX-markup-textbf ()
+  (should (equal-including-properties
+           (parsebib-clean-TeX-markup "The verb \\textbf{krijgen} as an undative verb.")
+           #("The verb krijgen as an undative verb." 9 16
+             (face bold)))))
+
+(ert-deftest parsebib-clean-TeX-markup-textsc ()
+  (should (equal
+           (parsebib-clean-TeX-markup "The verb \\textsc{krijgen} as an undative verb.")
+           "The verb KRIJGEN as an undative verb.")))
+
+(ert-deftest parsebib-clean-TeX-markup-nonascii-letters-with-braces ()
+  ;; The braces should be removed and the space after them retained.
+  (should (equal (parsebib-clean-TeX-markup "\\AA{} and") "\N{LATIN CAPITAL LETTER A WITH RING ABOVE} and"))
+  (should (equal (parsebib-clean-TeX-markup "\\AE{} and") "\N{LATIN CAPITAL LETTER AE} and"))
+  (should (equal (parsebib-clean-TeX-markup "\\DH{} and") "\N{LATIN CAPITAL LETTER ETH} and"))
+  (should (equal (parsebib-clean-TeX-markup "\\DJ{} and") "\N{LATIN CAPITAL LETTER ETH} and"))
+  (should (equal (parsebib-clean-TeX-markup "\\L{} and")  "\N{LATIN CAPITAL LETTER L WITH STROKE} and"))
+  (should (equal (parsebib-clean-TeX-markup "\\SS{} and") "\N{LATIN CAPITAL LETTER SHARP S} and"))
+  (should (equal (parsebib-clean-TeX-markup "\\NG{} and") "\N{LATIN CAPITAL LETTER ENG} and"))
+  (should (equal (parsebib-clean-TeX-markup "\\OE{} and") "\N{LATIN CAPITAL LIGATURE OE} and"))
+  (should (equal (parsebib-clean-TeX-markup "\\O{} and")  "\N{LATIN CAPITAL LETTER O WITH STROKE} and"))
+  (should (equal (parsebib-clean-TeX-markup "\\TH{} and") "\N{LATIN CAPITAL LETTER THORN} and"))
+  (should (equal (parsebib-clean-TeX-markup "\\aa{} and") "\N{LATIN SMALL LETTER A WITH RING ABOVE} and"))
+  (should (equal (parsebib-clean-TeX-markup "\\ae{} and") "\N{LATIN SMALL LETTER AE} and"))
+  (should (equal (parsebib-clean-TeX-markup "\\dh{} and") "\N{LATIN SMALL LETTER ETH} and"))
+  (should (equal (parsebib-clean-TeX-markup "\\dj{} and") "\N{LATIN SMALL LETTER ETH} and"))
+  (should (equal (parsebib-clean-TeX-markup "\\l{} and")  "\N{LATIN SMALL LETTER L WITH STROKE} and"))
+  (should (equal (parsebib-clean-TeX-markup "\\ss{} and") "\N{LATIN SMALL LETTER SHARP S} and"))
+  (should (equal (parsebib-clean-TeX-markup "\\ng{} and") "\N{LATIN SMALL LETTER ENG} and"))
+  (should (equal (parsebib-clean-TeX-markup "\\oe{} and") "\N{LATIN SMALL LIGATURE OE} and"))
+  (should (equal (parsebib-clean-TeX-markup "\\o{} and")  "\N{LATIN SMALL LETTER O WITH STROKE} and"))
+  (should (equal (parsebib-clean-TeX-markup "\\th{} and") "\N{LATIN SMALL LETTER THORN} and"))
+  (should (equal (parsebib-clean-TeX-markup "\\ij{} and") "ij and"))
+  (should (equal (parsebib-clean-TeX-markup "\\i{} and")  "\N{LATIN SMALL LETTER DOTLESS I} and"))
+  (should (equal (parsebib-clean-TeX-markup "\\j{} and")  "\N{LATIN SMALL LETTER DOTLESS J} and"))
+  ;; More than one space should work as well.
+  (should (equal (parsebib-clean-TeX-markup "\\AA{}  and")  "\N{LATIN CAPITAL LETTER A WITH RING ABOVE} and"))
+  (should (equal (parsebib-clean-TeX-markup "\\AA{}   and") "\N{LATIN CAPITAL LETTER A WITH RING ABOVE} and")))
+
+(ert-deftest parsebib-clean-TeX-markup-nonascii-letters-without-braces ()
+  ;; The space should be removed.
+  (should (equal (parsebib-clean-TeX-markup "\\AA n") "\N{LATIN CAPITAL LETTER A WITH RING ABOVE}n"))
+  (should (equal (parsebib-clean-TeX-markup "\\AE n") "\N{LATIN CAPITAL LETTER AE}n"))
+  (should (equal (parsebib-clean-TeX-markup "\\DH n") "\N{LATIN CAPITAL LETTER ETH}n"))
+  (should (equal (parsebib-clean-TeX-markup "\\DJ n") "\N{LATIN CAPITAL LETTER ETH}n"))
+  (should (equal (parsebib-clean-TeX-markup "\\L n")  "\N{LATIN CAPITAL LETTER L WITH STROKE}n"))
+  (should (equal (parsebib-clean-TeX-markup "\\SS n") "\N{LATIN CAPITAL LETTER SHARP S}n"))
+  (should (equal (parsebib-clean-TeX-markup "\\NG n") "\N{LATIN CAPITAL LETTER ENG}n"))
+  (should (equal (parsebib-clean-TeX-markup "\\OE n") "\N{LATIN CAPITAL LIGATURE OE}n"))
+  (should (equal (parsebib-clean-TeX-markup "\\O n")  "\N{LATIN CAPITAL LETTER O WITH STROKE}n"))
+  (should (equal (parsebib-clean-TeX-markup "\\TH n") "\N{LATIN CAPITAL LETTER THORN}n"))
+  (should (equal (parsebib-clean-TeX-markup "\\aa n") "\N{LATIN SMALL LETTER A WITH RING ABOVE}n"))
+  (should (equal (parsebib-clean-TeX-markup "\\ae n") "\N{LATIN SMALL LETTER AE}n"))
+  (should (equal (parsebib-clean-TeX-markup "\\dh n") "\N{LATIN SMALL LETTER ETH}n"))
+  (should (equal (parsebib-clean-TeX-markup "\\dj n") "\N{LATIN SMALL LETTER ETH}n"))
+  (should (equal (parsebib-clean-TeX-markup "\\l n")  "\N{LATIN SMALL LETTER L WITH STROKE}n"))
+  (should (equal (parsebib-clean-TeX-markup "\\ss n") "\N{LATIN SMALL LETTER SHARP S}n"))
+  (should (equal (parsebib-clean-TeX-markup "\\ng n") "\N{LATIN SMALL LETTER ENG}n"))
+  (should (equal (parsebib-clean-TeX-markup "\\oe n") "\N{LATIN SMALL LIGATURE OE}n"))
+  (should (equal (parsebib-clean-TeX-markup "\\o n")  "\N{LATIN SMALL LETTER O WITH STROKE}n"))
+  (should (equal (parsebib-clean-TeX-markup "\\th n") "\N{LATIN SMALL LETTER THORN}n"))
+  (should (equal (parsebib-clean-TeX-markup "\\ij n") "ijn"))
+  (should (equal (parsebib-clean-TeX-markup "\\i n")  "\N{LATIN SMALL LETTER DOTLESS I}n"))
+  (should (equal (parsebib-clean-TeX-markup "\\j n")  "\N{LATIN SMALL LETTER DOTLESS J}n"))
+  ;; More than one space should work as well.
+  (should (equal (parsebib-clean-TeX-markup "\\AA  n")  "\N{LATIN CAPITAL LETTER A WITH RING ABOVE}n"))
+  (should (equal (parsebib-clean-TeX-markup "\\AA   n") "\N{LATIN CAPITAL LETTER A WITH RING ABOVE}n"))
+  ;; If there is no space, treat it as an unknown command.
+  (should (equal (parsebib-clean-TeX-markup "\\AAn")  "\\AAn")))
+
+(ert-deftest parsebib-clean-TeX-markup-other-commands ()
+  ;; Do not change commands with no arguments.
+  (should (equal (parsebib-clean-TeX-markup "\\LaTeX and") "\\LaTeX and"))
+  ;; Commands with an empty set of braces should remain; the braces should be removed.
+  (should (equal (parsebib-clean-TeX-markup "\\LaTeX{} and") "\\LaTeX and"))
+  ;; Obligatory arguments should replace the command.
+  (should (equal (parsebib-clean-TeX-markup "\\foo{bar} and") "bar and"))
+  ;; Optional arguments should be removed, even empty ones.
+  (should (equal (parsebib-clean-TeX-markup "\\foo[]{bar} and") "bar and"))
+  (should (equal (parsebib-clean-TeX-markup "\\foo[bar]{baz} and") "baz and"))
+  (should (equal (parsebib-clean-TeX-markup "\\foo[bar][baz]{boo} and") "boo and"))
+  (should (equal (parsebib-clean-TeX-markup "\\foo[bar][baz]{} and") "\\foo and")))
+
+(ert-deftest parsebib-clean-TeX-markup-braces ()
+  ;; Braces not part of a command should be removed.
+  (should (equal (parsebib-clean-TeX-markup "The {UN} should be all-caps.") "The UN should be all-caps.")))
+
 ;;; parsebib-test.el ends here