New upstream version 12.4.20191028+ds

author: Aaron M. Ucko <ucko@debian.org> 2020-04-27 19:49:53 -0400
committer: Aaron M. Ucko <ucko@debian.org> 2020-04-27 19:49:53 -0400
commit: a312eefb2ebf8cd53bc1fc6df031d5f932744bca (patch)
tree: eb90167547aced3f87db0daf31c2b5ac100378e7 /README
parent: d621f5b6a7ac4449fb8891383da5eb289c7f10d5 (diff)
1 files changed, 118 insertions, 84 deletions
diff --git a/README b/README
index f2edbea..8b94e8a 100644
--- a/README
+++ b/README
@@ -4,7 +4,7 @@ Searching, retrieving, and parsing data from NCBI databases through the Unix com
 
 INTRODUCTION
 
-Entrez Direct (EDirect) provides access to the NCBI's suite of interconnected databases from a Unix terminal window. Search terms are given in command-line arguments. Individual operations are connected with Unix pipes to allow construction of multi-step queries. Selected records can then be retrieved in a variety of formats.
+Entrez Direct (EDirect) provides access to the NCBI's Entrez suite of interconnected databases from a Unix terminal window. Search terms are given in command-line arguments. Individual operations are connected with Unix pipes to allow construction of multi-step queries. Selected records can then be retrieved in a variety of formats.
 
 EDirect also includes an argument-driven utility that simplifies the extraction of results in structured XML or JSON format, and a program that builds a URL from command-line arguments for easy access to external CGI data services. These can eliminate the need for writing custom software to answer ad hoc questions.
 
@@ -38,7 +38,7 @@ Elink looks up precomputed neighbors within a database, or finds associated reco
 
   elink -target gene
 
-For PubMed, elink can also follow reference connections from the NIH Open Citation Collection dataset (see PMID 31600197):
+and for PubMed it can follow references in the NIH Open Citation Collection dataset (see PMID 31600197):
 
   elink -cited
 
@@ -79,7 +79,7 @@ Records are then retrieved in FASTA format:
 As anticipated, the results include the enzyme that splits beta-carotene into two molecules of retinal:
 
   ...
-  >grep .2 beta,beta-carotene 15,15'-dioxygenase isoform 1 [Mus musculus]
+  >NP_067461.2 beta,beta-carotene 15,15'-dioxygenase isoform 1 [Mus musculus]
   MEIIFGQNKKEQLEPVQAKVTGSIPAWLQGTLLRNGPGMHTVGESKYNHWFDGLALLHSFSIRDGEVFYR
   SKYLQSDTYIANIEANRIVVSEFGTMAYPDPCKNIFSKAFSYLSHTIPDFTDNCLINIMKCGEDFYATTE
   TNYIRKIDPQTLETLEKVDYRKYVAVNLATSHPHYDEAGNVLNMGTSVVDKGRTKYVIFKIPATVPDSKK
@@ -149,6 +149,8 @@ to select papers with fewer than 6 authors and print a table of the most frequen
   11    Chou, J
   8     Cohen, SN
   7     Groisman, EA
+  4     Darzins, A
+  3     Castilho, BA
   ...
 
 SAVING DATA IN VARIABLES
@@ -175,14 +177,14 @@ SEQUENCE QUALIFIERS
 
 The NCBI represents sequence records in a data model based on the central dogma of molecular biology. A sequence can have multiple features, which carry information about the biology of a given region, including the transformations involved in gene expression. A feature can have multiple qualifiers, which store specific details about that feature (e.g., name of the gene, genetic code used for translation).
 
-The data hierarchy is easily explored using a -pattern {sequence} -group {feature} -block {qualifier} construct. As a convenience, an -insd helper function is provided for generating the appropriate nested extraction commands from feature and qualifier names on the command line. For example, processing the results of a search on cone snail venom:
+The data hierarchy is easily explored using a -pattern {sequence} -group {feature} -block {qualifier} construct. As a convenience, an -insd helper function generates the appropriate nested extraction commands from feature and qualifier names on the command line. For example, processing the results of a search on cone snail venom:
 
   esearch -db protein -query "conotoxin" -feature mat_peptide |
   efetch -format gpc |
   xtract -insd complete mat_peptide "%peptide" product peptide |
   grep -i conotoxin | sort -t $'\t' -u -k 2,2n
 
-returns the accession, length, name, and sequence for a sample of neurotoxic peptides:
+returns the accession, peptide length, name, and sequence for a sample of neurotoxic peptides:
 
   ADB43131.1    15    conotoxin Cal 1b      LCCKRHHGCHPCGRT
   ADB43128.1    16    conotoxin Cal 5.1     DPAPCCQHPIETCCRR
@@ -234,7 +236,7 @@ A gene can be linked to a biochemical pathway that utilizes its product:
   elink -target biosystems |
   efilter -pathway wikipathways |
 
-Linking from that pathway back to the gene database:
+Linking from the pathway record back to the gene database:
 
   elink -target gene |
   efetch -format docsum |
@@ -242,8 +244,8 @@ Linking from that pathway back to the gene database:
   grep -v pseudogene | grep -v uncharacterized |
   sort -f
 
-returns all of the genes involved in the pathway:
- 
+returns the set of all genes known to be involved in the pathway:
+
   AANAT      aralkylamine N-acetyltransferase
   ACADM      acyl-CoA dehydrogenase medium chain
   ACHE       acetylcholinesterase (Cartwright blood group)
@@ -252,27 +254,37 @@ returns all of the genes involved in the pathway:
 
 RECURSIVE DEFINITIONS
 
-Gene and taxonomy records contain recursively-defined objects:
+When a recursive object is given to an exploration command:
 
-  esearch -db gene -query "rbcL [GENE] AND maize [ORGN]" |
-  efetch -format xml |
-  xtract -outline |
-  grep -w Gene-commentary
+  efetch -db taxonomy -id 9606,7227,10090 -format xml |
+  xtract -pattern Taxon -element TaxId ScientificName
+
+selection by -element only examines fields in the outermost objects:
 
-that can appear at different nesting levels:
+  9606     Homo sapiens
+  7227     Drosophila melanogaster
+  10090    Mus musculus
 
-    Gene-commentary
-        Gene-commentary
-    Gene-commentary
-        Gene-commentary
-            Gene-commentary
-                Gene-commentary
-                    Gene-commentary
-                    ...
+The star-slash prefix will descend a single level into the hierarchy:
+
+  efetch -db taxonomy -id 9606,7227,10090 -format xml |
+  xtract -pattern Taxon -block "*/Taxon" \
+    -if Rank -is-not "no rank" \
+      -tab "\n" -element TaxId,Rank,ScientificName
+
+to print data on the individual lineage objects:
+
+  2759     superkingdom    Eukaryota
+  33208    kingdom         Metazoa
+  7711     phylum          Chordata
+  89593    subphylum       Craniata
+  8287     superclass      Sarcopterygii
+  40674    class           Mammalia
+  ...
 
-Recursive objects can be individually explored with a double-star-slash prefix:
+Recursive objects can be fully explored with a double-star-slash prefix:
 
-  esearch -db gene -query "DMD [GENE] AND human [ORGN]" |
+  esearch -db gene -query "rbcL [GENE] AND maize [ORGN]" |
   efetch -format xml |
   xtract -pattern Entrezgene -block "**/Gene-commentary" \
 
@@ -282,22 +294,21 @@ Metadata annotated in attributes:
 
 is selected with an "at" sign before the attribute name:
 
-    -tab "\n" -element Gene-commentary_type@value,Gene-commentary_accession
+    -if Gene-commentary_type@value -equals genomic \
+      -tab "\n" -element Gene-commentary_accession |
+  sort | uniq
 
-This prints every accession and type regardless of nesting depth:
+This prints every genomic accession regardless of nesting depth:
 
-  genomic    NC_000023
-  mRNA       XM_006724469
-  peptide    XP_006724532
-  mRNA       XM_011545467
-  peptide    XP_011543769
-  ...
+  NC_001666
+  X86563
+  Z11973
 
 HETEROGENEOUS OBJECTS
 
 A query on curated biological database associations:
 
-  nquire -get "http://mygene.info/v3/gene/2652" |
+  nquire -get http://mygene.info/v3/gene/2652 |
   xtract -j2x -set - -rec GeneRec |
 
 returns a heterogeneous mixture of objects in the pathway section:
@@ -316,7 +327,8 @@ returns a heterogeneous mixture of objects in the pathway section:
 
 The slash-star suffix is used to visit the individual components of a parent object without needing to explicitly specify their names. For printing, the name of a child object is indicated by a question mark:
 
-  xtract -pattern GeneRec -group "pathway/*" -pfc "\n" -element "?,name,id"
+  xtract -pattern GeneRec -group "pathway/*" \
+    -pfc "\n" -element "?,name,id"
 
 This displays a table of pathway database references:
 
@@ -359,8 +371,7 @@ is practically instantaneous. Printing the names of each author:
 
 allows creation of a frequency table:
 
-  sort-uniq-count-rank |
-  head -n 10
+  sort-uniq-count-rank
 
 that lists the authors who most often cited the original papers:
 
@@ -369,13 +380,9 @@ that lists the authors who most often cited the original papers:
   56     Wang JC
   49     Osheroff N
   48     Stasiak A
-  47     Sherratt DJ
-  45     Berger JM
-  41     Drlica K
-  41     Marko JF
-  36     Hirano T
+  ...
 
-Using the network service instead of the local cache would add 2 minutes to the running time.
+Using the network service instead of the local cache would add 2 minutes to the 10 second running time.
 
 LOCAL SEARCH INDEX
 
@@ -393,7 +400,7 @@ In local queries, a trailing asterisk is used to indicate term truncation:
 
   phrase-search -count "catabolite repress*"
 
-Using -counts instead of -count returns the expanded terms and the individual postings counts:
+Using -counts returns expanded terms and individual postings counts:
 
   phrase-search -counts "catabolite repress*"
 
@@ -419,7 +426,7 @@ Runs of tildes indicate the maximum distance between phrases:
 
 MeSH hierarchy code and year of publication are also indexed:
 
-  phrase-search -query "C02.782.417* [CODE] AND 2015:2018 [YEAR]"
+  phrase-search -query "C14.907.617.812* [TREE] AND 2015:2018 [YEAR]"
 
 An exact match can search for all or part of a title or abstract:
 
@@ -464,45 +471,60 @@ Recent research at Stanford defined biological themes, supported by dependency p
 
   Chemical-Gene
 
-    A+     agonism, activation                            Ec-    decreases expression/production
-    A-     antagonism, blocking                           Ec     affects expression/production (neutral)
-    Bc     binding, ligand (especially receptors)         N      inhibits
-    Ec+    increases expression/production
+    A+    agonism, activation
+    A-    antagonism, blocking
+    Bc    binding, ligand (especially receptors)
+    Ec+   increases expression/production
+    Ec-   decreases expression/production
+    Ec    affects expression/production (neutral)
+    N     inhibits
 
   Gene-Chemical
 
-    O      transport, channels                            Z      enzyme activity
-    K      metabolism, pharmacokinetics
+    O     transport, channels
+    K     metabolism, pharmacokinetics
+    Z     enzyme activity
 
   Chemical-Disease
 
-    T      treatment/therapy (including investigatory)    Pr     prevents, suppresses
-    C      inhibits cell growth (especially cancers)      Pa     alleviates, reduces
-    Sa     side effect/adverse event                      Jc     role in disease pathogenesis
+    T     treatment/therapy (including investigatory)
+    C     inhibits cell growth (especially cancers)
+    Sa    side effect/adverse event
+    Pr    prevents, suppresses
+    Pa    alleviates, reduces
+    Jc    role in disease pathogenesis
 
   Disease-Chemical
 
-    Mp     biomarkers (of disease progression)
+    Mp    biomarkers (of disease progression)
 
   Gene-Disease
 
-    U      causal mutations                               Te     possible therapeutic effect
-    Ud     mutations affecting disease course             Y      polymorphisms alter risk
-    D      drug targets                                   G      promotes progression
-    Jg     role in pathogenesis
+    U     causal mutations
+    Ud    mutations affecting disease course
+    D     drug targets
+    Jg    role in pathogenesis
+    Te    possible therapeutic effect
+    Y     polymorphisms alter risk
+    G     promotes progression
 
   Disease-Gene
 
-    Md     biomarkers (diagnostic)                        L      improper regulation linked to disease
-    X      overexpression in disease
+    Md    biomarkers (diagnostic)
+    X     overexpression in disease
+    L     improper regulation linked to disease
 
   Gene-Gene
 
-    Bg     binding, ligand (especially receptors)         I      signaling pathway
-    W      enhances response                              H      same protein or complex
-    V+     activates, stimulates                          Rg     regulation
-    Eg+    increases expression/production                Q      production by cell population
-    Eg     affects expression/production (neutral)
+    Bg    binding, ligand (especially receptors)
+    W     enhances response
+    V+    activates, stimulates
+    Eg+   increases expression/production
+    Eg    affects expression/production (neutral)
+    I     signaling pathway
+    H     same protein or complex
+    Rg    regulation
+    Q     production by cell population
 
 INTEGRATION WITH ENTREZ
 
@@ -510,7 +532,7 @@ The phrase-search -filter command allows UIDs to be generated by an EDirect sear
 
   esearch -db pubmed -query "complement system proteins [MESH]" -pub clinical |
   efetch -format uid |
-  phrase-search -filter "L [THME] AND D03* [CODE]"
+  phrase-search -filter "L [THME] AND D03* [TREE]"
 
 This finds PubMed clinical papers about complement proteins and limits them by the "improper regulation linked to disease" theme and the heterocyclic compounds MeSH chemical code:
 
@@ -521,6 +543,8 @@ This finds PubMed clinical papers about complement proteins and limits them by t
   24431228
   26151457
 
+Intermediate lists of PMIDs can be saved to a file and piped (with "cat") into a subsequent phrase-search -filter query.
+
 AUTOMATION AND COMPREHENSIVE EXPLORATION
 
 The phrase-search system can be easily automated. For example, a simple script can walk up the MeSH hierarchy:
@@ -529,7 +553,7 @@ The phrase-search system can be easily automated. For example, a simple script c
     var="${1%\*}"
     while :
     do
-      phrase-search -count "$var* [CODE]"
+      phrase-search -count "$var* [TREE]"
       case "$var" in
         *.* ) var="${var%????}" ;;
         *   ) break             ;;
@@ -541,12 +565,12 @@ The phrase-search system can be easily automated. For example, a simple script c
 
 from narrower to broader topics, producing counts of records at or below each level:
 
-  6665       c14 907 617 812*
-  51840      c14 907 617*
-  1610690    c14 907*
-  2299968    c14*
+  6678       c14 907 617 812*
+  52001      c14 907 617*
+  1618720    c14 907*
+  2313378    c14*
 
-Nested for loops perform an exhaustive query of themes paired with every other theme:
+Nested "for" loops perform a non-redundant pairwise comparison of themes:
 
   declare -a THEMES
   THEMES=( A+ A- Bc Bg C D Ec Ec+ Ec- Eg \
@@ -579,9 +603,11 @@ producing a table of co-occurrence counts:
   A+     Ec       10364
   ...
 
+Shrinking arrays are used to avoid unnecessary searches, e.g., querying both "A+ AND Ec" and "Ec AND A+", though each result is reported in both directions.
+
 IDENTIFIER CONVERSION
 
-The index-pubmed script also downloads MeSH descriptor files from NLM and creates a conversion file:
+The index-pubmed script downloads MeSH descriptor files from NLM and creates a conversion file:
 
   ...
   <Rec>
@@ -596,7 +622,9 @@ The index-pubmed script also downloads MeSH descriptor files from NLM and create
 that can be used for mapping MeSH codes to and from chemical or disease names. For example, running:
 
   cat $EDIRECT_PUBMED_MASTER/Data/meshconv.xml |
-  xtract -pattern Rec -if Name -starts-with "ataxia telangiectasia" -element Code
+  xtract -pattern Rec \
+    -if Name -starts-with "ataxia telangiectasia" \
+      -element Code
 
 will return:
 
@@ -610,11 +638,13 @@ RAPIDLY SCANNING ALL OF PUBMED
 If the expand-current script is run after PubMed indexing, an ad hoc scan can be performed on the entire set of live PubMed records:
 
   cat "$EDIRECT_PUBMED_MASTER"/Current/*.xml |
-  xtract -timer -pattern PubmedArticle -if "#Author" -eq 7 -element MedlineCitation/PMID LastName
+  xtract -timer -pattern PubmedArticle \
+    -if "#Author" -eq 7 \
+      -element MedlineCitation/PMID LastName
 
 in this case finding articles with seven authors. (Author count is not indexed by Entrez or locally by EDirect.)
 
-(Note that the data produced by running both index-extras and expand-current will not fit on a 500 GB drive.)
+(Note that the data produced by running both index-extras and expand-current may not fit on a 500 GB drive.)
 
 IMPLEMENTATION DETAILS
 
@@ -640,12 +670,11 @@ runs in 18 seconds and returns 1030 chemicals that might act on gene products in
 
 CONVERSION OF JSON TO XML
 
-Data retrieved in JSON format can be converted to XML with xtract -j2x:
+Consolidated gene information retrieved in JSON format:
 
   nquire -get http://mygene.info/v3 gene 3043 |
-  xtract -j2x -set - -rec GeneRec -nest plural
 
-This will take a multi-dimensional JSON array of exon coordinates:
+contains a multi-dimensional JSON array of exon coordinates:
 
   "position": [
     [
@@ -662,7 +691,11 @@ This will take a multi-dimensional JSON array of exon coordinates:
     ]
   ],
 
-and derive a parent name to keep the nesting structure intact in XML:
+This can be converted to XML with xtract -j2x:
+
+  xtract -j2x -set - -rec GeneRec -nest plural |
+
+using -nest plural to derive a parent name that keeps the internal structure intact in XML:
 
   <positions>
     <position>5225463</position>
@@ -672,7 +705,8 @@ and derive a parent name to keep the nesting structure intact in XML:
 
 Individual exons can then be visited by piping the record through:
 
-  xtract -pattern GeneRec -group exons -block positions -pfc "\n" -element position
+  xtract -pattern GeneRec -group exons -block positions \
+    -pfc "\n" -element position
 
 to print a tab-delimited table of start and stop positions:
 
@@ -707,7 +741,7 @@ which takes command-line arguments of XML tag names for wrapping the entire set,
 
 FUTURE DIRECTIONS
 
-An iterative search/fetch/extract/compute cycle, with customized local indices, and no penalties for exhaustive exploration, has the potential for opening up discovery by computation to a larger audience of laboratory biologists without requiring extensive bioinformatics experience.
+An iterative search/fetch/extract/compute cycle, with customized local indices, integration of natural language processing results, and no penalties for exhaustive exploration, has the potential for opening up discovery by computation to a larger audience of laboratory biologists without requiring extensive bioinformatics experience.
 
 INSTALLATION
 
@@ -723,9 +757,9 @@ To install the EDirect software, open a terminal window and execute one of the f
 
 If neither curl nor wget are available, see the installation commands in the EDirect web documentation.
 
-This downloads several scripts into an "edirect" folder in the user's home directory. The setup.sh script next downloads any missing Perl modules, and then fetches platform-specific executables for xtract and rchive.
+This downloads several scripts into an "edirect" folder in the user's home directory. It then fetches any missing Perl modules, and installs platform-specific executables for xtract and rchive.
 
-At the end of the installation process, the script will ask for permission to add EDirect to your PATH permanently by editing your configuration file. If you answer "y" it will add:
+At the end of this process, the script will ask for permission to add EDirect to your PATH permanently by editing your configuration file. If you answer "y" it will add:
 
   export PATH=${PATH}:$HOME/edirect