author Aaron M. Ucko <ucko@debian.org> 2020-04-27 19:49:53 -0400
committer Aaron M. Ucko <ucko@debian.org> 2020-04-27 19:49:53 -0400
commit a312eefb2ebf8cd53bc1fc6df031d5f932744bca
tree eb90167547aced3f87db0daf31c2b5ac100378e7 /README
parent d621f5b6a7ac4449fb8891383da5eb289c7f10d5
New upstream version 12.4.20191028+ds
Diffstat (limited to 'README')
-rw-r--r-- README 202
1 files changed, 118 insertions, 84 deletions
diff --git a/README b/README
index f2edbea..8b94e8a 100644
--- a/README
+++ b/README
@@ -4,7 +4,7 @@ Searching, retrieving, and parsing data from NCBI databases through the Unix com
INTRODUCTION
-Entrez Direct (EDirect) provides access to the NCBI's suite of interconnected databases from a Unix terminal window. Search terms are given in command-line arguments. Individual operations are connected with Unix pipes to allow construction of multi-step queries. Selected records can then be retrieved in a variety of formats.
+Entrez Direct (EDirect) provides access to the NCBI's Entrez suite of interconnected databases from a Unix terminal window. Search terms are given in command-line arguments. Individual operations are connected with Unix pipes to allow construction of multi-step queries. Selected records can then be retrieved in a variety of formats.
EDirect also includes an argument-driven utility that simplifies the extraction of results in structured XML or JSON format, and a program that builds a URL from command-line arguments for easy access to external CGI data services. These can eliminate the need for writing custom software to answer ad hoc questions.
@@ -38,7 +38,7 @@ Elink looks up precomputed neighbors within a database, or finds associated reco
elink -target gene
-For PubMed, elink can also follow reference connections from the NIH Open Citation Collection dataset (see PMID 31600197):
+and for PubMed it can follow references in the NIH Open Citation Collection dataset (see PMID 31600197):
elink -cited
@@ -79,7 +79,7 @@ Records are then retrieved in FASTA format:
As anticipated, the results include the enzyme that splits beta-carotene into two molecules of retinal:
...
- >grep .2 beta,beta-carotene 15,15'-dioxygenase isoform 1 [Mus musculus]
+ >NP_067461.2 beta,beta-carotene 15,15'-dioxygenase isoform 1 [Mus musculus]
MEIIFGQNKKEQLEPVQAKVTGSIPAWLQGTLLRNGPGMHTVGESKYNHWFDGLALLHSFSIRDGEVFYR
SKYLQSDTYIANIEANRIVVSEFGTMAYPDPCKNIFSKAFSYLSHTIPDFTDNCLINIMKCGEDFYATTE
TNYIRKIDPQTLETLEKVDYRKYVAVNLATSHPHYDEAGNVLNMGTSVVDKGRTKYVIFKIPATVPDSKK
@@ -149,6 +149,8 @@ to select papers with fewer than 6 authors and print a table of the most frequen
11 Chou, J
8 Cohen, SN
7 Groisman, EA
+ 4 Darzins, A
+ 3 Castilho, BA
...
SAVING DATA IN VARIABLES
@@ -175,14 +177,14 @@ SEQUENCE QUALIFIERS
The NCBI represents sequence records in a data model based on the central dogma of molecular biology. A sequence can have multiple features, which carry information about the biology of a given region, including the transformations involved in gene expression. A feature can have multiple qualifiers, which store specific details about that feature (e.g., name of the gene, genetic code used for translation).
-The data hierarchy is easily explored using a -pattern {sequence} -group {feature} -block {qualifier} construct. As a convenience, an -insd helper function is provided for generating the appropriate nested extraction commands from feature and qualifier names on the command line. For example, processing the results of a search on cone snail venom:
+The data hierarchy is easily explored using a -pattern {sequence} -group {feature} -block {qualifier} construct. As a convenience, an -insd helper function generates the appropriate nested extraction commands from feature and qualifier names on the command line. For example, processing the results of a search on cone snail venom:
esearch -db protein -query "conotoxin" -feature mat_peptide |
efetch -format gpc |
xtract -insd complete mat_peptide "%peptide" product peptide |
grep -i conotoxin | sort -t $'\t' -u -k 2,2n
-returns the accession, length, name, and sequence for a sample of neurotoxic peptides:
+returns the accession, peptide length, name, and sequence for a sample of neurotoxic peptides:
ADB43131.1 15 conotoxin Cal 1b LCCKRHHGCHPCGRT
ADB43128.1 16 conotoxin Cal 5.1 DPAPCCQHPIETCCRR
@@ -234,7 +236,7 @@ A gene can be linked to a biochemical pathway that utilizes its product:
elink -target biosystems |
efilter -pathway wikipathways |
-Linking from that pathway back to the gene database:
+Linking from the pathway record back to the gene database:
elink -target gene |
efetch -format docsum |
@@ -242,8 +244,8 @@ Linking from that pathway back to the gene database:
grep -v pseudogene | grep -v uncharacterized |
sort -f
-returns all of the genes involved in the pathway:
-
+returns the set of all genes known to be involved in the pathway:
+
AANAT aralkylamine N-acetyltransferase
ACADM acyl-CoA dehydrogenase medium chain
ACHE acetylcholinesterase (Cartwright blood group)
@@ -252,27 +254,37 @@ returns all of the genes involved in the pathway:
RECURSIVE DEFINITIONS
-Gene and taxonomy records contain recursively-defined objects:
+When a recursive object is given to an exploration command:
- esearch -db gene -query "rbcL [GENE] AND maize [ORGN]" |
- efetch -format xml |
- xtract -outline |
- grep -w Gene-commentary
+ efetch -db taxonomy -id 9606,7227,10090 -format xml |
+ xtract -pattern Taxon -element TaxId ScientificName
+
+selection by -element only examines fields in the outermost objects:
-that can appear at different nesting levels:
+ 9606 Homo sapiens
+ 7227 Drosophila melanogaster
+ 10090 Mus musculus
- Gene-commentary
- Gene-commentary
- Gene-commentary
- Gene-commentary
- Gene-commentary
- Gene-commentary
- Gene-commentary
- ...
+The star-slash prefix will descend a single level into the hierarchy:
+
+ efetch -db taxonomy -id 9606,7227,10090 -format xml |
+ xtract -pattern Taxon -block "*/Taxon" \
+ -if Rank -is-not "no rank" \
+ -tab "\n" -element TaxId,Rank,ScientificName
+
+to print data on the individual lineage objects:
+
+ 2759 superkingdom Eukaryota
+ 33208 kingdom Metazoa
+ 7711 phylum Chordata
+ 89593 subphylum Craniata
+ 8287 superclass Sarcopterygii
+ 40674 class Mammalia
+ ...
-Recursive objects can be individually explored with a double-star-slash prefix:
+Recursive objects can be fully explored with a double-star-slash prefix:
- esearch -db gene -query "DMD [GENE] AND human [ORGN]" |
+ esearch -db gene -query "rbcL [GENE] AND maize [ORGN]" |
efetch -format xml |
xtract -pattern Entrezgene -block "**/Gene-commentary" \
@@ -282,22 +294,21 @@ Metadata annotated in attributes:
is selected with an "at" sign before the attribute name:
- -tab "\n" -element Gene-commentary_type@value,Gene-commentary_accession
+ -if Gene-commentary_type@value -equals genomic \
+ -tab "\n" -element Gene-commentary_accession |
+ sort | uniq
-This prints every accession and type regardless of nesting depth:
+This prints every genomic accession regardless of nesting depth:
- genomic NC_000023
- mRNA XM_006724469
- peptide XP_006724532
- mRNA XM_011545467
- peptide XP_011543769
- ...
+ NC_001666
+ X86563
+ Z11973
HETEROGENEOUS OBJECTS
A query on curated biological database associations:
- nquire -get "http://mygene.info/v3/gene/2652" |
+ nquire -get http://mygene.info/v3/gene/2652 |
xtract -j2x -set - -rec GeneRec |
returns a heterogeneous mixture of objects in the pathway section:
@@ -316,7 +327,8 @@ returns a heterogeneous mixture of objects in the pathway section:
The slash-star suffix is used to visit the individual components of a parent object without needing to explicitly specify their names. For printing, the name of a child object is indicated by a question mark:
- xtract -pattern GeneRec -group "pathway/*" -pfc "\n" -element "?,name,id"
+ xtract -pattern GeneRec -group "pathway/*" \
+ -pfc "\n" -element "?,name,id"
This displays a table of pathway database references:
@@ -359,8 +371,7 @@ is practically instantaneous. Printing the names of each author:
allows creation of a frequency table:
- sort-uniq-count-rank |
- head -n 10
+ sort-uniq-count-rank
that lists the authors who most often cited the original papers:
@@ -369,13 +380,9 @@ that lists the authors who most often cited the original papers:
56 Wang JC
49 Osheroff N
48 Stasiak A
- 47 Sherratt DJ
- 45 Berger JM
- 41 Drlica K
- 41 Marko JF
- 36 Hirano T
+ ...
-Using the network service instead of the local cache would add 2 minutes to the running time.
+Using the network service instead of the local cache would add 2 minutes to the 10 second running time.
LOCAL SEARCH INDEX
@@ -393,7 +400,7 @@ In local queries, a trailing asterisk is used to indicate term truncation:
phrase-search -count "catabolite repress*"
-Using -counts instead of -count returns the expanded terms and the individual postings counts:
+Using -counts returns expanded terms and individual postings counts:
phrase-search -counts "catabolite repress*"
@@ -419,7 +426,7 @@ Runs of tildes indicate the maximum distance between phrases:
MeSH hierarchy code and year of publication are also indexed:
- phrase-search -query "C02.782.417* [CODE] AND 2015:2018 [YEAR]"
+ phrase-search -query "C14.907.617.812* [TREE] AND 2015:2018 [YEAR]"
An exact match can search for all or part of a title or abstract:
@@ -464,45 +471,60 @@ Recent research at Stanford defined biological themes, supported by dependency p
Chemical-Gene
- A+ agonism, activation Ec- decreases expression/production
- A- antagonism, blocking Ec affects expression/production (neutral)
- Bc binding, ligand (especially receptors) N inhibits
- Ec+ increases expression/production
+ A+ agonism, activation
+ A- antagonism, blocking
+ Bc binding, ligand (especially receptors)
+ Ec+ increases expression/production
+ Ec- decreases expression/production
+ Ec affects expression/production (neutral)
+ N inhibits
Gene-Chemical
- O transport, channels Z enzyme activity
- K metabolism, pharmacokinetics
+ O transport, channels
+ K metabolism, pharmacokinetics
+ Z enzyme activity
Chemical-Disease
- T treatment/therapy (including investigatory) Pr prevents, suppresses
- C inhibits cell growth (especially cancers) Pa alleviates, reduces
- Sa side effect/adverse event Jc role in disease pathogenesis
+ T treatment/therapy (including investigatory)
+ C inhibits cell growth (especially cancers)
+ Sa side effect/adverse event
+ Pr prevents, suppresses
+ Pa alleviates, reduces
+ Jc role in disease pathogenesis
Disease-Chemical
- Mp biomarkers (of disease progression)
+ Mp biomarkers (of disease progression)
Gene-Disease
- U causal mutations Te possible therapeutic effect
- Ud mutations affecting disease course Y polymorphisms alter risk
- D drug targets G promotes progression
- Jg role in pathogenesis
+ U causal mutations
+ Ud mutations affecting disease course
+ D drug targets
+ Jg role in pathogenesis
+ Te possible therapeutic effect
+ Y polymorphisms alter risk
+ G promotes progression
Disease-Gene
- Md biomarkers (diagnostic) L improper regulation linked to disease
- X overexpression in disease
+ Md biomarkers (diagnostic)
+ X overexpression in disease
+ L improper regulation linked to disease
Gene-Gene
- Bg binding, ligand (especially receptors) I signaling pathway
- W enhances response H same protein or complex
- V+ activates, stimulates Rg regulation
- Eg+ increases expression/production Q production by cell population
- Eg affects expression/production (neutral)
+ Bg binding, ligand (especially receptors)
+ W enhances response
+ V+ activates, stimulates
+ Eg+ increases expression/production
+ Eg affects expression/production (neutral)
+ I signaling pathway
+ H same protein or complex
+ Rg regulation
+ Q production by cell population
INTEGRATION WITH ENTREZ
@@ -510,7 +532,7 @@ The phrase-search -filter command allows UIDs to be generated by an EDirect sear
esearch -db pubmed -query "complement system proteins [MESH]" -pub clinical |
efetch -format uid |
- phrase-search -filter "L [THME] AND D03* [CODE]"
+ phrase-search -filter "L [THME] AND D03* [TREE]"
This finds PubMed clinical papers about complement proteins and limits them by the "improper regulation linked to disease" theme and the heterocyclic compounds MeSH chemical code:
@@ -521,6 +543,8 @@ This finds PubMed clinical papers about complement proteins and limits them by t
24431228
26151457
+Intermediate lists of PMIDs can be saved to a file and piped (with "cat") into a subsequent phrase-search -filter query.
+
AUTOMATION AND COMPREHENSIVE EXPLORATION
The phrase-search system can be easily automated. For example, a simple script can walk up the MeSH hierarchy:
@@ -529,7 +553,7 @@ The phrase-search system can be easily automated. For example, a simple script c
var="${1%\*}"
while :
do
- phrase-search -count "$var* [CODE]"
+ phrase-search -count "$var* [TREE]"
case "$var" in
*.* ) var="${var%????}" ;;
* ) break ;;
@@ -541,12 +565,12 @@ The phrase-search system can be easily automated. For example, a simple script c
from narrower to broader topics, producing counts of records at or below each level:
- 6665 c14 907 617 812*
- 51840 c14 907 617*
- 1610690 c14 907*
- 2299968 c14*
+ 6678 c14 907 617 812*
+ 52001 c14 907 617*
+ 1618720 c14 907*
+ 2313378 c14*
-Nested for loops perform an exhaustive query of themes paired with every other theme:
+Nested "for" loops perform a non-redundant pairwise comparison of themes:
declare -a THEMES
THEMES=( A+ A- Bc Bg C D Ec Ec+ Ec- Eg \
@@ -579,9 +603,11 @@ producing a table of co-occurrence counts:
A+ Ec 10364
...
+Shrinking arrays are used to avoid unnecessary searches, e.g., querying both "A+ AND Ec" and "Ec AND A+", though each result is reported in both directions.
+
IDENTIFIER CONVERSION
-The index-pubmed script also downloads MeSH descriptor files from NLM and creates a conversion file:
+The index-pubmed script downloads MeSH descriptor files from NLM and creates a conversion file:
...
<Rec>
@@ -596,7 +622,9 @@ The index-pubmed script also downloads MeSH descriptor files from NLM and create
that can be used for mapping MeSH codes to and from chemical or disease names. For example, running:
cat $EDIRECT_PUBMED_MASTER/Data/meshconv.xml |
- xtract -pattern Rec -if Name -starts-with "ataxia telangiectasia" -element Code
+ xtract -pattern Rec \
+ -if Name -starts-with "ataxia telangiectasia" \
+ -element Code
will return:
@@ -610,11 +638,13 @@ RAPIDLY SCANNING ALL OF PUBMED
If the expand-current script is run after PubMed indexing, an ad hoc scan can be performed on the entire set of live PubMed records:
cat "$EDIRECT_PUBMED_MASTER"/Current/*.xml |
- xtract -timer -pattern PubmedArticle -if "#Author" -eq 7 -element MedlineCitation/PMID LastName
+ xtract -timer -pattern PubmedArticle \
+ -if "#Author" -eq 7 \
+ -element MedlineCitation/PMID LastName
in this case finding articles with seven authors. (Author count is not indexed by Entrez or locally by EDirect.)
-(Note that the data produced by running both index-extras and expand-current will not fit on a 500 GB drive.)
+(Note that the data produced by running both index-extras and expand-current may not fit on a 500 GB drive.)
IMPLEMENTATION DETAILS
@@ -640,12 +670,11 @@ runs in 18 seconds and returns 1030 chemicals that might act on gene products in
CONVERSION OF JSON TO XML
-Data retrieved in JSON format can be converted to XML with xtract -j2x:
+Consolidated gene information retrieved in JSON format:
nquire -get http://mygene.info/v3 gene 3043 |
- xtract -j2x -set - -rec GeneRec -nest plural
-This will take a multi-dimensional JSON array of exon coordinates:
+contains a multi-dimensional JSON array of exon coordinates:
"position": [
[
@@ -662,7 +691,11 @@ This will take a multi-dimensional JSON array of exon coordinates:
]
],
-and derive a parent name to keep the nesting structure intact in XML:
+This can be converted to XML with xtract -j2x:
+
+ xtract -j2x -set - -rec GeneRec -nest plural |
+
+using -nest plural to derive a parent name that keeps the internal structure intact in XML:
<positions>
<position>5225463</position>
@@ -672,7 +705,8 @@ and derive a parent name to keep the nesting structure intact in XML:
Individual exons can then be visited by piping the record through:
- xtract -pattern GeneRec -group exons -block positions -pfc "\n" -element position
+ xtract -pattern GeneRec -group exons -block positions \
+ -pfc "\n" -element position
to print a tab-delimited table of start and stop positions:
@@ -707,7 +741,7 @@ which takes command-line arguments of XML tag names for wrapping the entire set,
FUTURE DIRECTIONS
-An iterative search/fetch/extract/compute cycle, with customized local indices, and no penalties for exhaustive exploration, has the potential for opening up discovery by computation to a larger audience of laboratory biologists without requiring extensive bioinformatics experience.
+An iterative search/fetch/extract/compute cycle, with customized local indices, integration of natural language processing results, and no penalties for exhaustive exploration, has the potential for opening up discovery by computation to a larger audience of laboratory biologists without requiring extensive bioinformatics experience.
INSTALLATION
@@ -723,9 +757,9 @@ To install the EDirect software, open a terminal window and execute one of the f
If neither curl nor wget are available, see the installation commands in the EDirect web documentation.
-This downloads several scripts into an "edirect" folder in the user's home directory. The setup.sh script next downloads any missing Perl modules, and then fetches platform-specific executables for xtract and rchive.
+This downloads several scripts into an "edirect" folder in the user's home directory. It then fetches any missing Perl modules, and installs platform-specific executables for xtract and rchive.
-At the end of the installation process, the script will ask for permission to add EDirect to your PATH permanently by editing your configuration file. If you answer "y" it will add:
+At the end of this process, the script will ask for permission to add EDirect to your PATH permanently by editing your configuration file. If you answer "y" it will add:
export PATH=${PATH}:$HOME/edirect
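Note on the AUTOMATION AND COMPREHENSIVE EXPLORATION hunk above: the MeSH-walking loop relies on two shell parameter expansions, "${1%\*}" to drop a trailing asterisk and "${var%????}" to drop the last four characters. The following is a minimal sketch of that trimming logic only, not the README's script. The function name walk_up is hypothetical, and echo stands in for the phrase-search -count call so that it runs without a local index. It assumes each level below the top is a dot followed by three characters, as in the example code.

```shell
#!/bin/bash
# Sketch only: walk_up is a hypothetical name, and echo replaces the
# phrase-search -count call so the loop can run without a local index.
walk_up() {
  # Drop one trailing asterisk, if present (as var="${1%\*}" does in the hunk).
  local var="${1%\*}"
  while :
  do
    # The script in the diff runs: phrase-search -count "$var* [TREE]"
    echo "$var* [TREE]"
    case "$var" in
      # A dot remains: strip the last four characters (".NNN") to move up a level.
      *.* ) var="${var%????}" ;;
      # No dot left: this is the top-level category, so stop.
      * ) break ;;
    esac
  done
}

walk_up "C14.907.617.812*"
```

Each printed line is the query string that would be passed to phrase-search at that level, from the narrowest code to the top-level category. A code whose levels are not all three characters long would be trimmed incorrectly by the fixed four-character cut.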