ENTREZ DIRECT: COMMAND LINE ACCESS TO NCBI ENTREZ DATABASES
Searching, retrieving, and parsing data from NCBI databases through the Unix command line.
INTRODUCTION
Entrez Direct (EDirect) provides access to the NCBI's Entrez suite of interconnected databases from a Unix terminal window. Search terms are given in command-line arguments. Individual operations are connected with Unix pipes to allow construction of multi-step queries. Selected records can then be retrieved in a variety of formats.
EDirect also includes an argument-driven utility that simplifies the extraction of results in structured XML or JSON format, and a program that builds a URL from command-line arguments for easy access to external CGI data services. These can eliminate the need for writing custom software to answer ad hoc questions.
Queries can move seamlessly between EDirect commands and Unix utilities or scripts to perform actions that cannot be accomplished entirely within Entrez.
PROGRAMMATIC ACCESS
Several underlying network services provide access to different facets of Entrez. These include searching by indexed terms, looking up precomputed neighbors or links, filtering results by date or category, and downloading record summaries or reports. The same functionalities are available on the web or when using programmatic methods.
EDirect navigation programs (esearch, elink, efilter, and efetch) communicate by means of a small structured message, which can be passed invisibly between operations with a Unix pipe. The message includes the current database, so it does not need to be given as an argument after the first step.
All EDirect commands are designed to work on large sets of data. Intermediate results are stored on the Entrez history server. For best performance, obtain an API Key from NCBI, and place the following line in your .bash_profile file:
export NCBI_API_KEY=user_api_key_goes_here
Each program also has a -help command that prints detailed information about available arguments.
NAVIGATION FUNCTIONS
Esearch performs a new Entrez search using terms in indexed fields. It requires a -db argument for the database name and uses -query to obtain the search terms. For PubMed, without field qualifiers, the server uses automatic term mapping to compose a search strategy by translating the supplied query:
esearch -db pubmed -query "selective serotonin reuptake inhibitor"
Search terms can also be qualified with bracketed field names:
esearch -db nucleotide -query "insulin [PROT] AND rodents [ORGN]"
Elink looks up precomputed neighbors within a database, or finds associated records in other databases:
elink -related
elink -target gene
and for PubMed it can follow references in the NIH Open Citation Collection dataset (see PMID 31600197):
elink -cited
elink -cites
Efilter limits the results of a previous query, with shortcuts that can also be used in esearch:
efilter -molecule genomic -location chloroplast -country sweden -days 365
Efetch downloads selected records or reports in a designated format:
efetch -format abstract
Individual query commands are connected by a Unix vertical bar pipe symbol:
esearch -db pubmed -query "tn3 transposition immunity" | efetch -format medline
DISCOVERY BY ENTREZ NAVIGATION
PubMed related articles are calculated by a statistical algorithm using the title, abstract, and medical subject headings (MeSH terms). These connections between papers can be used for knowledge discovery.
Lycopene cyclase converts lycopene to beta-carotene, the immediate precursor of vitamin A. An initial search on the enzyme results in 242 articles. Looking up precomputed neighbors returns 14,596 PubMed papers, some of which might be expected to discuss adjacent steps in the biosynthetic pathway:
esearch -db pubmed -query "lycopene cyclase" |
elink -related |
Linking to the protein database finds 302,203 sequence records, each of which has standardized organism information from the NCBI taxonomy. Limiting to curated proteins in mice returns 25 records:
elink -target protein |
efilter -organism mouse -source refseq |
(Animals do not encode the genes involved in carotene biosynthesis, except for aphids and their ilk, which apparently obtained them by horizontal gene transfer from fungi.)
Records are then retrieved in FASTA format:
efetch -format fasta
As anticipated, the results include the enzyme that splits beta-carotene into two molecules of retinal:
...
>NP_067461.2 beta,beta-carotene 15,15'-dioxygenase isoform 1 [Mus musculus]
MEIIFGQNKKEQLEPVQAKVTGSIPAWLQGTLLRNGPGMHTVGESKYNHWFDGLALLHSFSIRDGEVFYR
SKYLQSDTYIANIEANRIVVSEFGTMAYPDPCKNIFSKAFSYLSHTIPDFTDNCLINIMKCGEDFYATTE
TNYIRKIDPQTLETLEKVDYRKYVAVNLATSHPHYDEAGNVLNMGTSVVDKGRTKYVIFKIPATVPDSKK
KGKSPVKHAEVFCSISSRSLLSPSYYHSFGVTENYVVFLEQPFKLDILKMATAYMRGVSWASCMSFDRED
KTYIHIIDQRTRKPVPTKFYTDPMVVFHHVNAYEEDGCVLFDVIAYEDSSLYQLFYLANLNKDFEEKSRL
TSVPTLRRFAVPLHVDKDAEVGSNLVKVSSTTATALKEKDGHVYCQPEVLYEGLELPRINYAYNGKPYRY
IFAAEVQWSPVPTKILKYDILTKSSLKWSEESCWPAEPLFVPTPGAKDEDDGVILSAIVSTDPQKLPFLL
ILDAKSFTELARASVDADMHLDLHGLFIPDADWNAVKQTPAETQEVENSDHPTDPTAPELSHSENDFTAG
HGGSSL
...
The entire set of commands runs in 8 seconds. There is no need to write a script to loop over records one at a time or in small groups, or write code to retry after a transient network failure, or add a time delay between requests. All of these features are already built into the EDirect commands.
STRUCTURED DATA EXTRACTION
The xtract program uses command-line arguments to direct the conversion of XML data into a tab-delimited table. The -pattern argument divides the results into rows, while placement of data into columns is controlled by -element.
Formatting arguments allow extensive customization of the output. The line break between -pattern objects can be changed with -ret, and the tab character between -element fields can be replaced by -tab. The -sep argument is used to distinguish multiple elements of the same type, and controls their separation independently of the -tab argument. The -sep value also applies to unrelated -element arguments that are grouped with commas. The following query:
efetch -db pubmed -id 6271474,1413997,16589597 -format docsum |
xtract -pattern DocumentSummary -sep "|" -element Id PubDate Name
returns a table with individual author names separated by vertical bars:
6271474 1981 Casadaban MJ|Chou J|Lemaux P|Tu CP|Cohen SN
1413997 1992 Oct Mortimer RK|Contopoulou CR|King JS
16589597 1954 Dec Garber ED
Selection arguments are specialized derivatives of -element. Among these are positional commands (-first and -last) and numeric processing operations (including -num, -len, -sum, -min, -max, and -avg). There are also functions that perform sequence coordinate conversion (-0-based, -1-based, and -ucsc-based).
NESTED EXPLORATION
Exploration arguments (-pattern, -group, -block, and -subset) limit data extraction to specified regions of the XML, visiting all relevant objects one at a time. This design allows nested exploration of complex, hierarchical data to be controlled by a linear chain of command-line argument statements.
PubmedArticle XML contains the MeSH terms applied to a publication. Each MeSH term can have its own unique set of qualifiers. A single level of nested exploration within the current pattern:
esearch -db gene -query "beta-carotene oxygenase 1" -organism human |
elink -target pubmed | efilter -released last_year | efetch -format xml |
xtract -pattern PubmedArticle -element MedlineCitation/PMID \
-block MeshHeading \
-pfc "\n" -sep "/" -element DescriptorName,QualifierName
retains the proper association of subheadings for each MeSH term:
30396924
Age Factors
Animals
Cell Cycle Proteins/deficiency/genetics/metabolism
Cellular Senescence/physiology
...
A second level (-subset) would be needed to print major topic attributes next to their parent subheadings.
CONDITIONAL EXECUTION
Conditional processing arguments (-if, -unless, -and, -or, and -else) restrict exploration by object name and value. These may be used in conjunction with string or numeric constraints:
esearch -db pubmed -query "Casadaban MJ [AUTH]" |
efetch -format xml |
xtract -pattern PubmedArticle -if "#Author" -lt 6 \
-block Author -if LastName -is-not Casadaban \
-sep ", " -tab "\n" -element LastName,Initials |
sort-uniq-count-rank
to select papers with fewer than 6 authors and print a table of the most frequent coauthors:
11 Chou, J
8 Cohen, SN
7 Groisman, EA
4 Darzins, A
3 Castilho, BA
...
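The final counting step collapses identical lines into a count and ranks them by frequency. A rough equivalent built from standard Unix tools is sketched below to show the idea. It is an illustration only, not the actual implementation of sort-uniq-count-rank, and the function name count_rank is hypothetical:

```shell
# Hypothetical stand-in for the counting step: tally identical input
# lines, then list them from most to least frequent as "count<TAB>line".
count_rank() {
  sort |
  uniq -c |
  awk '{ n = $1; sub(/^[ \t]*[0-9]+[ \t]/, ""); print n "\t" $0 }' |
  sort -t "$(printf '\t')" -k 1,1nr -k 2,2
}

# Usage with a few author names in the same "LastName, Initials" layout.
printf '%s\n' 'Chou, J' 'Cohen, SN' 'Chou, J' 'Darzins, A' 'Chou, J' 'Cohen, SN' |
count_rank
```

Ties in the count are broken alphabetically here. That is a choice made for this sketch.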
SAVING DATA IN VARIABLES
A value can be recorded in a variable and used wherever needed. Variables are created by a hyphen followed by a name consisting of a string of capital letters or digits (e.g., -PMID). Values are retrieved by placing an ampersand before the variable name (e.g., "&PMID") in an -element statement:
efetch -db pubmed -id 3201829,6301692,781293 -format xml |
xtract -pattern PubmedArticle -PMID MedlineCitation/PMID \
-block Author -element "&PMID" \
-sep " " -tab "\n" -element Initials,LastName
producing a list of authors, with the PubMed Identifier in the first column of each row:
3201829 JR Johnston
3201829 CR Contopoulou
3201829 RK Mortimer
6301692 MA Krasnow
6301692 NR Cozzarelli
781293 MJ Casadaban
The variable can be used even though the original object is no longer visible inside the -block section.
SEQUENCE QUALIFIERS
The NCBI represents sequence records in a data model based on the central dogma of molecular biology. A sequence can have multiple features, which carry information about the biology of a given region, including the transformations involved in gene expression. A feature can have multiple qualifiers, which store specific details about that feature (e.g., name of the gene, genetic code used for translation).
The data hierarchy is easily explored using a -pattern {sequence} -group {feature} -block {qualifier} construct. As a convenience, an -insd helper function generates the appropriate nested extraction commands from feature and qualifier names on the command line. For example, processing the results of a search on cone snail venom:
esearch -db protein -query "conotoxin" -feature mat_peptide |
efetch -format gpc |
xtract -insd complete mat_peptide "%peptide" product peptide |
grep -i conotoxin | sort -t $'\t' -u -k 2,2n
returns the accession, peptide length, name, and sequence for a sample of neurotoxic peptides:
ADB43131.1 15 conotoxin Cal 1b LCCKRHHGCHPCGRT
ADB43128.1 16 conotoxin Cal 5.1 DPAPCCQHPIETCCRR
AIC77105.1 17 conotoxin Lt1.4 GCCSHPACDVNNPDICG
ADB43129.1 18 conotoxin Cal 5.2 MIQRSQCCAVKKNCCHVG
ADD97803.1 20 conotoxin Cal 1.2 AGCCPTIMYKTGACRTNRCR
AIC77085.1 21 conotoxin Bt14.8 NECDNCMRSFCSMIYEKCRLK
ADB43125.1 22 conotoxin Cal 14.2 GCPADCPNTCDSSNKCSPGFPG
AIC77154.1 23 conotoxin Bt14.19 VREKDCPPHPVPGMHKCVCLKTC
...
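The last sort in this pipeline is what reduces the matches to a sample. With -u and a numeric key on the second column, sort keeps only one line for each distinct peptide length. A minimal offline sketch of that step follows. The function name one_per_length and the EXAMPLE.1 row are made up for illustration, and the other two rows are taken from the output above:

```shell
# Keep one row per distinct value in column 2 (peptide length),
# ordered numerically by that column.
one_per_length() {
  sort -t "$(printf '\t')" -u -k 2,2n
}

# Two real rows plus one placeholder row that shares a length of 15.
printf '%s\t%s\t%s\t%s\n' \
  ADB43128.1 16 'conotoxin Cal 5.1' DPAPCCQHPIETCCRR \
  ADB43131.1 15 'conotoxin Cal 1b' LCCKRHHGCHPCGRT \
  EXAMPLE.1 15 'placeholder row' XXXXXXXXXXXXXXX |
one_per_length
```

Only two rows are printed, one for length 15 and one for length 16.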
GENES IN A REGION
Suppose a human disease gene has been mapped between two specific markers near the X chromosome centromere, and we want to find all possible candidates for the gene. Genes on the X chromosome can be retrieved with:
esearch -db gene -query "Homo sapiens [ORGN] AND X [CHR]" |
efilter -status alive -type coding | efetch -format docsum |
Gene names and chromosomal positions are extracted by piping those results to:
xtract -pattern DocumentSummary -NME Name -DSC Description \
-block GenomicInfoType -if ChrLoc -equals X \
-min ChrStart,ChrStop -element "&NME" "&DSC" |
Exploring each GenomicInfoType is needed because of pseudoautosomal regions at the ends of the X and Y chromosomes. Without limiting to chromosome X, the copy of IL9R near the "q" telomere of chromosome Y would be erroneously placed with genes that are near the X chromosome centromere.
Results can now be sorted, filtered, and passed to the between-two-genes script:
sort -k 1,1n | cut -f 2- |
grep -v pseudogene | grep -v uncharacterized |
between-two-genes AMER1 FAAH2
to produce a table of known genes located between two markers:
FAAH2 fatty acid amide hydrolase 2
SPIN2A spindlin family member 2A
ZXDB zinc finger X-linked duplicated B
NLRP2B NLR family pyrin domain containing 2B
ZXDA zinc finger X-linked duplicated A
SPIN4 spindlin family member 4
ARHGEF9 Cdc42 guanine nucleotide exchange factor 9
AMER1 APC membrane recruitment protein 1
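The final filtering step can be pictured as a small awk program over the position-sorted table. The sketch below is a hypothetical stand-in named between_markers, not the between-two-genes script itself. It assumes the gene name is in the first tab-delimited column and that the rows are already in chromosomal order:

```shell
# Hypothetical stand-in for the region filter: print rows from the first
# one whose name matches either marker through the row that matches the
# other marker, inclusive. The markers may be given in either order.
between_markers() {
  awk -F '\t' -v a="$1" -v b="$2" '
    $1 == a || $1 == b { hits++ }
    hits > 0 { print }
    hits > 1 { exit }
  '
}

# Usage with rows from the table above, plus two placeholder rows
# (BEFORE1, AFTER1) standing in for genes outside the interval.
printf '%s\t%s\n' \
  BEFORE1 'placeholder upstream gene' \
  FAAH2 'fatty acid amide hydrolase 2' \
  SPIN2A 'spindlin family member 2A' \
  AMER1 'APC membrane recruitment protein 1' \
  AFTER1 'placeholder downstream gene' |
between_markers AMER1 FAAH2
```

Because either marker can open the interval, the order of the two arguments does not need to match the order of the genes on the chromosome.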
GENES IN A PATHWAY
A gene can be linked to a biochemical pathway that utilizes its product:
esearch -db gene -query "PAH [GENE]" -organism human |
elink -target biosystems |
efilter -pathway wikipathways |
Linking from the pathway record back to the gene database:
elink -target gene |
efetch -format docsum |
xtract -pattern DocumentSummary -element Name Description |
grep -v pseudogene | grep -v uncharacterized |
sort -f
returns the set of all genes known to be involved in the pathway:
AANAT aralkylamine N-acetyltransferase
ACADM acyl-CoA dehydrogenase medium chain
ACHE acetylcholinesterase (Cartwright blood group)
ADCYAP1 adenylate cyclase activating polypeptide 1
...
RECURSIVE DEFINITIONS
When a recursive object is given to an exploration command:
efetch -db taxonomy -id 9606,7227,10090 -format xml |
xtract -pattern Taxon -element TaxId ScientificName
selection by -element only examines fields in the outermost objects:
9606 Homo sapiens
7227 Drosophila melanogaster
10090 Mus musculus
The star-slash prefix will descend a single level into the hierarchy:
efetch -db taxonomy -id 9606,7227,10090 -format xml |
xtract -pattern Taxon -block "*/Taxon" \
-if Rank -is-not "no rank" \
-tab "\n" -element TaxId,Rank,ScientificName
to print data on the individual lineage objects:
2759 superkingdom Eukaryota
33208 kingdom Metazoa
7711 phylum Chordata
89593 subphylum Craniata
8287 superclass Sarcopterygii
40674 class Mammalia
...
Recursive objects can be fully explored with a double-star-slash prefix:
esearch -db gene -query "rbcL [GENE] AND maize [ORGN]" |
efetch -format xml |
xtract -pattern Entrezgene -block "**/Gene-commentary" \
Metadata annotated in attributes:
<Gene-commentary_type value="genomic">1</Gene-commentary_type>
is selected with an "at" sign before the attribute name:
-if Gene-commentary_type@value -equals genomic \
-tab "\n" -element Gene-commentary_accession |
sort | uniq
This prints every genomic accession regardless of nesting depth:
NC_001666
X86563
Z11973
HETEROGENEOUS OBJECTS
A query on curated biological database associations:
nquire -get http://mygene.info/v3/gene/2652 |
xtract -j2x -set - -rec GeneRec |
returns a heterogeneous mixture of objects in the pathway section:
<pathway>
<reactome>
<id>R-HSA-162582</id>
<name>Signal Transduction</name>
</reactome>
...
<wikipathways>
<id>WP455</id>
<name>GPCRs, Class A Rhodopsin-like</name>
</wikipathways>
</pathway>
The slash-star suffix is used to visit the individual components of a parent object without needing to explicitly specify their names. For printing, the name of a child object is indicated by a question mark:
xtract -pattern GeneRec -group "pathway/*" \
-pfc "\n" -element "?,name,id"
This displays a table of pathway database references:
reactome Signal Transduction R-HSA-162582
reactome Disease R-HSA-1643685
...
reactome Diseases of signal transduction R-HSA-5663202
wikipathways GPCRs, Class A Rhodopsin-like WP455
LOCAL PUBMED CACHE
Fetching data from Entrez works well when a few thousand records are needed, but it does not scale for much larger sets of data, where the time it takes to download becomes a limiting factor. EDirect can now preload all 30 million PubMed records onto an inexpensive external 500 GB solid state drive, using a hierarchy of folders to organize the data for rapid retrieval of any record. For example, PMID 12345678 would be stored (as a compressed XML file) at /Archive/12/34/56/12345678.xml.gz.
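The folder layout can be sketched as a small shell function that maps a PMID to its archive path. This is an illustration of the two-digit folder scheme only. The function name archive_path is hypothetical, and since the text gives only the eight-digit case, the sketch rejects anything else rather than guess how shorter identifiers are padded:

```shell
# Hypothetical helper: map an 8-digit PMID to its path under /Archive,
# using successive pairs of digits as nested folder names.
archive_path() {
  case "$1" in
    [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]) ;;
    *) echo "expected an 8-digit PMID" >&2; return 1 ;;
  esac
  echo "$1" |
  sed 's#^\(..\)\(..\)\(..\)\(..\)$#/Archive/\1/\2/\3/&.xml.gz#'
}

archive_path 12345678
```

With this layout, no folder at any level holds more than 100 entries, which keeps directory lookups fast.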
Reference the external drive by setting an environment variable in your configuration file:
export EDIRECT_PUBMED_MASTER=/Volumes/your_disk_name
and run the command:
index-pubmed
to download the PubMed release files and distribute each record for random access.
Piping a list of 100,000 PMIDs to the fetch-pubmed script, which returns a PubmedArticleSet containing the requested records, takes about 12 seconds. Retrieving those records from NCBI's network service, with efetch -format xml, would take around 40 minutes.
Even moderately large queries can benefit from the local cache. A reverse citation lookup on 191 papers:
esearch -db pubmed -query "Cozzarelli NR [AUTH]" |
elink -cited |
takes 7 seconds to match 7134 subsequent articles. Fetching them from the local archive:
efetch -format uid |
fetch-pubmed |
is practically instantaneous. Printing the names of each author:
xtract -pattern PubmedArticle -block Author \
-sep " " -tab "\n" -element LastName,Initials |
allows creation of a frequency table:
sort-uniq-count-rank
that lists the authors who most often cited the original papers:
112 Cozzarelli NR
73 Maxwell A
56 Wang JC
49 Osheroff N
48 Stasiak A
...
Using the network service instead of the local cache would add 2 minutes to the 10 second running time.
LOCAL SEARCH INDEX
A similar divide-and-conquer strategy is used to create a local information retrieval index suitable for large data mining queries.
For PubMed titles and primary abstracts, the indexing process deletes hyphens after specific prefixes, removes accents and diacritical marks, splits words at punctuation characters, corrects encoding artifacts, and spells out Greek letters for easier searching on scientific terms. It then prepares inverted indices with term positions, and uses them to build distributed term lists and postings files.
For example, the term list that includes "cancer" would be located at /Postings/NORM/c/a/n/c/canc.trm. A query on cancer thus only needs to load a very small subset of the total index. This design allows efficient expression evaluation, unrestricted wildcard truncation, phrase queries, and proximity searches.
The full set of indexed terms, without record counts, can be printed for any field:
phrase-search -terms NORM
In local queries, a trailing asterisk is used to indicate term truncation:
phrase-search -count "catabolite repress*"
Using -counts returns expanded terms and individual postings counts:
phrase-search -counts "catabolite repress*"
Query evaluation includes Boolean operations and parenthetical expressions:
phrase-search -query "(literacy AND numeracy) NOT (adolescent OR child)"
Adjacent words in the query are treated as a contiguous phrase:
phrase-search -query "selective serotonin reuptake inhibit*"
More inclusive searches can use words processed by the Porter2 stemming algorithm:
phrase-search -query "monoamine oxidase inhibitor [STEM]"
Each plus sign will replace a single word inside a phrase:
phrase-search -query "vitamin c + + common cold"
Runs of tildes indicate the maximum distance between phrases:
phrase-search -query "vitamin c ~ ~ common cold"
MeSH hierarchy code and year of publication are also indexed:
phrase-search -query "C14.907.617.812* [TREE] AND 2015:2018 [YEAR]"
An exact match can search for all or part of a title or abstract:
phrase-search -exact "Genetic Control of Biochemical Reactions in Neurospora."
All query commands return a list of PMIDs, which can be piped directly to fetch-pubmed to retrieve the records. For example:
phrase-search -query "selective serotonin ~ ~ reuptake inhibitor*" |
fetch-pubmed |
xtract -pattern PubmedArticle -num Author |
sort-uniq-count -n |
reorder-columns 2 1 |
head -n 25 |
tee /dev/tty |
xy-plot auth.png
performs a proximity search with dynamic wildcard expansion (matching phrases like "selective serotonin and norepinephrine reuptake inhibitors") and fetches 11985 PubMed records from the local archive. It then counts the number of authors for each paper, printing a frequency table of the number of papers per number of coauthors:
0 49
1 1335
2 1790
3 1802
4 1633
5 1425
6 1117
7 891
8 584
9 398
...
and creating a visual graph of the data. The entire set of commands runs in under 3 seconds.
NATURAL LANGUAGE PROCESSING RESOURCES
Additional annotation on PubMed can be downloaded and indexed by running:
index-extras
NCBI's Biomedical Text Mining Group performs computational analysis of PubMed and PMC papers, and extracts chemical, disease, and gene references from the article contents (see PMID 31114887). Along with NLM Gene Reference Into Function mappings (see PMID 14728215), these terms are indexed in CHEM, DISZ, and GENE fields.
Recent research at Stanford defined biological themes, supported by dependency paths, which are indexed as THME and PATH fields. Theme keys are taken from a table in the paper (see PMID 29490008), but disambiguated so themes common to multiple relationships can reside in a single index:
Chemical-Gene
  A+   agonism, activation
  A-   antagonism, blocking
  Bc   binding, ligand (especially receptors)
  Ec+  increases expression/production
  Ec-  decreases expression/production
  Ec   affects expression/production (neutral)
  N    inhibits
Gene-Chemical
  O    transport, channels
  K    metabolism, pharmacokinetics
  Z    enzyme activity
Chemical-Disease
  T    treatment/therapy (including investigatory)
  C    inhibits cell growth (especially cancers)
  Sa   side effect/adverse event
  Pr   prevents, suppresses
  Pa   alleviates, reduces
  Jc   role in disease pathogenesis
Disease-Chemical
  Mp   biomarkers (of disease progression)
Gene-Disease
  U    causal mutations
  Ud   mutations affecting disease course
  D    drug targets
  Jg   role in pathogenesis
  Te   possible therapeutic effect
  Y    polymorphisms alter risk
  G    promotes progression
Disease-Gene
  Md   biomarkers (diagnostic)
  X    overexpression in disease
  L    improper regulation linked to disease
Gene-Gene
  Bg   binding, ligand (especially receptors)
  W    enhances response
  V+   activates, stimulates
  Eg+  increases expression/production
  Eg   affects expression/production (neutral)
  I    signaling pathway
  H    same protein or complex
  Rg   regulation
  Q    production by cell population
INTEGRATION WITH ENTREZ
The phrase-search -filter command allows UIDs to be generated by an EDirect search and then incorporated as a component in a local query:
esearch -db pubmed -query "complement system proteins [MESH]" -pub clinical |
efetch -format uid |
phrase-search -filter "L [THME] AND D03* [TREE]"
This finds PubMed clinical papers about complement proteins and limits them by the "improper regulation linked to disease" theme and the heterocyclic compounds MeSH chemical code:
7683550
19235040
20587159
22368276
24431228
26151457
Intermediate lists of PMIDs can be saved to a file and piped (with "cat") into a subsequent phrase-search -filter query.
AUTOMATION AND COMPREHENSIVE EXPLORATION
The phrase-search system can be easily automated. For example, a simple script can walk up the MeSH hierarchy:
ascend_mesh_tree() {
var="${1%\*}"
while :
do
phrase-search -count "$var* [TREE]"
case "$var" in
*.* ) var="${var%????}" ;;
* ) break ;;
esac
done
}
ascend_mesh_tree "C14.907.617.812"
from narrower to broader topics, producing counts of records at or below each level:
6678 c14 907 617 812*
52001 c14 907 617*
1618720 c14 907*
2313378 c14*
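The loop works because every level below the top of a MeSH tree number is a period followed by three digits, so removing the last four characters moves up one level. The sketch below prints the successive queries without calling phrase-search, so it can be tried without a local index. The function name mesh_levels is hypothetical:

```shell
# Print the query string for each level, from narrowest to broadest.
mesh_levels() {
  var="${1%\*}"                          # drop a trailing asterisk, if any
  while :
  do
    echo "$var* [TREE]"
    case "$var" in
      *.* ) var="${var%????}" ;;         # strip ".NNN" to move up a level
      *   ) break ;;                     # no period left: top of the tree
    esac
  done
}

mesh_levels "C14.907.617.812"
```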
Nested "for" loops perform a non-redundant pairwise comparison of themes:
declare -a THEMES
THEMES=( A+ A- Bc Bg C D Ec Ec+ Ec- Eg \
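# (sketch, not part of the original script) The unique_pairs helper below
# is hypothetical: it lists each unordered pair exactly once by peeling the
# first item off the list with "shift", the same shrinking-list idea as REMAINS.
unique_pairs() {
  while [ "$#" -gt 1 ]
  do
    first=$1
    shift
    for second in "$@"
    do
      printf '%s\t%s\n' "$first" "$second"
    done
  done
}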
Eg+ G H I Jc Jg K L Md Mp N O Pa \
Pr Q Rg Sa T Te U Ud V+ W X Y Z )
declare -a REMAINS
REMAINS=("${THEMES[@]:1}")
for fst in ${THEMES[@]}
do
num=$(phrase-search -query "$fst [THME]" | wc -l)
echo -e "$fst\t \t$num"
for scd in ${REMAINS[@]}
do
num=$(phrase-search -query "$fst [THME] AND $scd [THME]" | wc -l)
echo -e "$fst\t$scd\t$num"
echo -e "$scd\t$fst\t$num"
done
REMAINS=("${REMAINS[@]:1}")
done | sort | expand -t 7,13
producing a table of co-occurrence counts:
A+ 28322
A+ A- 1305
A+ Bc 5634
A+ Bg 1645
A+ C 2188
A+ D 633
A+ Ec 10364
...
Shrinking the REMAINS array avoids redundant searches (e.g., querying both "A+ AND Ec" and "Ec AND A+"), while each result is still reported in both directions.
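The saving from the shrinking list can be checked offline. The sketch below enumerates each unordered pair once using the positional parameters and "shift" in place of a bash array. The function name unique_pairs is hypothetical, and no searches are run:

```shell
# Emit each unordered pair exactly once: pair the first item with all
# remaining items, then drop it from the list and repeat.
unique_pairs() {
  while [ "$#" -gt 1 ]
  do
    first=$1
    shift
    for second in "$@"
    do
      printf '%s\t%s\n' "$first" "$second"
    done
  done
}

# Four themes give 4 * 3 / 2 = 6 pairs.
unique_pairs A+ A- Bc Bg
```

For the 36 themes in the THEMES array, this amounts to 36 * 35 / 2 = 630 pairwise queries rather than 1,260.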
IDENTIFIER CONVERSION
The index-pubmed script downloads MeSH descriptor files from NLM and creates a conversion file:
...
<Rec>
<Code>D064007</Code>
<Name>Ataxia Telangiectasia Mutated Proteins</Name>
<Tree>D08.811.913.696.620.682.700.097</Tree>
<Tree>D12.776.157.687.125</Tree>
<Tree>D12.776.660.720.125</Tree>
</Rec>
...
that can be used for mapping MeSH codes to and from chemical or disease names. For example, running:
cat "$EDIRECT_PUBMED_MASTER"/Data/meshconv.xml |
xtract -pattern Rec \
-if Name -starts-with "ataxia telangiectasia" \
-element Code
will return:
C565779
C576887
D001260
D064007
RAPIDLY SCANNING ALL OF PUBMED
If the expand-current script is run after PubMed indexing, an ad hoc scan can be performed on the entire set of live PubMed records:
cat "$EDIRECT_PUBMED_MASTER"/Current/*.xml |
xtract -timer -pattern PubmedArticle \
-if "#Author" -eq 7 \
-element MedlineCitation/PMID LastName
in this case finding articles with seven authors. (Author count is not indexed by Entrez or locally by EDirect.)
(Note that the data produced by running both index-extras and expand-current may not fit on a 500 GB drive.)
IMPLEMENTATION DETAILS
Xtract uses the Boyer-Moore-Horspool algorithm to partition an XML stream into individual records, sending them down a thread-safe communication channel to be distributed among multiple instances of the data exploration and extraction function. On a modern six-core computer, it can process the previous query on all 30 million PubMed records in just under 4 minutes, a sustained rate of over 125,000 records per second.
Rchive is used to build and search the inverted retrieval system. The fetch-pubmed and phrase-search scripts are front-ends to rchive, which is also multi-threaded for speed. It can match several PubMed titles per second, fetching the positional indices for all terms in parallel before evaluating the title words as a contiguous phrase.
Both xtract and rchive are written in Google's compiled Go language. Local archive and search is a completely self-contained, turnkey system, with no need for a novice user to download and configure complicated third-party database software.
EXPLORATION OF EXTERNAL SERVICES
The experimental xplore script expands the EDirect paradigm to navigate connections in the biological resources of the BioThings.io data integration project at Scripps Research (see PMID 23175613). A drug repurposing example:
xplore -load hgvs "chr6:g.26093141G>A,chr12:g.111351981C>T" |
xplore -link ncbigene |
xplore -link wikipathways |
xplore -link ncbigene |
xplore -link uniprot |
xplore -link inchikey |
xplore -save uid
runs in 18 seconds and returns 1030 chemicals that might act on gene products in two pathways, and would thus be candidates for treating hereditary hemochromatosis or hypertrophic cardiomyopathy. There is initial support in xplore -search for -organism and -action shortcuts, similar to what is available in efilter.
CONVERSION OF JSON TO XML
Consolidated gene information retrieved in JSON format:
nquire -get http://mygene.info/v3 gene 3043 |
contains a multi-dimensional JSON array of exon coordinates:
"position": [
[
5225463,
5225726
],
[
5226576,
5226799
],
[
5226929,
5227071
]
],
This can be converted to XML with xtract -j2x:
xtract -j2x -set - -rec GeneRec -nest plural |
using -nest plural to derive a parent name that keeps the internal structure intact in XML:
<positions>
<position>5225463</position>
<position>5225726</position>
</positions>
...
Individual exons can then be visited by piping the record through:
xtract -pattern GeneRec -group exons -block positions \
-pfc "\n" -element position
to print a tab-delimited table of start and stop positions:
5225463 5225726
5226576 5226799
5226929 5227071
CONVERSION OF TABLES TO XML
Tab-delimited data can be converted to XML with xtract -t2x:
nquire -ftp ftp.ncbi.nlm.nih.gov gene/DATA gene_info.gz |
gunzip -c | grep -v NEWENTRY | cut -f 2,3 |
xtract -t2x -set Set -rec Rec -skip 1 -indent Code Name
which takes command-line arguments of XML tag names for wrapping the entire set, each record, and individual columns:
<Set>
<Rec>
<Code>1246500</Code>
<Name>repA1</Name>
</Rec>
<Rec>
<Code>1246501</Code>
<Name>repA2</Name>
</Rec>
<Rec>
<Code>1246502</Code>
<Name>leuA</Name>
</Rec>
...
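The mapping from columns to tags can be pictured with a short awk program. The sketch below is only an illustration of the transformation for this two-column case, not how xtract -t2x is implemented. The function name table_to_xml is hypothetical, and the tag names are hard-coded rather than taken from arguments:

```shell
# Hypothetical illustration: wrap a two-column, tab-delimited table in
# <Set>/<Rec>/<Code>/<Name> tags, skipping the first (header) line.
table_to_xml() {
  awk -F '\t' '
    # Escape characters that are special in XML content.
    function esc(s) {
      gsub(/&/, "\\&amp;", s)
      gsub(/</, "\\&lt;", s)
      gsub(/>/, "\\&gt;", s)
      return s
    }
    BEGIN { print "<Set>" }
    NR > 1 {
      print "  <Rec>"
      print "    <Code>" esc($1) "</Code>"
      print "    <Name>" esc($2) "</Name>"
      print "  </Rec>"
    }
    END { print "</Set>" }
  '
}

printf 'GeneID\tSymbol\n1246500\trepA1\n1246501\trepA2\n' | table_to_xml
```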
FUTURE DIRECTIONS
An iterative search/fetch/extract/compute cycle, with customized local indices, integration of natural language processing results, and no penalties for exhaustive exploration, has the potential to open up discovery by computation to a larger audience of laboratory biologists, without requiring extensive bioinformatics experience.
INSTALLATION
EDirect consists of a set of scripts and programs that are downloaded to the user's computer.
EDirect will run on Unix and Macintosh computers that have the Perl language installed, and under the Cygwin Unix-emulation environment on Windows PCs.
To install the EDirect software, open a terminal window and execute one of the following two commands:
sh -c "$(curl -fsSL ftp://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/install-edirect.sh)"
sh -c "$(wget -q ftp://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/install-edirect.sh -O -)"
If neither curl nor wget is available, see the installation commands in the EDirect web documentation.
This downloads several scripts into an "edirect" folder in the user's home directory. It then fetches any missing Perl modules, and installs platform-specific executables for xtract and rchive.
At the end of this process, the script will ask for permission to add EDirect to your PATH permanently by editing your configuration file. If you answer "y" it will add:
export PATH=${PATH}:$HOME/edirect
to the end of your .bash_profile file. If you answer "n", you should then manually edit .bash_profile to add the edirect folder as one of the components of your existing PATH assignment statement.
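When editing the configuration file by hand, one option is to guard the assignment so that the folder is appended only once, even if the file is sourced repeatedly. This is a minimal sketch of that idea and not part of the installer. The function name add_to_path is hypothetical:

```shell
# Append a directory to PATH only if it is not already one of its
# colon-separated components.
add_to_path() {
  case ":$PATH:" in
    *":$1:"*) ;;                 # already present: leave PATH unchanged
    *) PATH="$PATH:$1" ;;
  esac
}

add_to_path "$HOME/edirect"
export PATH
```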
DOCUMENTATION
Documentation for EDirect is on the web at:
http://www.ncbi.nlm.nih.gov/books/NBK179288
NCBI database resources are described by:
https://www.ncbi.nlm.nih.gov/pubmed/31602479
Instructions for obtaining an API Key are given in this NCBI blog post:
https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities
Questions or comments on EDirect may be sent to info@ncbi.nlm.nih.gov.