ENTREZ DIRECT: COMMAND LINE ACCESS TO NCBI ENTREZ DATABASES

Searching, retrieving, and parsing data from NCBI databases through the Unix command line.

INTRODUCTION

Entrez Direct (EDirect) provides access to the NCBI's Entrez suite of interconnected databases from a Unix terminal window. Search terms are given in command-line arguments. Individual operations are connected with Unix pipes to allow construction of multi-step queries. Selected records can then be retrieved in a variety of formats.

EDirect also includes an argument-driven utility that simplifies the extraction of results in structured XML or JSON format, and a program that builds a URL from command-line arguments for easy access to external CGI data services. These can eliminate the need for writing custom software to answer ad hoc questions.

Queries can move seamlessly between EDirect commands and Unix utilities or scripts to perform actions that cannot be accomplished entirely within Entrez.

PROGRAMMATIC ACCESS

Several underlying network services provide access to different facets of Entrez. These include searching by indexed terms, looking up precomputed neighbors or links, filtering results by date or category, and downloading record summaries or reports. The same functionalities are available on the web or when using programmatic methods.

EDirect navigation programs (esearch, elink, efilter, and efetch) communicate by means of a small structured message, which can be passed invisibly between operations with a Unix pipe. The message includes the current database, so it does not need to be given as an argument after the first step.

All EDirect commands are designed to work on large sets of data. Intermediate results are stored on the Entrez history server. For best performance, obtain an API Key from NCBI, and place the following line in your .bash_profile file:

  export NCBI_API_KEY=user_api_key_goes_here

Each program also has a -help argument that prints detailed information about the available arguments.

NAVIGATION FUNCTIONS

Esearch performs a new Entrez search using terms in indexed fields. It requires a -db argument for the database name and uses -query to obtain the search terms. For PubMed, without field qualifiers, the server uses automatic term mapping to compose a search strategy by translating the supplied query:

  esearch -db pubmed -query "selective serotonin reuptake inhibitor"

Search terms can also be qualified with bracketed field names:

  esearch -db nucleotide -query "insulin [PROT] AND rodents [ORGN]"

Elink looks up precomputed neighbors within a database, or finds associated records in other databases:

  elink -related

  elink -target gene

and for PubMed it can follow references in the NIH Open Citation Collection dataset (see PMID 31600197):

  elink -cited

  elink -cites

Efilter limits the results of a previous query, with shortcuts that can also be used in esearch:

  efilter -molecule genomic -location chloroplast -country sweden -days 365

Efetch downloads selected records or reports in a designated format:

  efetch -format abstract

Individual query commands are connected by a Unix vertical bar pipe symbol:

  esearch -db pubmed -query "tn3 transposition immunity" | efetch -format medline

DISCOVERY BY ENTREZ NAVIGATION

PubMed related articles are calculated by a statistical algorithm using the title, abstract, and medical subject headings (MeSH terms). These connections between papers can be used for knowledge discovery.

Lycopene cyclase converts lycopene to beta-carotene, the immediate precursor of vitamin A. An initial search on the enzyme results in 242 articles. Looking up precomputed neighbors returns 14,596 PubMed papers, some of which might be expected to discuss adjacent steps in the biosynthetic pathway:

  esearch -db pubmed -query "lycopene cyclase" |
  elink -related |

Linking to the protein database finds 302,203 sequence records, each of which has standardized organism information from the NCBI taxonomy. Limiting to curated proteins in mice returns 25 records:

  elink -target protein |
  efilter -organism mouse -source refseq |

(Animals do not encode the genes involved in carotene biosynthesis. The exceptions are aphids and their ilk, which apparently obtained them by horizontal gene transfer from fungi.)

Records are then retrieved in FASTA format:

  efetch -format fasta

As anticipated, the results include the enzyme that splits beta-carotene into two molecules of retinal:

  ...
  >NP_067461.2 beta,beta-carotene 15,15'-dioxygenase isoform 1 [Mus musculus]
  MEIIFGQNKKEQLEPVQAKVTGSIPAWLQGTLLRNGPGMHTVGESKYNHWFDGLALLHSFSIRDGEVFYR
  SKYLQSDTYIANIEANRIVVSEFGTMAYPDPCKNIFSKAFSYLSHTIPDFTDNCLINIMKCGEDFYATTE
  TNYIRKIDPQTLETLEKVDYRKYVAVNLATSHPHYDEAGNVLNMGTSVVDKGRTKYVIFKIPATVPDSKK
  KGKSPVKHAEVFCSISSRSLLSPSYYHSFGVTENYVVFLEQPFKLDILKMATAYMRGVSWASCMSFDRED
  KTYIHIIDQRTRKPVPTKFYTDPMVVFHHVNAYEEDGCVLFDVIAYEDSSLYQLFYLANLNKDFEEKSRL
  TSVPTLRRFAVPLHVDKDAEVGSNLVKVSSTTATALKEKDGHVYCQPEVLYEGLELPRINYAYNGKPYRY
  IFAAEVQWSPVPTKILKYDILTKSSLKWSEESCWPAEPLFVPTPGAKDEDDGVILSAIVSTDPQKLPFLL
  ILDAKSFTELARASVDADMHLDLHGLFIPDADWNAVKQTPAETQEVENSDHPTDPTAPELSHSENDFTAG
  HGGSSL
  ...

The entire set of commands runs in 8 seconds. There is no need to write a script to loop over records one at a time or in small groups, or write code to retry after a transient network failure, or add a time delay between requests. All of these features are already built into the EDirect commands.

STRUCTURED DATA EXTRACTION

The xtract program uses command-line arguments to direct the conversion of XML data into a tab-delimited table. The -pattern argument divides the results into rows, while placement of data into columns is controlled by -element.

Formatting arguments allow extensive customization of the output. The line break between -pattern objects can be changed with -ret, and the tab character between -element fields can be replaced by -tab. The -sep argument is used to distinguish multiple elements of the same type, and controls their separation independently of the -tab argument. The -sep value also applies to unrelated -element arguments that are grouped with commas. The following query:

  efetch -db pubmed -id 6271474,1413997,16589597 -format docsum |
  xtract -pattern DocumentSummary -sep "|" -element Id PubDate Name

returns a table with individual author names separated by vertical bars:

  6271474     1981        Casadaban MJ|Chou J|Lemaux P|Tu CP|Cohen SN
  1413997     1992 Oct    Mortimer RK|Contopoulou CR|King JS
  16589597    1954 Dec    Garber ED

Selection arguments are specialized derivatives of -element. Among these are positional commands (-first and -last) and numeric processing operations (including -num, -len, -sum, -min, -max, and -avg). There are also functions that perform sequence coordinate conversion (-0-based, -1-based, and -ucsc-based).

NESTED EXPLORATION

Exploration arguments (-pattern, -group, -block, and -subset) limit data extraction to specified regions of the XML, visiting all relevant objects one at a time. This design allows nested exploration of complex, hierarchical data to be controlled by a linear chain of command-line argument statements.

PubmedArticle XML contains the MeSH terms applied to a publication. Each MeSH term can have its own unique set of qualifiers. A single level of nested exploration within the current pattern:

  esearch -db gene -query "beta-carotene oxygenase 1" -organism human |
  elink -target pubmed | efilter -released last_year | efetch -format xml |
  xtract -pattern PubmedArticle -element MedlineCitation/PMID \
    -block MeshHeading \
      -pfc "\n" -sep "/" -element DescriptorName,QualifierName

retains the proper association of subheadings for each MeSH term:

  30396924
  Age Factors
  Animals
  Cell Cycle Proteins/deficiency/genetics/metabolism
  Cellular Senescence/physiology
  ...

A second level (-subset) would be needed to print major topic attributes next to their parent subheadings.

CONDITIONAL EXECUTION

Conditional processing arguments (-if, -unless, -and, -or, and -else) restrict exploration by object name and value. These may be used in conjunction with string or numeric constraints:

  esearch -db pubmed -query "Casadaban MJ [AUTH]" |
  efetch -format xml |
  xtract -pattern PubmedArticle -if "#Author" -lt 6 \
    -block Author -if LastName -is-not Casadaban \
      -sep ", " -tab "\n" -element LastName,Initials |
  sort-uniq-count-rank

to select papers with fewer than 6 authors and print a table of the most frequent coauthors:

  11    Chou, J
  8     Cohen, SN
  7     Groisman, EA
  4     Darzins, A
  3     Castilho, BA
  ...
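
The sort-uniq-count-rank helper used above ships with EDirect. To show the idea in terms of standard Unix tools, the following is a minimal sketch of one way to build such a frequency table. The function name count_rank is hypothetical, and the sketch is case-sensitive, which may differ from the behavior of the EDirect script:

```shell
# Hypothetical helper: count identical input lines and rank them,
# most frequent first, with ties broken alphabetically.
count_rank() {
  sort | uniq -c |
  # uniq -c prints "<spaces><count> <line>"; convert to "<count><tab><line>"
  awk '{ n = $1; sub(/^ *[0-9]+ /, ""); printf "%s\t%s\n", n, $0 }' |
  sort -t "$(printf '\t')" -k 1,1nr -k 2,2
}

# Usage with a few names taken from the table above
printf '%s\n' "Chou, J" "Cohen, SN" "Chou, J" "Groisman, EA" "Chou, J" "Cohen, SN" |
count_rank
```

Each output row holds a count, a tab, and the original line, in the same layout as the coauthor table.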

SAVING DATA IN VARIABLES

A value can be recorded in a variable and used wherever needed. Variables are created by a hyphen followed by a name consisting of a string of capital letters or digits (e.g., -PMID). Values are retrieved by placing an ampersand before the variable name (e.g., "&PMID") in an -element statement:

  efetch -db pubmed -id 3201829,6301692,781293 -format xml |
  xtract -pattern PubmedArticle -PMID MedlineCitation/PMID \
    -block Author -element "&PMID" \
      -sep " " -tab "\n" -element Initials,LastName

producing a list of authors, with the PubMed Identifier in the first column of each row:

  3201829    JR Johnston
  3201829    CR Contopoulou
  3201829    RK Mortimer
  6301692    MA Krasnow
  6301692    NR Cozzarelli
  781293     MJ Casadaban

The variable can be used even though the original object is no longer visible inside the -block section.

SEQUENCE QUALIFIERS

The NCBI represents sequence records in a data model based on the central dogma of molecular biology. A sequence can have multiple features, which carry information about the biology of a given region, including the transformations involved in gene expression. A feature can have multiple qualifiers, which store specific details about that feature (e.g., name of the gene, genetic code used for translation).

The data hierarchy is easily explored using a -pattern {sequence} -group {feature} -block {qualifier} construct. As a convenience, an -insd helper function generates the appropriate nested extraction commands from feature and qualifier names on the command line. For example, processing the results of a search on cone snail venom:

  esearch -db protein -query "conotoxin" -feature mat_peptide |
  efetch -format gpc |
  xtract -insd complete mat_peptide "%peptide" product peptide |
  grep -i conotoxin | sort -t $'\t' -u -k 2,2n

returns the accession, peptide length, name, and sequence for a sample of neurotoxic peptides:

  ADB43131.1    15    conotoxin Cal 1b      LCCKRHHGCHPCGRT
  ADB43128.1    16    conotoxin Cal 5.1     DPAPCCQHPIETCCRR
  AIC77105.1    17    conotoxin Lt1.4       GCCSHPACDVNNPDICG
  ADB43129.1    18    conotoxin Cal 5.2     MIQRSQCCAVKKNCCHVG
  ADD97803.1    20    conotoxin Cal 1.2     AGCCPTIMYKTGACRTNRCR
  AIC77085.1    21    conotoxin Bt14.8      NECDNCMRSFCSMIYEKCRLK
  ADB43125.1    22    conotoxin Cal 14.2    GCPADCPNTCDSSNKCSPGFPG
  AIC77154.1    23    conotoxin Bt14.19     VREKDCPPHPVPGMHKCVCLKTC
  ...

GENES IN A REGION

Suppose a human disease gene has been mapped between two specific markers near the X chromosome centromere, and we want to find all possible candidates for the gene. Genes on the X chromosome can be retrieved with:

  esearch -db gene -query "Homo sapiens [ORGN] AND X [CHR]" |
  efilter -status alive -type coding | efetch -format docsum |

Gene names and chromosomal positions are extracted by piping those results to:

  xtract -pattern DocumentSummary -NME Name -DSC Description \
    -block GenomicInfoType -if ChrLoc -equals X \
      -min ChrStart,ChrStop -element "&NME" "&DSC" |

Exploring each GenomicInfoType is needed because of pseudoautosomal regions at the ends of the X and Y chromosomes. Without limiting to chromosome X, the copy of IL9R near the "q" telomere of chromosome Y would be erroneously placed with genes that are near the X chromosome centromere.

Results can now be sorted, filtered, and passed to the between-two-genes script:

  sort -k 1,1n | cut -f 2- |
  grep -v pseudogene | grep -v uncharacterized |
  between-two-genes AMER1 FAAH2

to produce a table of known genes located between two markers:

  FAAH2      fatty acid amide hydrolase 2
  SPIN2A     spindlin family member 2A
  ZXDB       zinc finger X-linked duplicated B
  NLRP2B     NLR family pyrin domain containing 2B
  ZXDA       zinc finger X-linked duplicated A
  SPIN4      spindlin family member 4
  ARHGEF9    Cdc42 guanine nucleotide exchange factor 9
  AMER1      APC membrane recruitment protein 1
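
The between-two-genes script is part of EDirect. As a rough illustration of the kind of filtering it performs, and not a copy of that script, the sketch below prints every row from the first marker through the second, in whichever order the two markers appear in the position-sorted input. The function name between_markers and the two placeholder rows are hypothetical:

```shell
# Hypothetical helper: read tab-delimited "name<TAB>description" rows that are
# already sorted by position, and keep the rows from one marker to the other.
between_markers() {
  awk -F '\t' -v a="$1" -v b="$2" '
    $1 == a || $1 == b { hits++ }   # count marker rows as they are seen
    hits > 0 { print }              # print from the first marker onward
    hits > 1 { exit }               # stop once the second marker is printed
  '
}

# Usage: three rows from the table above, flanked by placeholder rows
printf '%s\t%s\n' \
  PLACEHOLDER1 "hypothetical gene before the region" \
  FAAH2 "fatty acid amide hydrolase 2" \
  SPIN2A "spindlin family member 2A" \
  AMER1 "APC membrane recruitment protein 1" \
  PLACEHOLDER2 "hypothetical gene after the region" |
between_markers AMER1 FAAH2
```

The markers are matched against the whole first column, so a gene whose name merely contains a marker name is not treated as a boundary.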

GENES IN A PATHWAY

A gene can be linked to a biochemical pathway that utilizes its product:

  esearch -db gene -query "PAH [GENE]" -organism human |
  elink -target biosystems |
  efilter -pathway wikipathways |

Linking from the pathway record back to the gene database:

  elink -target gene |
  efetch -format docsum |
  xtract -pattern DocumentSummary -element Name Description |
  grep -v pseudogene | grep -v uncharacterized |
  sort -f

returns the set of all genes known to be involved in the pathway:

  AANAT      aralkylamine N-acetyltransferase
  ACADM      acyl-CoA dehydrogenase medium chain
  ACHE       acetylcholinesterase (Cartwright blood group)
  ADCYAP1    adenylate cyclase activating polypeptide 1
  ...

RECURSIVE DEFINITIONS

When a recursive object is given to an exploration command:

  efetch -db taxonomy -id 9606,7227,10090 -format xml |
  xtract -pattern Taxon -element TaxId ScientificName

selection by -element only examines fields in the outermost objects:

  9606     Homo sapiens
  7227     Drosophila melanogaster
  10090    Mus musculus

The star-slash prefix will descend a single level into the hierarchy:

  efetch -db taxonomy -id 9606,7227,10090 -format xml |
  xtract -pattern Taxon -block "*/Taxon" \
    -if Rank -is-not "no rank" \
      -tab "\n" -element TaxId,Rank,ScientificName

to print data on the individual lineage objects:

  2759     superkingdom    Eukaryota
  33208    kingdom         Metazoa
  7711     phylum          Chordata
  89593    subphylum       Craniata
  8287     superclass      Sarcopterygii
  40674    class           Mammalia
  ...

Recursive objects can be fully explored with a double-star-slash prefix:

  esearch -db gene -query "rbcL [GENE] AND maize [ORGN]" |
  efetch -format xml |
  xtract -pattern Entrezgene -block "**/Gene-commentary" \

Metadata annotated in attributes:

  <Gene-commentary_type value="genomic">1</Gene-commentary_type>

is selected with an "at" sign before the attribute name:

    -if Gene-commentary_type@value -equals genomic \
      -tab "\n" -element Gene-commentary_accession |
  sort | uniq

This prints every genomic accession regardless of nesting depth:

  NC_001666
  X86563
  Z11973

HETEROGENEOUS OBJECTS

A query on curated biological database associations:

  nquire -get http://mygene.info/v3/gene/2652 |
  xtract -j2x -set - -rec GeneRec |

returns a heterogeneous mixture of objects in the pathway section:

  <pathway>
    <reactome>
      <id>R-HSA-162582</id>
      <name>Signal Transduction</name>
    </reactome>
    ...
    <wikipathways>
      <id>WP455</id>
      <name>GPCRs, Class A Rhodopsin-like</name>
    </wikipathways>
  </pathway>

The slash-star suffix is used to visit the individual components of a parent object without needing to explicitly specify their names. For printing, the name of a child object is indicated by a question mark:

  xtract -pattern GeneRec -group "pathway/*" \
    -pfc "\n" -element "?,name,id"

This displays a table of pathway database references:

  reactome        Signal Transduction                R-HSA-162582
  reactome        Disease                            R-HSA-1643685
  ...
  reactome        Diseases of signal transduction    R-HSA-5663202
  wikipathways    GPCRs, Class A Rhodopsin-like      WP455

LOCAL PUBMED CACHE

Fetching data from Entrez works well when a few thousand records are needed, but it does not scale for much larger sets of data, where the time it takes to download becomes a limiting factor. EDirect can now preload all 30 million PubMed records onto an inexpensive external 500 GB solid state drive, using a hierarchy of folders to organize the data for rapid retrieval of any record. For example, PMID 12345678 would be stored (as a compressed XML file) at /Archive/12/34/56/12345678.xml.gz.
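
The folder layout can be illustrated with a short sketch that derives the path in the example from an identifier. The function name archive_path is hypothetical, and the sketch only accepts 8-digit PMIDs, because the text does not say how shorter identifiers are filed:

```shell
# Hypothetical helper: map an 8-digit PMID to the archive path layout
# shown in the example (two digits per folder level).
archive_path() {
  case "$1" in
    [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]) ;;
    *) echo "expected an 8-digit PMID" >&2; return 1 ;;
  esac
  a=$(printf '%s\n' "$1" | cut -c 1-2)
  b=$(printf '%s\n' "$1" | cut -c 3-4)
  c=$(printf '%s\n' "$1" | cut -c 5-6)
  printf '/Archive/%s/%s/%s/%s.xml.gz\n' "$a" "$b" "$c" "$1"
}

archive_path 12345678
```

For 12345678 this prints /Archive/12/34/56/12345678.xml.gz, matching the example. Under this layout any record is located by computing its path, with no need to scan a directory.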

Reference the external drive by setting an environment variable in your configuration file:

  export EDIRECT_PUBMED_MASTER=/Volumes/your_disk_name

and run the command:

  index-pubmed

to download the PubMed release files and distribute each record for random access.

Piping a list of 100,000 PMIDs to the fetch-pubmed script, which returns a PubmedArticleSet containing the requested records, takes about 12 seconds. Retrieving those records from NCBI's network service, with efetch -format xml, would take around 40 minutes.

Even moderately large queries can benefit from the local cache. A reverse citation lookup on 191 papers:

  esearch -db pubmed -query "Cozzarelli NR [AUTH]" |
  elink -cited |

takes 7 seconds to match 7134 subsequent articles. Fetching them from the local archive:

  efetch -format uid |
  fetch-pubmed |

is practically instantaneous. Printing the names of each author:

  xtract -pattern PubmedArticle -block Author \
    -sep " " -tab "\n" -element LastName,Initials |

allows creation of a frequency table:

  sort-uniq-count-rank

that lists the authors who most often cited the original papers:

  112    Cozzarelli NR
  73     Maxwell A
  56     Wang JC
  49     Osheroff N
  48     Stasiak A
  ...

Using the network service instead of the local cache would add 2 minutes to the 10 second running time.


LOCAL SEARCH INDEX

A similar divide-and-conquer strategy is used to create a local information retrieval index suitable for large data mining queries.

For PubMed titles and primary abstracts, the indexing process deletes hyphens after specific prefixes, removes accents and diacritical marks, splits words at punctuation characters, corrects encoding artifacts, and spells out Greek letters for easier searching on scientific terms. It then prepares inverted indices with term positions, and uses them to build distributed term lists and postings files.

For example, the term list that includes "cancer" would be located at /Postings/NORM/c/a/n/c/canc.trm. A query on cancer thus only needs to load a very small subset of the total index. This design allows efficient expression evaluation, unrestricted wildcard truncation, phrase queries, and proximity searches.
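
The same layout can be sketched for the term lists. The function name postings_path is hypothetical, and the sketch assumes that the file is named after the first four characters of the term, as in the "cancer" example. Shorter terms are rejected because their placement is not described here:

```shell
# Hypothetical helper: derive the term list path for a field and a term,
# following the pattern of the "cancer" example.
postings_path() {
  field="$1"
  term="$2"
  if [ "${#term}" -lt 4 ]
  then
    echo "term must have at least 4 characters" >&2
    return 1
  fi
  c1=$(printf '%s\n' "$term" | cut -c 1)
  c2=$(printf '%s\n' "$term" | cut -c 2)
  c3=$(printf '%s\n' "$term" | cut -c 3)
  c4=$(printf '%s\n' "$term" | cut -c 4)
  printf '/Postings/%s/%s/%s/%s/%s/%s.trm\n' \
    "$field" "$c1" "$c2" "$c3" "$c4" "$c1$c2$c3$c4"
}

postings_path NORM cancer
```

For NORM and cancer this prints /Postings/NORM/c/a/n/c/canc.trm. Every term that starts with "canc" resolves to the same file, which would be consistent with the truncation support mentioned above.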

The full set of indexed terms, without record counts, can be printed for any field:

  phrase-search -terms NORM

In local queries, a trailing asterisk is used to indicate term truncation:

  phrase-search -count "catabolite repress*"

Using -counts returns expanded terms and individual postings counts:

  phrase-search -counts "catabolite repress*"

Query evaluation includes Boolean operations and parenthetical expressions:

  phrase-search -query "(literacy AND numeracy) NOT (adolescent OR child)"

Adjacent words in the query are treated as a contiguous phrase:

  phrase-search -query "selective serotonin reuptake inhibit*"

More inclusive searches can use words processed by the Porter2 stemming algorithm:

  phrase-search -query "monoamine oxidase inhibitor [STEM]"

Each plus sign will replace a single word inside a phrase:

  phrase-search -query "vitamin c + + common cold"

Runs of tildes indicate the maximum distance between phrases:

  phrase-search -query "vitamin c ~ ~ common cold"

MeSH hierarchy codes and years of publication are also indexed:

  phrase-search -query "C14.907.617.812* [TREE] AND 2015:2018 [YEAR]"

An exact match can search for all or part of a title or abstract:

  phrase-search -exact "Genetic Control of Biochemical Reactions in Neurospora."

All query commands return a list of PMIDs, which can be piped directly to fetch-pubmed to retrieve the records. For example:

  phrase-search -query "selective serotonin ~ ~ reuptake inhibitor*" |
  fetch-pubmed |
  xtract -pattern PubmedArticle -num Author |
  sort-uniq-count -n |
  reorder-columns 2 1 |
  head -n 25 |
  tee /dev/tty |
  xy-plot auth.png

performs a proximity search with dynamic wildcard expansion (matching phrases like "selective serotonin and norepinephrine reuptake inhibitors") and fetches 11985 PubMed records from the local archive. It then counts the number of authors for each paper, printing a frequency table of the number of papers per number of coauthors:

  0    49
  1    1335
  2    1790
  3    1802
  4    1633
  5    1425
  6    1117
  7    891
  8    584
  9    398
  ...

and creating a visual graph of the data. The entire set of commands runs in under 3 seconds.

NATURAL LANGUAGE PROCESSING RESOURCES

Additional annotation on PubMed can be downloaded and indexed by running:

  index-extras

NCBI's Biomedical Text Mining Group performs computational analysis of PubMed and PMC papers, and extracts chemical, disease, and gene references from the article contents (see PMID 31114887). Along with NLM Gene Reference Into Function mappings (see PMID 14728215), these terms are indexed in CHEM, DISZ, and GENE fields.

Recent research at Stanford defined biological themes, supported by dependency paths, which are indexed as THME and PATH fields. Theme keys are taken from a table in the paper (see PMID 29490008), but disambiguated so themes common to multiple relationships can reside in a single index:

  Chemical-Gene

    A+    agonism, activation
    A-    antagonism, blocking
    Bc    binding, ligand (especially receptors)
    Ec+   increases expression/production
    Ec-   decreases expression/production
    Ec    affects expression/production (neutral)
    N     inhibits

  Gene-Chemical

    O     transport, channels
    K     metabolism, pharmacokinetics
    Z     enzyme activity

  Chemical-Disease

    T     treatment/therapy (including investigatory)
    C     inhibits cell growth (especially cancers)
    Sa    side effect/adverse event
    Pr    prevents, suppresses
    Pa    alleviates, reduces
    Jc    role in disease pathogenesis

  Disease-Chemical

    Mp    biomarkers (of disease progression)

  Gene-Disease

    U     causal mutations
    Ud    mutations affecting disease course
    D     drug targets
    Jg    role in pathogenesis
    Te    possible therapeutic effect
    Y     polymorphisms alter risk
    G     promotes progression

  Disease-Gene

    Md    biomarkers (diagnostic)
    X     overexpression in disease
    L     improper regulation linked to disease

  Gene-Gene

    Bg    binding, ligand (especially receptors)
    W     enhances response
    V+    activates, stimulates
    Eg+   increases expression/production
    Eg    affects expression/production (neutral)
    I     signaling pathway
    H     same protein or complex
    Rg    regulation
    Q     production by cell population

INTEGRATION WITH ENTREZ

The phrase-search -filter command allows UIDs to be generated by an EDirect search and then incorporated as a component in a local query:

  esearch -db pubmed -query "complement system proteins [MESH]" -pub clinical |
  efetch -format uid |
  phrase-search -filter "L [THME] AND D03* [TREE]"

This finds PubMed clinical papers about complement proteins and limits them by the "improper regulation linked to disease" theme and the heterocyclic compounds MeSH chemical code:

  7683550
  19235040
  20587159
  22368276
  24431228
  26151457

Intermediate lists of PMIDs can be saved to a file and piped (with "cat") into a subsequent phrase-search -filter query.

AUTOMATION AND COMPREHENSIVE EXPLORATION

The phrase-search system can be easily automated. For example, a simple script can walk up the MeSH hierarchy:

  ascend_mesh_tree() {
    var="${1%\*}"
    while :
    do
      phrase-search -count "$var* [TREE]"
      case "$var" in
        *.* ) var="${var%????}" ;;
        *   ) break             ;;
      esac
    done
  }

  ascend_mesh_tree "C14.907.617.812"

from narrower to broader topics, producing counts of records at or below each level:

  6678       c14 907 617 812*
  52001      c14 907 617*
  1618720    c14 907*
  2313378    c14*
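
The trimming step works because each level below the top of a MeSH tree number is a period followed by three digits, so removing the last four characters moves up one level. The dry run below, with the hypothetical name list_mesh_ancestors, prints the successive query terms without calling phrase-search, so it can be tried without a local index:

```shell
# Hypothetical dry run of the loop above: print each truncated tree code
# instead of querying the local index.
list_mesh_ancestors() {
  var="${1%\*}"                        # drop a trailing asterisk, if present
  while :
  do
    printf '%s*\n' "$var"
    case "$var" in
      *.* ) var="${var%????}" ;;       # remove ".NNN" to move up one level
      *   ) break             ;;       # no period left, top of the tree
    esac
  done
}

list_mesh_ancestors "C14.907.617.812"
```

This prints C14.907.617.812*, C14.907.617*, C14.907*, and C14*, one per line, which are the four levels counted in the table above.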

Nested "for" loops perform a non-redundant pairwise comparison of themes:

  declare -a THEMES
  THEMES=( A+ A- Bc Bg C D Ec Ec+ Ec- Eg \
           Eg+ G H I Jc Jg K L Md Mp N O Pa \
           Pr Q Rg Sa T Te U Ud V+ W X Y Z )
  declare -a REMAINS
  REMAINS=("${THEMES[@]:1}")

  for fst in "${THEMES[@]}"
  do
    num=$(phrase-search -query "$fst [THME]" | wc -l)
    echo -e "$fst\t \t$num"
    for scd in "${REMAINS[@]}"
    do
      num=$(phrase-search -query "$fst [THME] AND $scd [THME]" | wc -l)
      echo -e "$fst\t$scd\t$num"
      echo -e "$scd\t$fst\t$num"
    done
    REMAINS=("${REMAINS[@]:1}")
  done | sort | expand -t 7,13

producing a table of co-occurrence counts:

  A+              28322
  A+     A-        1305
  A+     Bc        5634
  A+     Bg        1645
  A+     C         2188
  A+     D          633
  A+     Ec       10364
  ...

Shrinking arrays avoid redundant searches (e.g., running both "A+ AND Ec" and "Ec AND A+"), while each result is still reported in both directions.
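
The pair enumeration can be tried on its own, without a local index. The sketch below is a variant that uses the positional parameters and shift instead of Bash arrays, so it should also run in a plain POSIX shell. The function name list_pairs is hypothetical:

```shell
# Hypothetical helper: print each unordered pair of arguments exactly once.
# "$@" plays the role of the shrinking array; shift drops the first item.
list_pairs() {
  while [ "$#" -gt 0 ]
  do
    fst="$1"
    shift
    for scd in "$@"
    do
      printf '%s\t%s\n' "$fst" "$scd"
    done
  done
}

list_pairs A+ A- Bc Bg
```

Four items yield six pairs. For the 36 themes listed above, the same approach needs 630 pair queries rather than the 1260 that a full cross product (excluding self-pairs) would run.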

IDENTIFIER CONVERSION

The index-pubmed script downloads MeSH descriptor files from NLM and creates a conversion file:

  ...
  <Rec>
    <Code>D064007</Code>
    <Name>Ataxia Telangiectasia Mutated Proteins</Name>
    <Tree>D08.811.913.696.620.682.700.097</Tree>
    <Tree>D12.776.157.687.125</Tree>
    <Tree>D12.776.660.720.125</Tree>
  </Rec>
  ...

that can be used for mapping MeSH codes to and from chemical or disease names. For example, running:

  cat "$EDIRECT_PUBMED_MASTER"/Data/meshconv.xml |
  xtract -pattern Rec \
    -if Name -starts-with "ataxia telangiectasia" \
      -element Code

will return:

  C565779
  C576887
  D001260
  D064007

RAPIDLY SCANNING ALL OF PUBMED

If the expand-current script is run after PubMed indexing, an ad hoc scan can be performed on the entire set of live PubMed records:

  cat "$EDIRECT_PUBMED_MASTER"/Current/*.xml |
  xtract -timer -pattern PubmedArticle \
    -if "#Author" -eq 7 \
      -element MedlineCitation/PMID LastName

in this case finding articles with seven authors. (Author count is not indexed by Entrez or locally by EDirect.)

(Note that the data produced by running both index-extras and expand-current may not fit on a 500 GB drive.)

IMPLEMENTATION DETAILS

Xtract uses the Boyer-Moore-Horspool algorithm to partition an XML stream into individual records, sending them down a thread-safe communication channel to be distributed among multiple instances of the data exploration and extraction function. On a modern six-core computer, it can process the previous query on all 30 million PubMed records in just under 4 minutes, a sustained rate of over 125,000 records per second.
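
The quoted rate follows from simple arithmetic. The sketch below takes the run time as 240 seconds, which is an upper bound because the text says just under 4 minutes. The function name records_per_second is hypothetical:

```shell
# Hypothetical helper: integer records-per-second rate
records_per_second() {
  echo $(( $1 / $2 ))
}

records_per_second 30000000 240
```

This prints 125000. A run shorter than 240 seconds therefore corresponds to a rate above 125,000 records per second.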

Rchive is used to build and search the inverted retrieval system. The fetch-pubmed and phrase-search scripts are front-ends to rchive, which is also multi-threaded for speed. It can match several PubMed titles per second, fetching the positional indices for all terms in parallel before evaluating the title words as a contiguous phrase.

Both xtract and rchive are written in Google's compiled Go language. Local archive and search is a completely self-contained, turnkey system, with no need for a novice user to download and configure complicated third-party database software.

EXPLORATION OF EXTERNAL SERVICES

The experimental xplore script expands the EDirect paradigm to navigate connections in the biological resources of the BioThings.io data integration project at Scripps Research (see PMID 23175613). A drug repurposing example:

  xplore -load hgvs "chr6:g.26093141G>A,chr12:g.111351981C>T" |
  xplore -link ncbigene |
  xplore -link wikipathways |
  xplore -link ncbigene |
  xplore -link uniprot |
  xplore -link inchikey |
  xplore -save uid

runs in 18 seconds and returns 1030 chemicals that might act on gene products in two pathways, and would thus be candidates for treating hereditary hemochromatosis or hypertrophic cardiomyopathy. There is initial support in xplore -search for -organism and -action shortcuts, similar to what is available in efilter.

CONVERSION OF JSON TO XML

Consolidated gene information retrieved in JSON format:

  nquire -get http://mygene.info/v3 gene 3043 |

contains a multi-dimensional JSON array of exon coordinates:

  "position": [
    [
      5225463,
      5225726
    ],
    [
      5226576,
      5226799
    ],
    [
      5226929,
      5227071
    ]
  ],

This can be converted to XML with xtract -j2x:

  xtract -j2x -set - -rec GeneRec -nest plural |

using -nest plural to derive a parent name that keeps the internal structure intact in XML:

  <positions>
    <position>5225463</position>
    <position>5225726</position>
  </positions>
  ...

Individual exons can then be visited by piping the record through:

  xtract -pattern GeneRec -group exons -block positions \
    -pfc "\n" -element position

to print a tab-delimited table of start and stop positions:

  5225463    5225726
  5226576    5226799
  5226929    5227071

CONVERSION OF TABLES TO XML

Tab-delimited data can be converted to XML with xtract -t2x:

  nquire -ftp ftp.ncbi.nlm.nih.gov gene/DATA gene_info.gz |
  gunzip -c | grep -v NEWENTRY | cut -f 2,3 |
  xtract -t2x -set Set -rec Rec -skip 1 -indent Code Name

which takes command-line arguments of XML tag names for wrapping the entire set, each record, and individual columns:

  <Set>
    <Rec>
      <Code>1246500</Code>
      <Name>repA1</Name>
    </Rec>
    <Rec>
      <Code>1246501</Code>
      <Name>repA2</Name>
    </Rec>
    <Rec>
      <Code>1246502</Code>
      <Name>leuA</Name>
    </Rec>
    ...
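
To make the transformation concrete, the following sketch performs a similar conversion with awk. It is an illustration of the output layout only, not how xtract -t2x is implemented. The function name table_to_xml is hypothetical, and unlike the command above it does not skip a header line:

```shell
# Hypothetical helper: wrap tab-delimited rows in XML.
# Arguments: set tag, record tag, then one tag name per column.
table_to_xml() {
  set_tag="$1"
  rec_tag="$2"
  shift 2
  awk -F '\t' -v settag="$set_tag" -v rectag="$rec_tag" -v cols="$*" '
    function esc(v) {                 # escape XML special characters
      gsub(/&/, "\\&amp;", v)
      gsub(/</, "\\&lt;", v)
      gsub(/>/, "\\&gt;", v)
      return v
    }
    BEGIN { n = split(cols, name, " "); print "<" settag ">" }
    {
      print "  <" rectag ">"
      for (i = 1; i <= n && i <= NF; i++)
        print "    <" name[i] ">" esc($i) "</" name[i] ">"
      print "  </" rectag ">"
    }
    END { print "</" settag ">" }
  '
}

# Usage with the first two rows shown above
printf '1246500\trepA1\n1246501\trepA2\n' |
table_to_xml Set Rec Code Name
```

Columns beyond the supplied tag names are ignored in this sketch.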

FUTURE DIRECTIONS

An iterative search/fetch/extract/compute cycle, with customized local indices, integration of natural language processing results, and no penalties for exhaustive exploration, has the potential for opening up discovery by computation to a larger audience of laboratory biologists without requiring extensive bioinformatics experience.

INSTALLATION

EDirect consists of a set of scripts and programs that are downloaded to the user's computer.

EDirect will run on Unix and Macintosh computers that have the Perl language installed, and under the Cygwin Unix-emulation environment on Windows PCs.

To install the EDirect software, open a terminal window and execute one of the following two commands:

  sh -c "$(curl -fsSL ftp://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/install-edirect.sh)"

  sh -c "$(wget -q ftp://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/install-edirect.sh -O -)"

If neither curl nor wget is available, see the installation commands in the EDirect web documentation.

This downloads several scripts into an "edirect" folder in the user's home directory. It then fetches any missing Perl modules, and installs platform-specific executables for xtract and rchive.

At the end of this process, the script will ask for permission to add EDirect to your PATH permanently by editing your configuration file. If you answer "y" it will add:

  export PATH=${PATH}:$HOME/edirect

to the end of your .bash_profile file. If you answer "n", you should then manually edit .bash_profile to add the edirect folder as one of the components of your existing PATH assignment statement.
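
For a manual edit, one option is a guarded assignment that can be sourced more than once without adding a duplicate entry. This is a sketch rather than what the installer writes, and the function name add_to_path is hypothetical:

```shell
# Hypothetical helper: append a folder to PATH only if it is not already there
add_to_path() {
  case ":$PATH:" in
    *":$1:"*) ;;                 # already present, nothing to do
    *) PATH="$PATH:$1" ;;
  esac
}

add_to_path "$HOME/edirect"
export PATH
```

Wrapping PATH in colons before matching ensures that only a complete entry counts as a match, not a folder whose name merely starts or ends with the same text.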

DOCUMENTATION

Documentation for EDirect is on the web at:

  http://www.ncbi.nlm.nih.gov/books/NBK179288

NCBI database resources are described by:

  https://www.ncbi.nlm.nih.gov/pubmed/31602479

How to obtain an API Key is described in this NCBI blog post:

  https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities

Questions or comments on EDirect may be sent to info@ncbi.nlm.nih.gov.