summaryrefslogtreecommitdiff
path: root/README
blob: 9f9a53fc896eb95c83113a4df152223a6451fed7 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
ENTREZ DIRECT: COMMAND LINE ACCESS TO NCBI ENTREZ DATABASES

Searching, retrieving, and parsing data from NCBI databases through the Unix command line.

INTRODUCTION

Entrez Direct (EDirect) provides access to the NCBI's suite of interconnected databases from a Unix terminal window. Search terms are entered as command-line arguments. Individual operations are connected with Unix pipes to allow construction of multi-step queries. Selected records can then be retrieved in a variety of formats.

EDirect also includes an argument-driven utility that simplifies the extraction of results in structured XML or JSON format, and a program that builds a URL from command-line arguments for easy access to external CGI and FTP data services. These can eliminate the need for writing custom software to answer ad hoc questions.

Queries can move seamlessly between EDirect programs and Unix utilities or scripts to perform actions that cannot be accomplished entirely within Entrez.

PROGRAMMATIC ACCESS

EDirect connects to Entrez through the Entrez Programming Utilities interface. It supports searching by indexed terms, looking up precomputed neighbors or links, filtering results by date or category, and downloading record summaries or reports.

EDirect navigation programs (esearch, elink, efilter, and efetch) communicate by means of a small structured message, which can be passed invisibly between operations with a Unix pipe. The message includes the current database, so it does not need to be given as an argument after the first step.

All EDirect programs are designed to work on large sets of data. Intermediate results are stored on the Entrez history server. For best performance, obtain an API Key from NCBI, and place the following line in your .bash_profile or .zshrc configuration file:

  export NCBI_API_KEY=unique_api_key_goes_here

Each program also has a -help command that prints detailed information about available arguments.

NAVIGATION FUNCTIONS

Esearch performs a new Entrez search using terms in indexed fields. It requires a -db argument for the database name and uses -query to supply the search terms. For PubMed, without field qualifiers, the server uses automatic term mapping to compose a search strategy by translating the supplied query:

  esearch -db pubmed -query "selective serotonin reuptake inhibitor"

Search terms can also be qualified with bracketed field names:

  esearch -db pubmed -query "Federhen S [AUTH] AND Nucleic Acids Res [JOUR]"

  esearch -db nuccore -query "insulin [PROT] AND rodents [ORGN]"

Elink looks up precomputed neighbors within a database:

  elink -related

or finds associated records in other databases:

  elink -target gene

Elink also accesses the NIH Open Citation Collection dataset (see PMID 31600197) to follow the reference lists of PubMed records:

  elink -cites

or to find publications that cite the selected PubMed articles:

  elink -cited

Efilter limits the results of a previous query, with shortcuts that can also be used in esearch:

  efilter -molecule genomic -location chloroplast -country sweden -days 365

Efetch downloads selected records or reports in a designated format:

  efetch -format abstract

Individual query commands are connected by a Unix vertical bar pipe symbol:

  esearch -db pubmed -query "tn3 transposition immunity" | efetch -format medline

DISCOVERY BY NAVIGATION

PubMed related articles are calculated by a statistical text retrieval algorithm using the title, abstract, and medical subject headings (MeSH terms). The connections between papers can be used for making discoveries. An example of this is finding the last enzymatic step in the vitamin A biosynthetic pathway.

Lycopene cyclase in plants converts lycopene to β-carotene, the immediate dietary precursor of vitamin A. An initial search on the enzyme:

  esearch -db pubmed -query "lycopene cyclase" |

finds 258 articles. Looking up precomputed neighbors:

  elink -related |

returns 15,827 PubMed papers, some of which might be expected to discuss adjacent steps in the pathway. Since beta-carotene is an essential nutrient, we can eliminate all earlier steps by restricting those results to enzymes in animals. To do so, we first link from publications to proteins, all of which are annotated with standardized organism information from the NCBI taxonomy:

  elink -target protein |

This finds 644,215 protein sequence records. Limiting these results to curated proteins in mice:

  efilter -organism mouse -source refseq |

matches only 34 sequences, which is small enough to examine by retrieving the individual records:

  efetch -format fasta

As anticipated, the results include the enzyme that splits beta-carotene into two molecules of retinal:

  ...
  >NP_067461.2 beta,beta-carotene 15,15'-dioxygenase isoform 1 [Mus musculus]
  MEIIFGQNKKEQLEPVQAKVTGSIPAWLQGTLLRNGPGMHTVGESKYNHWFDGLALLHSFSIRDGEVFYR
  SKYLQSDTYIANIEANRIVVSEFGTMAYPDPCKNIFSKAFSYLSHTIPDFTDNCLINIMKCGEDFYATTE
  TNYIRKIDPQTLETLEKVDYRKYVAVNLATSHPHYDEAGNVLNMGTSVVDKGRTKYVIFKIPATVPDSKK
  ...

The entire set of commands runs in 50 seconds. There is no need to use a script to loop over records one at a time, or write code to retry after a transient network failure, or add a time delay between requests. All of these features are already built into the EDirect commands.

XML DATA EXTRACTION

The ability to obtain Entrez records in structured format, and to easily extract the underlying data, allows the user to ask novel questions that are not addressed by existing analysis software.

The xtract program uses command-line arguments to direct the conversion of eXtensible Markup Language data. The -pattern command partitions an XML stream by object name into individual records that are processed separately. Within each record, the -element command does an exhaustive, depth-first search to find data content by field name. Explicit paths to objects are not needed.

FORMAT CUSTOMIZATION

By default, the -pattern argument divides the results into rows, while placement of data into columns is controlled by -element, to create a tab-delimited table.

Formatting commands allow extensive customization of the output. The line break between -pattern rows can be changed with -ret, and the tab character between -element fields can be replaced by -tab. The -sep comand is used to distinguish multiple elements of the same type, and controls their separation independently of the -tab command. The following query:

  efetch -db pubmed -id 6271474,6092233,16589597 -format docsum |
  xtract -pattern DocumentSummary -sep "|" -element Id PubDate Name

returns a table with individual author names separated by vertical bars:

  6271474     1981            Casadaban MJ|Chou J|Lemaux P|Tu CP|Cohen SN
  6092233     1984 Jul-Aug    Calderon IL|Contopoulou CR|Mortimer RK
  16589597    1954 Dec        Garber ED

The -sep value also applies to distinct -element arguments that are grouped with commas. This can be used to keep data from multiple related fields in the same column:

  -sep " " -element Initials,LastName

The -def command sets a default placeholder to be printed when none of the comma-separated fields in an -element clause are present:

  -def "-" -sep " " -element Year,Month,MedlineDate

ELEMENT VARIANTS

Derivatives of -element were created to eliminate the need for writing post-processing scripts to perform otherwise trivial modifications or analyses on extracted data. They are subdivided into several categories. A representative selection of commands is shown below:

  Positional:       -first, -last

  Numeric:          -num, -len, -inc, -dec, -sum, -min, -max, -avg, -dev, -med

  Text:             -encode, -plain, -upper, -lower, -title, -words, -reverse

  Sequence:         -revcomp, -fasta, -ncbi2na, -molwt

  Coordinates:      -0-based, -1-based, -ucsc-based

  Miscellaneous:    -year, -histogram

EXPLORATION CONTROL

Exploration commands provide fine control over the order in which XML record contents are examined, by separately presenting each instance of the chosen subregion. This limits what subsequent commands "see", and allows related fields in an object to be kept together.

In contrast to the simpler DocumentSummary format, records retrieved as PubmedArticle XML:

  efetch -db pubmed -id 1413997 -format xml |

have authors with separate fields for last name and initials:

  <Author>
    <LastName>Mortimer</LastName>
    <Initials>RK</Initials>
  </Author>

Without being given any guidance about context, an -element command on initials and last names:

  xtract -pattern PubmedArticle -element Initials LastName

will explore the current record for each argument in turn, and thus print all author initials followed by all author last names:

  RK    CR    JS    Mortimer    Contopoulou    King

Inserting a -block command redirects data exploration to present each author one at a time:

  xtract -pattern PubmedArticle -block Author -element Initials LastName

Each time through the loop, the -element command only sees the current author's values, which restores the correct association of initials and last names in the output:

  RK    Mortimer    CR    Contopoulou    JS    King

Grouping the two author subfields with a comma, and adjusting the -sep -and -tab values:

  xtract -pattern PubmedArticle \
    -block Author -sep " " -tab ", " -element Initials,LastName

produces a more desirable formatting of author names:

  RK Mortimer, CR Contopoulou, JS King

NESTED EXPLORATION

Exploration command names (-group, -block, and -subset) are assigned to a precedence hierarchy:

  -pattern > -group > -block > -subset > -element

and are combined in ranked order to control object iteration at progressively deeper levels in the XML data structure. Each command argument acts as a "nested for-loop" control variable, retaining information about the context, or state of exploration, at its level.

(Hypothetical) census data would need several nested loops to visit each unique address in context:

  -pattern State -group City -block Street -subset Number -element Resident

MeSH terms can have their own unique set of qualifiers, with a major topic attribute on each object:

  <MeshHeading>
    <DescriptorName MajorTopicYN="N">beta-Galactosidase</DescriptorName>
    <QualifierName MajorTopicYN="Y">genetics</QualifierName>
    <QualifierName MajorTopicYN="N">metabolism</QualifierName>
  </MeshHeading>

Since -element does its own exploration for objects within its current scope, a -block command:

  -block MeshHeading -sep " / " -element DescriptorName,QualifierName

is sufficient for grouping each MeSH name with its qualifiers:

  beta-Galactosidase / genetics / metabolism

Adding -subset commands within the -block visits each individual descriptor or qualifier object on the current MeSH term:

  efetch -db pubmed -id 6162838 -format xml |
  xtract -transform <( echo -e "Y\t*\n" ) \
    -pattern PubmedArticle -element MedlineCitation/PMID \
      -block MeshHeading -clr \
        -subset DescriptorName -plg "\n" -tab "" \
          -translate "@MajorTopicYN" -element DescriptorName \
        -subset QualifierName -plg " / " -tab "" \
          -translate "@MajorTopicYN" -element QualifierName

and keeps major topic attributes associated with their parent objects. The -transform and -translate commands convert the "Y" attribute value to an asterisk for indicating major topics:

  6162838
  Base Sequence
  *DNA, Recombinant
  Escherichia coli / genetics
  ...
  RNA, Messenger / *genetics
  Transcription, Genetic
  beta-Galactosidase / *genetics / metabolism

(Note that "-element MedlineCitation/PMID" uses the parent-slash-child construct to prevent the display of additional PMID items that may be present later in CommentsCorrections objects.)

CONDITIONAL EXECUTION

Conditional processing commands (-if, -unless, -and, -or, and -else) restrict object exploration by data content. These may be used in conjunction with string, numeric, or object constraints:

  esearch -db pubmed -query "Casadaban MJ [AUTH]" |
  efetch -format xml |
  xtract -pattern PubmedArticle -if "#Author" -lt 6 \
    -block Author -if LastName -is-not Casadaban \
      -sep ", " -tab "\n" -element LastName,Initials |
  sort-uniq-count-rank

This will select papers with fewer than 6 authors and print a table of the most frequent coauthors:

  11    Chou, J
  8     Cohen, SN
  7     Groisman, EA
  ...

Numeric constraints can also compare the integer values of two fields. This can be used to find genes that are encoded on the minus strand of a nucleotide sequence:

  -if ChrStart -gt ChrStop

Object constraints will compare the string values of two named fields, and can look for internal inconsistencies between fields whose contents should (in most cases) be identical:

  -if Chromosome -differs-from ChrLoc

The -position command restricts presentation of objects by relative location or index number:

  -block Author -position last -sep ", " -element LastName,Initials

SAVING DATA IN VARIABLES

A value can be recorded in a variable and used wherever needed. Variables are created by a hyphen followed by a name consisting of a string of capital letters or digits (e.g., -PMID). Values are retrieved by placing an ampersand before the variable name (e.g., "&PMID") in an -element statement:

  efetch -db pubmed -id 3201829,6301692,781293 -format xml |
  xtract -pattern PubmedArticle -PMID MedlineCitation/PMID \
    -block Author -element "&PMID" \
      -sep " " -tab "\n" -element Initials,LastName

This produces a list of authors, with the PubMed identifier in the first column of each row:

  3201829    JR Johnston
  3201829    CR Contopoulou
  3201829    RK Mortimer
  6301692    MA Krasnow
  6301692    NR Cozzarelli
  781293     MJ Casadaban

The variable can be used even though the original object is no longer visible inside the -block section.

Variables can be initialized with a literal value inside parentheses:

  -block Author -sep " " -tab "" -element "&COM" Initials,LastName -COM "(, )"

They can also be used as arguments in a conditional statement:

  -ABST AbstractText -block Article -if "Journal/Title" -is-within "&ABST"

SEQUENCE QUALIFIERS

The NCBI represents sequence records in a data model based on the central dogma of molecular biology. Each sequence can have multiple features, which contain information about the biology of a given region, including the transformations involved in gene expression. Each feature can have multiple qualifiers, which store specific details about that feature (e.g., name of the gene, genetic code used for translation, accession of the product sequence, cross-references to external databases).

The data hierarchy is explored using a -pattern {sequence} -group {feature} -block {qualifier} construct. As a convenience, the -insd helper function generates the appropriate nested extraction commands from feature and qualifier names on the command line. For example, processing the results of a search on cone snail venom:

  esearch -db protein -query "conotoxin" -feature mat_peptide |
  efetch -format gpc |
  xtract -insd complete mat_peptide "%peptide" product mol_wt peptide |
  grep -i conotoxin | sort -t $'\t' -u -k 2,2n

prints the accession number, peptide length, product name, calculated molecular weight, and sequence for a sample of neurotoxic peptides:

  ADB43131.1    15    conotoxin Cal 1b      1708    LCCKRHHGCHPCGRT
  ADB43128.1    16    conotoxin Cal 5.1     1829    DPAPCCQHPIETCCRR
  AIC77105.1    17    conotoxin Lt1.4       1705    GCCSHPACDVNNPDICG
  ADB43129.1    18    conotoxin Cal 5.2     2008    MIQRSQCCAVKKNCCHVG
  ADD97803.1    20    conotoxin Cal 1.2     2206    AGCCPTIMYKTGACRTNRCR
  ...

GENES IN A REGION

To list all genes between two markers flanking the human X chromosome centromere, first retrieve the protein-coding gene records on that chromosome:

  esearch -db gene -query "Homo sapiens [ORGN] AND X [CHR]" |
  efilter -status alive -type coding | efetch -format docsum |

Gene names and chromosomal positions are extracted by piping the records to:

  xtract -pattern DocumentSummary -NAME Name -DESC Description \
    -block GenomicInfoType -if ChrLoc -equals X \
      -min ChrStart,ChrStop -element "&NAME" "&DESC" |

Exploring each GenomicInfoType is needed because of pseudoautosomal regions at the ends of the X and Y chromosomes. Without limiting to chromosome X, the copy of IL9R near the "q" telomere of chromosome Y would be erroneously placed with genes that are near the X chromosome centromere.

Results can now be sorted by position, and then filtered and partitioned:

  sort -k 1,1n | cut -f 2- |
  grep -v pseudogene | grep -v uncharacterized |
  between-two-genes AMER1 FAAH2

to produce a table of known genes located between the two markers:

  FAAH2      fatty acid amide hydrolase 2
  SPIN2A     spindlin family member 2A
  ZXDB       zinc finger X-linked duplicated B
  NLRP2B     NLR family pyrin domain containing 2B
  ZXDA       zinc finger X-linked duplicated A
  SPIN4      spindlin family member 4
  ARHGEF9    Cdc42 guanine nucleotide exchange factor 9
  AMER1      APC membrane recruitment protein 1

GENES IN A PATHWAY

A gene can be linked to the biochemical pathways in which it participates:

  esearch -db gene -query "PAH [GENE]" -organism human |
  elink -target biosystems |
  efilter -pathway wikipathways |

Linking from a pathway record back to the gene database:

  elink -target gene |
  efetch -format docsum |
  xtract -pattern DocumentSummary -element Name Description |
  grep -v pseudogene | grep -v uncharacterized | sort -f

returns the set of all genes known to be involved in the pathway:

  AANAT    aralkylamine N-acetyltransferase
  ACADM    acyl-CoA dehydrogenase medium chain
  ACHE     acetylcholinesterase (Cartwright blood group)
  ...

GENE SEQUENCE

Genes encoded on the minus strand of a sequence:

  esearch -db gene -query "DDT [GENE] AND mouse [ORGN]" |
  efetch -format docsum |
  xtract -pattern GenomicInfoType -element ChrAccVer ChrStart ChrStop |

have coordinates where the start position is greater than the stop:

  NC_000076.6    75773373    75771232

These can be read by a "while" loop:

  while IFS=$'\t' read acn str stp
  do
    efetch -db nuccore -format gb \
      -id "$acn" -chr_start "$str" -chr_stop "$stp"
  done

to return the reverse-complemented subregion in GenBank format:

  LOCUS       NC_000076               2142 bp    DNA     linear   CON 08-AUG-2019
  DEFINITION  Mus musculus strain C57BL/6J chromosome 10, GRCm38.p6 C57BL/6J.
  ACCESSION   NC_000076 REGION: complement(75771233..75773374)
  VERSION     NC_000076.6
  ...
  FEATURES             Location/Qualifiers
       source          1..2142
                       /organism="Mus musculus"
                       /mol_type="genomic DNA"
                       /strain="C57BL/6J"
                       /db_xref="taxon:10090"
                       /chromosome="10"
       gene            1..2142
                       /gene="Ddt"
       mRNA            join(1..159,462..637,1869..2142)
                       /gene="Ddt"
                       /product="D-dopachrome tautomerase"
                       /transcript_id="NM_010027.1"
       CDS             join(52..159,462..637,1869..1941)
                       /gene="Ddt"
                       /codon_start=1
                       /product="D-dopachrome decarboxylase"
                       /protein_id="NP_034157.1"
                       /translation="MPFVELETNLPASRIPAGLENRLCAATATILDKPEDRVSVTIRP
                       GMTLLMNKSTEPCAHLLVSSIGVVGTAEQNRTHSASFFKFLTEELSLDQDRIVIRFFP
                       ...

The reverse complement of a plus-strand sequence range can be selected with efetch -revcomp.

RECURSIVE DEFINITIONS

When a recursively defined object is given to an exploration command:

  efetch -db taxonomy -id 9606,7227,10090 -format xml |
  xtract -pattern Taxon -element TaxId ScientificName

the -element command only examines fields in the topmost level:

  9606     Homo sapiens
  7227     Drosophila melanogaster
  10090    Mus musculus

The star-slash-child construct will descend a single level into the hierarchy:

  efetch -db taxonomy -id 9606,7227,10090 -format xml |
  xtract -pattern Taxon -block "*/Taxon" \
    -if Rank -is-not "no rank" \
      -tab "\n" -element TaxId,Rank,ScientificName

to print data on the individual lineage objects:

  2759     superkingdom    Eukaryota
  33208    kingdom         Metazoa
  7711     phylum          Chordata
  89593    subphylum       Craniata
  ...

Recursive objects can be fully explored with a double-star-slash-child construct:

  esearch -db gene -query "rbcL [GENE] AND maize [ORGN]" |
  efetch -format xml |
  xtract -pattern Entrezgene -block "**/Gene-commentary" \

Metadata annotated in an attribute:

  <Gene-commentary_type value="genomic">1</Gene-commentary_type>

is selected with an "at" sign before the attribute name:

    -if Gene-commentary_type@value -equals genomic \
      -tab "\n" -element Gene-commentary_accession |
  sort | uniq

This prints every genomic accession regardless of nesting depth:

  NC_001666
  X86563
  Z11973

HETEROGENEOUS OBJECTS

The nquire program uses command-line arguments to request data from external CGI or FTP servers. A query on a curated biological database service developed at the Scripps Research Institute (see PMID 23175613):

  nquire -get http://mygene.info/v3/gene/2652 |
  xtract -j2x -set - -rec GeneRec |

returns data containing a heterogeneous mixture of objects in the pathway section:

  <pathway>
    <reactome>
      <id>R-HSA-162582</id>
      <name>Signal Transduction</name>
    </reactome>
    ...
    <wikipathways>
      <id>WP455</id>
      <name>GPCRs, Class A Rhodopsin-like</name>
    </wikipathways>
  </pathway>

The parent-slash-star construct is used to visit the individual components of a parent object without needing to explicitly specify their names. For printing, the name of a child object is indicated by a question mark:

  xtract -pattern GeneRec -group "pathway/*" \
    -pfc "\n" -element "?,name,id"

This displays a table of pathway database references:

  reactome        Signal Transduction                R-HSA-162582
  reactome        Disease                            R-HSA-1643685
  ...
  reactome        Diseases of the neuronal system    R-HSA-9675143
  wikipathways    GPCRs, Class A Rhodopsin-like      WP455

Xtract -path can explore using multi-level object addresses, delimited by periods or slashes:

  xtract -path pathway.wikipathways.id -tab "\n" -element id

XML NAMESPACES

Namespace prefixes are indicated by a colon, and a leading colon matches any prefix:

  nquire -url "http://webservice.wikipathways.org" getPathway -pwId WP455 |
  xtract -pattern "ns1:getPathwayResponse" -decode ":gpml" |

The -decode command converts Base64-encoded data back to its original binary form. In this case, encoding was used to embed Graphical Pathway Markup Language text inside another XML object:

  xtract -pattern Pathway -block Xref \
    -if @Database -equals "Entrez Gene" \
      -tab "\n" -element @ID

JSON TO XML

Consolidated gene information for human beta-globin retrieved in JavaScript Object Notation format:

  nquire -get http://mygene.info/v3 gene 3043 |

contains a multi-dimensional JSON array of exon coordinates:

  "position": [
    [
      5225463,
      5225726
    ],
    [
      5226576,
      5226799
    ],
    [
      5226929,
      5227071
    ]
  ],

This can be converted to XML with xtract -j2x:

  xtract -j2x -set - -rec GeneRec -nest plural |

using "-nest plural" to derive a parent name that keeps the internal structure intact in the XML:

  <positions>
    <position>5225463</position>
    <position>5225726</position>
  </positions>
  ...

Individual exons can then be visited by piping the record through:

  xtract -pattern GeneRec -group exons \
    -block positions -pfc "\n" -element position

to print a tab-delimited table of start and stop positions:

  5225463    5225726
  5226576    5226799
  5226929    5227071

Similarly, xtract -a2x will convert Abstract Syntax Notation 1 (ASN.1) text files to XML.

TABLES TO XML

Tab-delimited data is easily converted to XML with xtract -t2x:

  nquire -ftp ftp.ncbi.nlm.nih.gov gene/DATA gene_info.gz |
  gunzip -c | grep -v NEWENTRY | cut -f 2,3 |
  xtract -t2x -set Set -rec Rec -skip 1 Code Name

This takes a series of command-line arguments with tag names for wrapping the individual columns, and skips the first line of input, which contains header information, to generate a new XML file:

  <Rec>
    <Code>1246500</Code>
    <Name>repA1</Name>
  </Rec>
  <Rec>
    <Code>1246501</Code>
    <Name>repA2</Name>
  </Rec>
  ...

Similarly, xtract -c2x will convert comma-separated values (CSV) files to XML.

GENBANK DOWNLOAD

Nquire can also produce FTP directory listings and save FTP files to a local disk:

  nquire -lst ftp.ncbi.nlm.nih.gov genbank |
  sort -V | grep "^gbvrl" | grep ".seq.gz" |
  tail -n 5 | skip-if-file-exists |
  nquire -dwn ftp.ncbi.nlm.nih.gov genbank

The latest GenPept incremental update file can be parsed into XML with xtract -g2x:

  nquire -ftp ftp.ncbi.nlm.nih.gov genbank daily-nc Last.File |
  sed "s/flat/gnp/g" |
  nquire -ftp ftp.ncbi.nlm.nih.gov genbank daily-nc |
  gunzip -c | xtract -g2x > latest_genpept.xml

Records can be filtered by organism name or taxon identifier, with their accessions saved to a file:

  cat latest_genpept.xml | xtract -insd INSDSeq_organism source db_xref |
  grep -w "taxon:2697049" | cut -f 1 > accn_subset.txt

which can then be used by xtract -select to retain matching records for further processing:

  xtract -input latest_genpept.xml -wrp INSDSet -pattern INSDSeq \
    -select INSDSeq_accession-version -in accn_subset.txt |

such as generating the sequences of individual mature peptides derived from polyproteins:

  xtract -insd mat_peptide product sub_sequence

INDEXED FIELDS

The einfo command can report the fields and links that are indexed for each database:

  einfo -db protein -fields

This will return a table of field abbreviations and names indexed for proteins:

  ACCN    Accession
  ALL     All Fields
  ASSM    Assembly
  AUTH    Author
  BRD     Breed
  CULT    Cultivar
  DIV     Division
  ECNO    EC/RN Number
  FILT    Filter
  FKEY    Feature key
  ...

LOCAL PUBMED CACHE

Fetching data from Entrez works well when a few thousand records are needed, but it does not scale for much larger sets of data, where the time it takes to download becomes a limiting factor. EDirect can now preload over 30 million live PubMed records onto an inexpensive external 500 GB solid state drive for rapid retrieval.

For example, PMID 12345678 would be stored (as a compressed XML file) at:

  /Archive/12/34/56/12345678.xml.gz

using a hierarchy of folders to organize the data for random access to any record.

Set an environment variable in your configuration file to reference your external drive:

  export EDIRECT_PUBMED_MASTER=/Volumes/external_disk_name_goes_here

and run:

  archive-pubmed

to download the PubMed release files and distribute each record on the drive. This process will take several hours to complete, but subsequent updates are incremental, and should finish in minutes.

The local archive is a completely self-contained turnkey system, with no need for the user to download and configure complicated third-party database software.

Retrieving a set containing over 120,000 PubMed records from the local archive:

  esearch -db pubmed -query "PNAS [JOUR]" -pub abstract |
  efetch -format uid | stream-pubmed | gunzip -c |

takes about 15 seconds. Retrieving those records from NCBI's network service, with efetch -format xml, would take around 40 minutes.

Even modest sets of PubMed query results can benefit from using the local cache. A reverse citation lookup on 191 papers:

  esearch -db pubmed -query "Cozzarelli NR [AUTH]" | elink -cited |

requires 7 seconds to match 7485 subsequent articles. Fetching them from the local archive:

  efetch -format uid | fetch-pubmed |

is practically instantaneous. Printing the names of all authors in those records:

  xtract -pattern PubmedArticle -block Author \
    -sep " " -tab "\n" -element LastName,Initials |

allows creation of a frequency table:

  sort-uniq-count-rank

that lists the authors who most often cited the original papers:

  113    Cozzarelli NR
  76     Maxwell A
  58     Wang JC
  52     Osheroff N
  52     Stasiak A
  ...

Fetching from the network service would extend the 7 second running time to over 2 minutes.

LOCAL SEARCH INDEX

A similar divide-and-conquer strategy is used to create a local information retrieval system suitable for large data mining queries. Run:

  index-pubmed

to populate retrieval index files from records stored in the local archive. This will also take a few hours. Since PubMed updates are released once per day, it may be most convenient to schedule reindexing to start in the late evening and run during the night.

For PubMed titles and primary abstracts, the indexing process deletes hyphens after specific prefixes, removes accents and diacritical marks, splits words at punctuation characters, corrects encoding artifacts, and spells out Greek letters for easier searching on scientific terms. It then prepares inverted indices with term positions, and uses them to build distributed term lists and postings files.

For example, the term list that includes "cancer" would be located at:

  /Postings/NORM/c/a/n/c/canc.trm

A query on cancer thus only needs to load a very small subset of the total index. This design combines efficient expression evaluation, unrestricted wildcard truncation, phrase queries, and proximity searches.

The phrase-search script provides access to the local search system.

Currently indexed field names are shown by:

  phrase-search -terms

The full set of indexed terms, without record counts, can be printed for any field:

  phrase-search -terms NORM

In local queries, a trailing asterisk is used to indicate term truncation:

  phrase-search -count "catabolite repress*"

Using -counts returns expanded terms and individual postings counts:

  phrase-search -counts "catabolite repress*"

Query evaluation includes Boolean operations and parenthetical expressions:

  phrase-search -query "(literacy AND numeracy) NOT (adolescent OR child)"

Adjacent words in the query are treated as a contiguous phrase:

  phrase-search -query "selective serotonin reuptake inhibit*"

More inclusive searches can use the Porter2 stemming algorithm:

  phrase-search -query "monoamine oxidase inhibitor [STEM]"

Each plus sign will replace a single word inside a phrase:

  phrase-search -query "vitamin c + + common cold"

Runs of tildes indicate the maximum distance between sequential phrases:

  phrase-search -query "vitamin c ~ ~ common cold"

MeSH identifier code, MeSH hierarchy key, and year of publication are also indexed:

  phrase-search -query "C14.907.617.812* [TREE] AND 2015:2019 [YEAR]"

An exact match can search for all or part of a title or abstract:

  phrase-search -exact "Genetic Control of Biochemical Reactions in Neurospora."

All query commands return a list of PMIDs, which can be piped directly to fetch-pubmed to retrieve the records. For example:

  phrase-search -query "selective serotonin ~ ~ ~ reuptake inhibitor*" |
  fetch-pubmed |
  xtract -pattern PubmedArticle -num Author |
  sort-uniq-count -n |
  reorder-columns 2 1 |
  head -n 25 |
  tee /dev/tty |
  xy-plot auth.png

performs a proximity search with dynamic wildcard expansion (matching phrases like "selective serotonin and norepinephrine reuptake inhibitors") and fetches 12,555 PubMed records from the local archive. It then counts the number of authors for each paper, printing a frequency table of the number of papers per number of coauthors:

  0    49
  1    1372
  2    1859
  3    1870
  ...

and creating a visual graph of the data. The entire set of commands runs in under 4 seconds.

The phrase-search and fetch-pubmed scripts are front-ends to the rchive program, which is used to build and search the inverted retrieval system. Rchive is multi-threaded for speed, and can match several PubMed titles per second, fetching the positional indices for all terms in parallel before evaluating the title words as a contiguous phrase.

RAPIDLY SCANNING PUBMED

If the expand-current script is run after archive-pubmed or index-pubmed, an ad hoc scan can be performed on the entire set of live PubMed records:

  cat $EDIRECT_PUBMED_MASTER/Current/*.xml |
  xtract -timer -pattern PubmedArticle -PMID MedlineCitation/PMID \
    -group AuthorList -if "#LastName" -eq 7 -element "&PMID" LastName

in this case finding 1,613,792 articles with seven authors. (Author count is not indexed by Entrez or EDirect. This query excludes consortia and additional named investigators.)

Xtract uses the Boyer-Moore-Horspool algorithm to partition an XML stream into individual records, distributing them among multiple instances of the data exploration and extraction function for concurrent execution. On a modern six-core computer with a fast solid state drive, it can process the full set of PubMed records in under 4 minutes.

IDENTIFIER CONVERSION

The index-pubmed script also downloads MeSH descriptor information from the NLM ftp server and generates a conversion file:

  ...
  <Rec>
    <Code>D064007</Code>
    <Name>Ataxia Telangiectasia Mutated Proteins</Name>
    <Tree>D08.811.913.696.620.682.700.097</Tree>
    <Tree>D12.776.157.687.125</Tree>
    <Tree>D12.776.660.720.125</Tree>
  </Rec>
  ...

that can be used for mapping MeSH codes to and from chemical or disease names. For example:

  cat $EDIRECT_PUBMED_MASTER/Data/meshconv.xml |
  xtract -pattern Rec \
    -if Name -starts-with "ataxia telangiectasia" \
      -element Code

will return:

  C565779
  C576887
  D001260
  D064007

The meshconv.xml file is prepared by use of the xtract -wrp command:

  cat desc2020.xml |
  xtract -wrp Set,Rec -pattern DescriptorRecord \
    -wrp Code -element DescriptorRecord/DescriptorUI \
    -wrp Name -first DescriptorName/String \
    -wrp Tree -element TreeNumberList/TreeNumber |
  xtract -format |
  xtract -wrp Set -pattern Rec -sort Code

which wraps element contents in new XML tags by issuing several other formatting commands:

  -pfx "<Tree>" -sep "</Tree><Tree>" -sfx "</Tree>"

and also ensures that data values containing encoded angle brackets, ampersands, quotation marks, or apostrophes, remain properly encoded within the new XML.

NATURAL LANGUAGE PROCESSING

Additional annotation on PubMed can be downloaded and indexed by running:

  index-extras

NCBI's Biomedical Text Mining Group performs computational analysis of PubMed and PMC papers, and extracts chemical, disease, and gene references from article contents (see PMID 31114887). Along with NLM Gene Reference Into Function mappings (see PMID 14728215), these terms are indexed in CHEM, DISZ, and GENE fields.

Recent research at Stanford University defined biological themes, supported by dependency paths, which are indexed in THME, PATH, and CONV fields. Theme keys in the Global Network of Biomedical Relationships are taken from a table in the paper (see PMID 29490008):

  A+    Agonism, activation                      N     Inhibits
  A-    Antagonism, blocking                     O     Transport, channels
  B     Binding, ligand                          Pa    Alleviates, reduces
  C     Inhibits cell growth                     Pr    Prevents, suppresses
  D     Drug targets                             Q     Production by cell population
  E     Affects expression/production            Rg    Regulation
  E+    Increases expression/production          Sa    Side effect/adverse event
  E-    Decreases expression/production          T     Treatment/therapy
  G     Promotes progression                     Te    Possible therapeutic effect
  H     Same protein or complex                  U     Causal mutations
  I     Signaling pathway                        Ud    Mutations affecting disease course
  J     Role in disease pathogenesis             V+    Activates, stimulates
  K     Metabolism, pharmacokinetics             W     Enhances response
  L     Improper regulation linked to disease    X     Overexpression in disease
  Md    Biomarkers (diagnostic)                  Y     Polymorphisms alter risk
  Mp    Biomarkers (progression)                 Z     Enzyme activity

Themes common to multiple chemical-disease-gene relationships are disambiguated so they can be queried individually. The expanded list, along with MeSH category codes, can be seen with:

  phrase-search -help

INTEGRATION WITH ENTREZ

The phrase-search -filter command allows PMIDs to be generated by an EDirect search and then incorporated as a component in a local query:

  esearch -db pubmed -query "complement system proteins [MESH]" |
  efetch -format uid |
  phrase-search -filter "L [THME] AND D10* [TREE]"

This finds PubMed papers about complement proteins and limits them by the "improper regulation linked to disease" theme and the lipids MeSH chemical category:

  448084
  1292783
  1379443
  1467432
  ...

Intermediate lists of PMIDs can be saved to a file and piped (with "cat") into a subsequent phrase-search -filter query. They can also be uploaded to the Entrez history server by piping to epost:

  epost -db pubmed

INSTALLATION

EDirect consists of a set of scripts and programs that are downloaded to the user's computer.

To install the EDirect software, open a terminal window and execute one of the following two commands:

  sh -c "$(curl -fsSL ftp://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/install-edirect.sh)"

  sh -c "$(wget -q ftp://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/install-edirect.sh -O -)"

or follow the detailed installation instructions in the EDirect web documentation.

This downloads several scripts into an "edirect" folder in the user's home directory. It then fetches any missing Perl modules, and installs platform-specific precompiled executables for xtract and rchive.

At the end of this process, the script will ask for permission to add EDirect to your PATH permanently by editing your configuration file. If you answer "y" it will add:

  export PATH=${PATH}:$HOME/edirect

to the end of your .bash_profile file. If you answer "n", you should then manually edit .bash_profile to add the edirect folder as one of the components of your existing PATH assignment statement.

To activate the pre-release version of EDirect, set the following environment variable:

  export USE_NEW_EDIRECT=true

DOCUMENTATION

Documentation for EDirect is on the web at:

  http://www.ncbi.nlm.nih.gov/books/NBK179288

EDirect navigation functions call the URL-based Entrez Programming Utilities:

  https://www.ncbi.nlm.nih.gov/books/NBK25501

NCBI database resources are described by:

  https://www.ncbi.nlm.nih.gov/pubmed/31602479

Information on how to obtain an API Key is described in this NCBI blogpost:

  https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities

Additional sample EDirect queries are available from:

  xtract -examples

Questions or comments on EDirect may be sent to info@ncbi.nlm.nih.gov.

This research was supported by the Intramural Research Program of the National Library of Medicine at the NIH.