-rw-r--r--  README            223
-rwxr-xr-x  edirect.pl         36
-rwxr-xr-x  index-extras        2
-rwxr-xr-x  phrase-search     135
-rwxr-xr-x  pm-index           21
-rwxr-xr-x  pm-promote          4
-rw-r--r--  rchive.go           3
-rwxr-xr-x  test-eutils        42
-rw-r--r--  tst-esummary.txt    1
-rw-r--r--  xtract.go          10
10 files changed, 426 insertions, 51 deletions
diff --git a/README b/README
index 8b94e8a..52a0964 100644
--- a/README
+++ b/README
@@ -1,10 +1,10 @@
-ENTREZ DIRECT: COMMAND LINE ACCESS TO NCBI ENTREZ DATABASES
+ENTREZ DIRECT REFERENCE: COMMAND LINE ACCESS TO NCBI ENTREZ DATABASES
Searching, retrieving, and parsing data from NCBI databases through the Unix command line.
INTRODUCTION
-Entrez Direct (EDirect) provides access to the NCBI's Entrez suite of interconnected databases from a Unix terminal window. Search terms are given in command-line arguments. Individual operations are connected with Unix pipes to allow construction of multi-step queries. Selected records can then be retrieved in a variety of formats.
+Entrez Direct (EDirect) provides access to the NCBI's suite of interconnected databases from a Unix terminal window. Search terms are given in command-line arguments. Individual operations are connected with Unix pipes to allow construction of multi-step queries. Selected records can then be retrieved in a variety of formats.
EDirect also includes an argument-driven utility that simplifies the extraction of results in structured XML or JSON format, and a program that builds a URL from command-line arguments for easy access to external CGI data services. These can eliminate the need for writing custom software to answer ad hoc questions.
@@ -95,7 +95,9 @@ The entire set of commands runs in 8 seconds. There is no need to write a script
STRUCTURED DATA EXTRACTION
-The xtract program uses command-line arguments to direct the conversion of XML data into a tab-delimited table. The -pattern argument divides the results into rows, while placement of data into columns is controlled by -element.
+The ability to obtain Entrez records in structured XML format, and to easily extract the underlying data, allows the user to ask novel questions that are not addressed by existing analysis software.
+
+The xtract program uses command-line arguments to direct the conversion of XML data into a tab-delimited table. The -pattern argument divides the results into rows, while placement of data into columns is controlled by -element. Explicit paths to objects are not needed.
Formatting arguments allow extensive customization of the output. The line break between -pattern objects can be changed with -ret, and the tab character between -element fields can be replaced by -tab. The -sep argument is used to distinguish multiple elements of the same type, and controls their separation independently of the -tab argument. The -sep value also applies to unrelated -element arguments that are grouped with commas. The following query:
@@ -110,28 +112,87 @@ returns a table with individual author names separated by vertical bars:
Selection arguments are specialized derivatives of -element. Among these are positional commands (-first and -last) and numeric processing operations (including -num, -len, -sum, -min, -max, and -avg). There are also functions that perform sequence coordinate conversion (-0-based, -1-based, and -ucsc-based).
+EXPLORATION OF XML SETS
+
+Exploration arguments (-pattern, -group, -block, and -subset) limit data extraction to specified regions of the XML, presenting all relevant objects one at a time. This design allows nested exploration of complex, hierarchical data to be controlled by a linear chain of command-line argument statements.
+
+Records retrieved in PubmedArticle XML format:
+
+ efetch -db pubmed -id 1413997 -format xml |
+
+have authors with separate fields for last name and initials:
+
+ <Author>
+ <LastName>Mortimer</LastName>
+ <Initials>RK</Initials>
+ </Author>
+
+Without being given any guidance about context, an -element statement on initials and last names:
+
+ xtract -pattern PubmedArticle -element Initials LastName
+
+will explore the current record for each argument in turn, and thus print all author initials followed by all author last names:
+
+ RK CR JS Mortimer Contopoulou King
+
+Inserting a -block command redirects data exploration to consider each author one at a time. The subsequent -element statement only sees the current author's values:
+
+ xtract -pattern PubmedArticle -block Author -element Initials LastName
+
+which restores the correct association of initials and last names:
+
+ RK Mortimer CR Contopoulou JS King
+
+Using a comma to combine the two arguments of -element into a group:
+
+ xtract -pattern PubmedArticle -block Author -sep " " -element Initials,LastName
+
+allows -sep to produce a more desirable formatting of author names:
+
+ RK Mortimer CR Contopoulou JS King
+
NESTED EXPLORATION
-Exploration arguments (-pattern, -group, -block, and -subset) limit data extraction to specified regions of the XML, visiting all relevant objects one at a time. This design allows nested exploration of complex, hierarchical data to be controlled by a linear chain of command-line argument statements.
+MeSH terms can have their own unique set of qualifiers, with a major topic attribute on each object:
-PubmedArticle XML contains the MeSH terms applied to a publication. Each MeSH term can have its own unique set of qualifiers. A single level of nested exploration within the current pattern:
+ <MeshHeading>
+ <DescriptorName MajorTopicYN="N">beta-Galactosidase</DescriptorName>
+ <QualifierName MajorTopicYN="Y">genetics</QualifierName>
+ <QualifierName MajorTopicYN="N">metabolism</QualifierName>
+ </MeshHeading>
- esearch -db gene -query "beta-carotene oxygenase 1" -organism human |
- elink -target pubmed | efilter -released last_year | efetch -format xml |
- xtract -pattern PubmedArticle -element MedlineCitation/PMID \
- -block MeshHeading \
- -pfc "\n" -sep "/" -element DescriptorName,QualifierName
+Since -element does its own exploration for printing object contents, a -block statement:
-retains the proper association of subheadings for each MeSH term:
+ -block MeshHeading -sep " / " -element DescriptorName,QualifierName
- 30396924
- Age Factors
- Animals
- Cell Cycle Proteins/deficiency/genetics/metabolism
- Cellular Senescence/physiology
+is sufficient for grouping each MeSH name with its qualifiers:
+
+ beta-Galactosidase / genetics / metabolism
+
+Visiting each MeSH term with a -block statement, and adding a -subset statement within the -block:
+
+ efetch -db pubmed -id 6162838 -format xml |
+ xtract -transform <( echo -e "Y\t*\n" ) -pattern PubmedArticle \
+ -element MedlineCitation/PMID \
+ -block MeshHeading -clr -plg "\n" -tab "" \
+ -translate DescriptorName@MajorTopicYN -element DescriptorName \
+ -subset QualifierName -plg " / " -tab "" \
+ -translate "@MajorTopicYN" -element QualifierName
+
+is necessary for keeping major topic attributes associated with their parent objects:
+
+ 6162838
+ Base Sequence
+ *DNA, Recombinant
+ Escherichia coli / genetics
...
+ RNA, Messenger / *genetics
+ Transcription, Genetic
+ beta-Galactosidase / *genetics / metabolism
+
+using a text translation function to convert the major topic "Y" value to an asterisk.
-A second level (-subset) would be needed to print major topic attributes next to their parent subheadings.
+(Note that "-element MedlineCitation/PMID" uses the parent-slash-child construct to prevent the display of additional PMID items that may occur later in CommentsCorrections objects.)
CONDITIONAL EXECUTION
@@ -184,7 +245,7 @@ The data hierarchy is easily explored using a -pattern {sequence} -group {featur
xtract -insd complete mat_peptide "%peptide" product peptide |
grep -i conotoxin | sort -t $'\t' -u -k 2,2n
-returns the accession, peptide length, name, and sequence for a sample of neurotoxic peptides:
+returns the accession, peptide length, product name, and sequence for a sample of neurotoxic peptides:
ADB43131.1 15 conotoxin Cal 1b LCCKRHHGCHPCGRT
ADB43128.1 16 conotoxin Cal 5.1 DPAPCCQHPIETCCRR
@@ -198,12 +259,12 @@ returns the accession, peptide length, name, and sequence for a sample of neurot
GENES IN A REGION
-Suppose a human disease gene has been mapped between two specific markers near the X chromosome centromere, and we want to find all possible candidates for the gene. Genes on the X chromosome can be retrieved with:
+To find known genes between two markers flanking the human X chromosome centromere, retrieve the chromosome record with:
esearch -db gene -query "Homo sapiens [ORGN] AND X [CHR]" |
efilter -status alive -type coding | efetch -format docsum |
-Gene names and chromosomal positions are extracted by piping those results to:
+Gene names and chromosomal positions are extracted by piping the record to:
xtract -pattern DocumentSummary -NME Name -DSC Description \
-block GenomicInfoType -if ChrLoc -equals X \
@@ -217,7 +278,7 @@ Results can now be sorted, filtered, and passed to the between-two-genes script:
grep -v pseudogene | grep -v uncharacterized |
between-two-genes AMER1 FAAH2
-to produce a table of known genes located between two markers:
+to produce a table of known genes located between the two markers:
FAAH2 fatty acid amide hydrolase 2
SPIN2A spindlin family member 2A
@@ -252,6 +313,57 @@ returns the set of all genes known to be involved in the pathway:
ADCYAP1 adenylate cyclase activating polypeptide 1
...
+GENE SEQUENCE
+
+Genes encoded on the minus strand of a sequence:
+
+ esearch -db gene -query "DDT [GENE] AND mouse [ORGN]" |
+ efetch -format docsum |
+ xtract -pattern GenomicInfoType -element ChrAccVer ChrStart ChrStop |
+
+have coordinates where the start position is greater than the stop:
+
+ NC_000076.6 75773373 75771232
+
+These can be read by a "while" loop:
+
+ while IFS=$'\t' read acn str stp
+ do
+ efetch -db nucleotide -format gb \
+ -id "$acn" -chr_start "$str" -chr_stop "$stp"
+ done
+
+to return the reverse-complemented subregion in GenBank format:
+
+ LOCUS NC_000076 2142 bp DNA linear CON 08-AUG-2019
+ DEFINITION Mus musculus strain C57BL/6J chromosome 10, GRCm38.p6 C57BL/6J.
+ ACCESSION NC_000076 REGION: complement(75771233..75773374)
+ VERSION NC_000076.6
+ ...
+ FEATURES Location/Qualifiers
+ source 1..2142
+ /organism="Mus musculus"
+ /mol_type="genomic DNA"
+ /strain="C57BL/6J"
+ /db_xref="taxon:10090"
+ /chromosome="10"
+ gene 1..2142
+ /gene="Ddt"
+ mRNA join(1..159,462..637,1869..2142)
+ /gene="Ddt"
+ /product="D-dopachrome tautomerase"
+ /transcript_id="NM_010027.1"
+ CDS join(52..159,462..637,1869..1941)
+ /gene="Ddt"
+ /codon_start=1
+ /product="D-dopachrome decarboxylase"
+ /protein_id="NP_034157.1"
+ /translation="MPFVELETNLPASRIPAGLENRLCAATATILDKPEDRVSVTIRP
+ GMTLLMNKSTEPCAHLLVSSIGVVGTAEQNRTHSASFFKFLTEELSLDQDRIVIRFFP
+ ...
+
+The reverse-complement of a plus-strand sequence range can be selected with the efetch -revcomp flag.
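
A minimal sketch of how the -revcomp shortcut might be used. The wrapper name fetch_revcomp is hypothetical and not part of EDirect; the efetch flags are the ones listed in the Sequence Range help text changed by this commit. The coordinates in the example call are the 1-based region reported in the GenBank record above and serve only as an illustration.

```shell
# Hypothetical wrapper (not part of EDirect): fetch the reverse complement
# of a range given in ascending, 1-based coordinates.
fetch_revcomp() {
  # $1 = accession.version, $2 = first position, $3 = last position
  efetch -db nucleotide -format gb \
    -id "$1" -seq_start "$2" -seq_stop "$3" -revcomp
}

# Example call (requires network access and an EDirect installation):
#   fetch_revcomp NC_000076.6 75771233 75773374
```

According to the edirect.pl change in this commit, -revcomp only sets the strand to 2, so "-strand 2" or "-strand minus" should select the same sequence.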
+
RECURSIVE DEFINITIONS
When a recursive object is given to an exploration command:
@@ -265,7 +377,7 @@ selection by -element only examines fields in the outermost objects:
7227 Drosophila melanogaster
10090 Mus musculus
-The star-slash prefix will descend a single level into the hierarchy:
+The star-slash-child construct will descend a single level into the hierarchy:
efetch -db taxonomy -id 9606,7227,10090 -format xml |
xtract -pattern Taxon -block "*/Taxon" \
@@ -282,7 +394,7 @@ to print data on the individual lineage objects:
40674 class Mammalia
...
-Recursive objects can be fully explored with a double-star-slash prefix:
+Recursive objects can be fully explored with a double-star-slash-child construct:
esearch -db gene -query "rbcL [GENE] AND maize [ORGN]" |
efetch -format xml |
@@ -325,7 +437,7 @@ returns a heterogeneous mixture of objects in the pathway section:
</wikipathways>
</pathway>
-The slash-star suffix is used to visit the individual components of a parent object without needing to explicitly specify their names. For printing, the name of a child object is indicated by a question mark:
+The parent-slash-star construct is used to visit the individual components of a parent object without needing to explicitly specify their names. For printing, the name of a child object is indicated by a question mark:
xtract -pattern GeneRec -group "pathway/*" \
-pfc "\n" -element "?,name,id"
@@ -338,11 +450,32 @@ This displays a table of pathway database references:
reactome Diseases of signal transduction R-HSA-5663202
wikipathways GPCRs, Class A Rhodopsin-like WP455
+INDEXED FIELDS
+
+Entrez can report the fields and links that are indexed for each database. For example:
+
+ einfo -db protein -fields
+
+will return a table of field abbreviations and names indexed for proteins:
+
+ ACCN Accession
+ ALL All Fields
+ ASSM Assembly
+ AUTH Author
+ BRD Breed
+ CULT Cultivar
+ DIV Division
+ ECNO EC/RN Number
+ FILT Filter
+ FKEY Feature key
+ GENE Gene Name
+ ...
+
LOCAL PUBMED CACHE
Fetching data from Entrez works well when a few thousand records are needed, but it does not scale for much larger sets of data, where the time it takes to download becomes a limiting factor. EDirect can now preload all 30 million PubMed records onto an inexpensive external 500 GB solid state drive, using a hierarchy of folders to organize the data for rapid retrieval of any record. For example, PMID 12345678 would be stored (as a compressed XML file) at /Archive/12/34/56/12345678.xml.gz.
-Reference the external drive by setting an environment variable in your configuration file:
+Set an environment variable in your configuration file to reference the external drive:
export EDIRECT_PUBMED_MASTER=/Volumes/your_disk_name
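
The folder hierarchy described above can be sketched as a small shell function. This is an illustration only: the function name archive_path is hypothetical, the Archive folder is assumed to sit directly under EDIRECT_PUBMED_MASTER, and only 8-digit PMIDs are handled because the README gives no example for shorter identifiers.

```shell
# Hypothetical helper (not part of EDirect): derive the archive location
# of an 8-digit PMID from its first three pairs of digits.
archive_path() {
  id="$1"
  case "$id" in
    [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]) ;;
    *) echo "expected an 8-digit PMID" >&2; return 1 ;;
  esac
  a=${id%??????}       # digits 1-2
  rest=${id#??}
  b=${rest%????}       # digits 3-4
  rest=${rest#??}
  c=${rest%??}         # digits 5-6
  # Assumption: the Archive folder is a child of EDIRECT_PUBMED_MASTER
  printf '%s/Archive/%s/%s/%s/%s.xml.gz\n' \
    "${EDIRECT_PUBMED_MASTER:-}" "$a" "$b" "$c" "$id"
}
```

Splitting the identifier into two-digit folder names keeps each directory small, which is presumably what makes retrieval of any single record fast.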
@@ -359,7 +492,7 @@ Even moderately large queries can benefit from the local cache. A reverse citati
esearch -db pubmed -query "Cozzarelli NR [AUTH]" |
elink -cited |
-takes 7 seconds to match 7134 subsequent articles. Fetching them from the local archive:
+takes 5 seconds to match 7134 subsequent articles. Fetching them from the local archive:
efetch -format uid |
fetch-pubmed |
@@ -382,7 +515,7 @@ that lists the authors who most often cited the original papers:
48 Stasiak A
...
-Using the network service instead of the local cache would add 2 minutes to the 10 second running time.
+Using the network service instead of the local cache would extend the 7 second running time by 2 minutes.
LOCAL SEARCH INDEX
@@ -547,7 +680,7 @@ Intermediate lists of PMIDs can be saved to a file and piped (with "cat") into a
AUTOMATION AND COMPREHENSIVE EXPLORATION
-The phrase-search system can be easily automated. For example, a simple script can walk up the MeSH hierarchy:
+The phrase-search system is easy to automate. For example, a small function can walk up the MeSH hierarchy:
ascend_mesh_tree() {
var="${1%\*}"
@@ -603,7 +736,7 @@ producing a table of co-occurrence counts:
A+ Ec 10364
...
-Shrinking arrays are used to avoid unnecessary searches, e.g., querying both "A+ AND Ec" and "Ec AND A+", though each result is reported in both directions.
+Shrinking arrays are used to avoid unnecessary equivalent searches, e.g., querying both "A+ AND Ec" and "Ec AND A+", though each result is reported in both directions.
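
The shrinking-list idea can be sketched without bash arrays by consuming the positional parameters. The function name unordered_pairs is hypothetical, and it only prints the pairs; a real script would run a phrase-search query for each pair and report the count in both directions, as the README example does.

```shell
# Hypothetical helper (not part of EDirect): print every unordered pair
# of its arguments exactly once.
unordered_pairs() {
  while [ "$#" -gt 1 ]
  do
    fst="$1"
    shift                      # the remaining list shrinks on every pass
    for scd in "$@"
    do
      printf '%s\t%s\n' "$fst" "$scd"
    done
  done
}

unordered_pairs A+ A- Bc
```

For n themes this visits n(n-1)/2 pairs instead of n(n-1), which is the saving the shrinking arrays are meant to provide.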
IDENTIFIER CONVERSION
@@ -633,19 +766,27 @@ will return:
D001260
D064007
+The meshconv.xml file is prepared by use of the xtract -wrp command, which wraps element contents in new XML tags:
+
+ cat desc2020.xml |
+ xtract -wrp Set,Rec -pattern DescriptorRecord \
+ -wrp Code -element "DescriptorRecord/DescriptorUI" \
+ -wrp Name -first "DescriptorName/String" \
+ -wrp Tree -element "TreeNumberList/TreeNumber" |
+ xtract -format |
+ xtract -wrp Set -pattern Rec -sort Code
+
RAPIDLY SCANNING ALL OF PUBMED
If the expand-current script is run after PubMed indexing, an ad hoc scan can be performed on the entire set of live PubMed records:
- cat "$EDIRECT_PUBMED_MASTER"/Current/*.xml |
+ cat $EDIRECT_PUBMED_MASTER/Current/*.xml |
xtract -timer -pattern PubmedArticle \
-if "#Author" -eq 7 \
-element MedlineCitation/PMID LastName
in this case finding articles with seven authors. (Author count is not indexed by Entrez or locally by EDirect.)
-(Note that the data produced by running both index-extras and expand-current may not fit on a 500 GB drive.)
-
IMPLEMENTATION DETAILS
Xtract uses the Boyer-Moore-Horspool algorithm to partition an XML stream into individual records, sending them down a thread-safe communication channel to be distributed among multiple instances of the data exploration and extraction function. On a modern six-core computer, it can process the previous query on all 30 million PubMed records in just under 4 minutes, a sustained rate of over 125,000 records per second.
@@ -668,6 +809,10 @@ The experimental xplore script expands the EDirect paradigm to navigate connecti
runs in 18 seconds and returns 1030 chemicals that might act on gene products in two pathways, and would thus be candidates for treating hereditary hemochromatosis or hypertrophic cardiomyopathy. There is initial support in xplore -search for -organism and -action shortcuts, similar to what is available in efilter.
+As part of this development, xtract gained a -path exploration argument and support for multi-level exploration paths:
+
+ -path pathway.wikipathways.id -tab "\n" -element id
+
CONVERSION OF JSON TO XML
Consolidated gene information retrieved in JSON format:
@@ -705,8 +850,8 @@ using -nest plural to derive a parent name that keeps the internal structure int
Individual exons can then be visited by piping the record through:
- xtract -pattern GeneRec -group exons -block positions \
- -pfc "\n" -element position
+ xtract -pattern GeneRec -group exons \
+ -block positions -pfc "\n" -element position
to print a tab-delimited table of start and stop positions:
@@ -716,13 +861,13 @@ to print a tab-delimited table of start and stop positions:
CONVERSION OF TABLES TO XML
-Tab-delimited data can be converted to XML with xtract -t2x:
+Tab-delimited data is easily converted to XML with xtract -t2x:
nquire -ftp ftp.ncbi.nlm.nih.gov gene/DATA gene_info.gz |
gunzip -c | grep -v NEWENTRY | cut -f 2,3 |
- xtract -t2x -set Set -rec Rec -skip 1 -indent Code Name
+ xtract -t2x -set Set -rec Rec -skip 1 Code Name
-which takes command-line arguments of XML tag names for wrapping the entire set, each record, and individual columns:
+which takes a series of command-line arguments with XML tag names for wrapping the individual columns:
<Set>
<Rec>
@@ -771,6 +916,10 @@ Documentation for EDirect is on the web at:
http://www.ncbi.nlm.nih.gov/books/NBK179288
+EDirect navigation functions call the Entrez Programming Utilities:
+
+ https://www.ncbi.nlm.nih.gov/books/NBK25501
+
NCBI database resources are described by:
https://www.ncbi.nlm.nih.gov/pubmed/31602479
diff --git a/edirect.pl b/edirect.pl
index 8214742..b084c91 100755
--- a/edirect.pl
+++ b/edirect.pl
@@ -183,6 +183,7 @@ sub clearflags {
$raw = false;
$related = false;
$result = 0;
+ $revcomp = false;
$rldate = 0;
$seq_start = 0;
$seq_stop = 0;
@@ -2155,7 +2156,7 @@ sub esmry {
}
if (! $raw) {
- if ($data !~ /<Id>\d+<\/Id>/i) {
+ if ($data !~ /<Id>\d+<\/Id>/) {
$data =~ s/<DocumentSummary uid=\"(\d+)\">/<DocumentSummary><Id>$1<\/Id>/g;
}
if ( $dbase eq "gtr" ) {
@@ -2296,7 +2297,7 @@ sub esmry {
}
if (! $raw) {
- if ($data !~ /<Id>\d+<\/Id>/i) {
+ if ($data !~ /<Id>\d+<\/Id>/) {
$data =~ s/<DocumentSummary uid=\"(\d+)\">/<DocumentSummary><Id>$1<\/Id>/g;
}
if ( $dbase eq "gtr" ) {
@@ -2342,7 +2343,8 @@ Sequence Range
-seq_start First sequence position to retrieve
-seq_stop Last sequence position to retrieve
- -strand Strand of DNA to retrieve
+ -strand 1 = forward DNA strand, 2 = reverse complement
+ -revcomp Shortcut for strand 2
Gene Range
@@ -2676,6 +2678,7 @@ sub eftch {
"seq_start=i" => \$seq_start,
"seq_stop=i" => \$seq_stop,
"strand=s" => \$strand,
+ "revcomp" => \$revcomp,
"complexity=i" => \$complexity,
"chr_start=i" => \$chr_start,
"chr_stop=i" => \$chr_stop,
@@ -2788,10 +2791,13 @@ sub eftch {
$email = $emaddr;
}
+ if ( $revcomp ) {
+ $strand = "2";
+ }
if ( $strand eq "plus" or $strand eq "+" ) {
$strand = "1";
}
- if ( $strand eq "minus" or $strand eq "-" ) {
+ if ( $strand eq "minus" or $strand eq "-" or $strand eq "revcomp" ) {
$strand = "2";
}
@@ -3587,6 +3593,7 @@ sub einfo {
my $menu = "";
if ( $fields ) {
+ my @unsorted = ();
my @flds = ($output =~ /<Field>(.+?)<\/Field>/g);
foreach $fld (@flds) {
$name = "";
@@ -3598,12 +3605,17 @@ sub einfo {
$full = $1;
}
if ( $name ne "" and $full ne "" ) {
- print "$name\t$full\n";
+ push (@unsorted, "$name\t$full\n");
}
}
+ my @sorted = sort { "\U$a" cmp "\U$b" } @unsorted;
+ foreach $itm (@sorted) {
+ print "$itm";
+ }
}
if ( $links ) {
+ my @unsorted = ();
my @lnks = ($output =~ /<Link>(.+?)<\/Link>/g);
foreach $lnk (@lnks) {
$name = "";
@@ -3615,9 +3627,13 @@ sub einfo {
$menu = $1;
}
if ( $name ne "" and $menu ne "" ) {
- print "$name\t$menu\n";
+ push (@unsorted, "$name\t$menu\n");
}
}
+ my @sorted = sort { "\U$a" cmp "\U$b" } @unsorted;
+ foreach $itm (@sorted) {
+ print "$itm";
+ }
}
return;
@@ -5827,7 +5843,7 @@ sub ftcp {
$ftp->cwd($dir) or die "Unable to change to $dir: ", $ftp->message;
$ftp->binary or warn "Unable to set binary mode: ", $ftp->message;
- if ($max > 1) {
+ if ($max > 2) {
# file names on command line
for ( $i = 2; $i < $max; $i++) {
my $fl = $args[$i];
@@ -5839,9 +5855,11 @@ sub ftcp {
}
}
}
+ } elsif ( -t STDIN ) {
+ print STDERR "\nNO INPUT PIPED FROM STDIN\n\n";
} else {
# read file names from stdin
- while (<> ) {
+ while ( <STDIN> ) {
chomp;
$_ =~ s/\r$//;
print "$_\n";
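
For readers more familiar with shell than Perl, the stdin guard added in this hunk can be approximated as follows. This is a rough rendering under stated assumptions, not the code used by edirect.pl: the function name read_file_names is hypothetical, and "[ -t 0 ]" stands in for Perl's "-t STDIN" test.

```shell
# Hypothetical shell rendering of the stdin branch (not part of EDirect).
read_file_names() {
  if [ -t 0 ]
  then
    # stdin is a terminal, so nothing was piped in
    echo "NO INPUT PIPED FROM STDIN" >&2
    return 1
  fi
  cr=$(printf '\r')
  while IFS= read -r line || [ -n "$line" ]
  do
    line=${line%"$cr"}         # drop a trailing carriage return, if any
    printf '%s\n' "$line"
  done
}
```

Checking for a terminal first avoids the situation where the command silently waits for keyboard input when no file names were supplied.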
@@ -6105,7 +6123,7 @@ sub tmut {
$data =~ s/> +</></g;
# move UID from attribute to object
- if ($data !~ /<Id>\d+<\/Id>/i) {
+ if ($data !~ /<Id>\d+<\/Id>/) {
$data =~ s/<DocumentSummary uid=\"(\d+)\">/<DocumentSummary><Id>$1<\/Id>/g;
}
$data =~ s/<DocumentSummary uid=\"\d+\">/<DocumentSummary>/g;
diff --git a/index-extras b/index-extras
index c241b80..25213bf 100755
--- a/index-extras
+++ b/index-extras
@@ -261,7 +261,7 @@ PST=$seconds
seconds_start=$(date "+%s")
cd "$WORKING/Indexed"
# echo "Populating Link Archive"
-# rchive -distribute "$MASTER/Archive" *.e2x
+# rchive -distribute "$MASTER/Archive" *.e2x.gz
seconds_end=$(date "+%s")
seconds=$((seconds_end - seconds_start))
# echo "$seconds seconds"
diff --git a/phrase-search b/phrase-search
index 03a8ff8..f4f2021 100755
--- a/phrase-search
+++ b/phrase-search
@@ -115,6 +115,8 @@ EXAMPLES:
phrase-search -exact "Genetic Control of Biochemical Reactions in Neurospora."
+AUTOMATION:
+
ascend_mesh_tree() {
var="${1%\*}"
while :
@@ -129,6 +131,139 @@ EXAMPLES:
ascend_mesh_tree "C14.907.617.812"
+ declare -a THEMES
+ THEMES=( A+ A- Bc Bg C D Ec Ec+ Ec- Eg \\
+ Eg+ G H I Jc Jg K L Md Mp N O Pa \\
+ Pr Q Rg Sa T Te U Ud V+ W X Y Z )
+ declare -a REMAINS
+ REMAINS=("${THEMES[@]:1}")
+
+ for fst in ${THEMES[@]}
+ do
+ num=$(phrase-search -query "$fst [THME]" | wc -l)
+ echo -e "$fst\t \t$num"
+ for scd in ${REMAINS[@]}
+ do
+ num=$(phrase-search -query "$fst [THME] AND $scd [THME]" | wc -l)
+ echo -e "$fst\t$scd\t$num"
+ echo -e "$scd\t$fst\t$num"
+ done
+ REMAINS=("${REMAINS[@]:1}")
+ done | sort | expand -t 7,13
+
+ENTREZ INTEGRATION
+
+ esearch -db pubmed -query "complement system proteins [MESH]" -pub clinical |
+ efetch -format uid |
+ phrase-search -filter "L [THME] AND D03* [TREE]"
+
+MESH DISEASES
+
+ C01 – bacterial infections and mycoses
+ C02 – virus diseases
+ C03 – parasitic diseases
+ C04 – neoplasms
+ C05 – musculoskeletal diseases
+ C06 – digestive system diseases
+ C07 – stomatognathic diseases
+ C08 – respiratory tract diseases
+ C09 – otorhinolaryngologic diseases
+ C10 – nervous system diseases
+ C11 – eye diseases
+ C12 – male urogenital diseases
+ C13 – female urogenital diseases and pregnancy complications
+ C14 – cardiovascular diseases
+ C15 – hemic and lymphatic diseases
+ C16 – congenital, hereditary, and neonatal diseases and abnormalities
+ C17 – skin and connective tissue diseases
+ C18 – nutritional and metabolic diseases
+ C19 – endocrine system diseases
+ C20 – immune system diseases
+ C21 – disorders of environmental origin
+ C22 – animal diseases
+ C23 – pathological conditions, signs and symptoms
+ C24 - occupational diseases
+ C25 - chemically-induced disorders
+ C26 - wounds and injuries
+
+MESH CHEMICALS AND DRUGS
+
+ D01 – inorganic chemicals
+ D02 – organic chemicals
+ D03 – heterocyclic compounds
+ D04 – polycyclic compounds
+ D05 – macromolecular substances
+ D06 – hormones, hormone substitutes, and hormone antagonists
+ D08 – enzymes and coenzymes
+ D09 – carbohydrates
+ D10 – lipids
+ D12 – amino acids, peptides, and proteins
+ D13 – nucleic acids, nucleotides, and nucleosides
+ D20 – complex mixtures
+ D23 – biological factors
+ D25 – biomedical and dental materials
+ D26 – pharmaceutical preparations
+ D27 – chemical actions and uses
+
+THEME CODES:
+
+Chemical-Gene
+
+ A+ agonism, activation
+ A- antagonism, blocking
+ Bc binding, ligand (especially receptors)
+ Ec+ increases expression/production
+ Ec- decreases expression/production
+ Ec affects expression/production (neutral)
+ N inhibits
+
+Gene-Chemical
+
+ O transport, channels
+ K metabolism, pharmacokinetics
+ Z enzyme activity
+
+Chemical-Disease
+
+ T treatment/therapy (including investigatory)
+ C inhibits cell growth (especially cancers)
+ Sa side effect/adverse event
+ Pr prevents, suppresses
+ Pa alleviates, reduces
+ Jc role in disease pathogenesis
+
+Disease-Chemical
+
+ Mp biomarkers (of disease progression)
+
+Gene-Disease
+
+ U causal mutations
+ Ud mutations affecting disease course
+ D drug targets
+ Jg role in pathogenesis
+ Te possible therapeutic effect
+ Y polymorphisms alter risk
+ G promotes progression
+
+Disease-Gene
+
+ Md biomarkers (diagnostic)
+ X overexpression in disease
+ L improper regulation linked to disease
+
+Gene-Gene
+
+ Bg binding, ligand (especially receptors)
+ W enhances response
+ V+ activates, stimulates
+ Eg+ increases expression/production
+ Eg affects expression/production (neutral)
+ I signaling pathway
+ H same protein or complex
+ Rg regulation
+ Q production by cell population
+
EOF
exit
fi
diff --git a/pm-index b/pm-index
index 3255da9..70b6db1 100755
--- a/pm-index
+++ b/pm-index
@@ -45,3 +45,24 @@ do
echo "$seconds seconds"
sleep 1
done
+
+for fl in *.xml
+do
+ base=${fl%.xml}
+ echo "$base.e2x"
+ seconds_start=$(date "+%s")
+ if [ -s "$data/meshtree.txt" ]
+ then
+ cat "$fl" |
+ xtract -transform "$data/meshtree.txt" -e2index |
+ gzip -1 > "$target/$base.e2x.gz"
+ else
+ cat "$fl" |
+ xtract -e2index |
+ gzip -1 > "$target/$base.e2x.gz"
+ fi
+ seconds_end=$(date "+%s")
+ seconds=$((seconds_end - seconds_start))
+ echo "$seconds seconds"
+ sleep 1
+done
diff --git a/pm-promote b/pm-promote
index 014ee9e..c98ead9 100755
--- a/pm-promote
+++ b/pm-promote
@@ -20,6 +20,7 @@ target=${target%/}
for fld in NORM STEM YEAR CODE TREE
do
+ seconds_start=$(date "+%s")
echo "$fld"
find "." -name "*.mrg.gz" |
sort |
@@ -28,4 +29,7 @@ do
do
rchive -promote "$target" "$fld" $files
done
+ seconds_end=$(date "+%s")
+ seconds=$((seconds_end - seconds_start))
+ echo "($fld $seconds seconds)"
done
diff --git a/rchive.go b/rchive.go
index e3d59cc..035177f 100644
--- a/rchive.go
+++ b/rchive.go
@@ -1012,7 +1012,8 @@ func MakePostingsTrie(str string, arry [516]rune) string {
// trieLen directory depth parameters are based on the observed size distribution of PubMed indices
var trieLen = map[string]int{
- "20": 3,
+ "19": 4,
+ "20": 4,
"ac": 4,
"af": 4,
"an": 4,
diff --git a/test-eutils b/test-eutils
index 36033f7..fe643e6 100755
--- a/test-eutils
+++ b/test-eutils
@@ -391,6 +391,48 @@ DoSummary() {
DoTime
done
done < "$dir/tst-esummary.txt"
+
+ # special tests for dbVar summary, since IDs are reconstructed weekly
+ DoStart
+ res=$(
+ esearch -db dbvar -query '"study" [OT] AND "case set" [STYPE]' |
+ efetch -format docsum -start 1 -stop 1
+ )
+ DoStop
+ tst=$(
+ echo "$res" | xtract -pattern DocumentSummary -element Study_type
+ )
+ if [ "$tst" != "Case-Set" ]
+ then
+ fails=$(echo "esearch -db dbvar -query \"study AND case set\"")
+ MarkFailure "$fails" "$res"
+ printf "x"
+ else
+ printf "."
+ fi
+ DoTime
+
+ DoStart
+ res=$(
+ esearch -db dbvar -query 'pathogenic [CLIN] AND germline [ALLELE_ORIGIN] AND \
+ "nstd102" [ACC] AND brca1 [GENE_NAME] AND \
+ "copy number variation" [VT] AND "variant" [OT]' |
+ efetch -format docsum -start 1 -stop 1
+ )
+ DoStop
+ tst=$(
+ echo "$res" | xtract -pattern DocumentSummary -element dbVarGene/name
+ )
+ if [ "$tst" != "BRCA1" ]
+ then
+ fails=$(echo "esearch -db dbvar -query \"nstd102 AND brca1\"")
+ MarkFailure "$fails" "$res"
+ printf "x"
+ else
+ printf "."
+ fi
+ DoTime
+
printf "\n"
}
diff --git a/tst-esummary.txt b/tst-esummary.txt
index b5de796..eb00e25 100644
--- a/tst-esummary.txt
+++ b/tst-esummary.txt
@@ -8,7 +8,6 @@ blastdbinfo 998664
books 1371014
cdd 274590
clinvar 10510
-dbvar 6173073
gap 872875
gapplus 136686
gds 200022309
diff --git a/xtract.go b/xtract.go
index 1fa512c..847f2e8 100644
--- a/xtract.go
+++ b/xtract.go
@@ -299,7 +299,7 @@ Data Conversion
[-rec recordWrapper]
[-skip linesToSkip]
[-lower | -upper]
- [-indent]
+ [-indent | -flush]
XML object names per column
Documentation
@@ -2638,6 +2638,9 @@ func ParseArguments(cmdargs []string, pttrn string) *Block {
fmt.Fprintf(os.Stderr, "\nERROR: Unexpected position for %s command\n", txt)
os.Exit(1)
} else if txt == "-clr" {
+ // main loop runs out after trailing -clr, add another so this one will be executed
+ arguments = append(arguments, "-clr")
+ max++
} else if max < 2 || arguments[max-2] != "-lbl" {
fmt.Fprintf(os.Stderr, "\nERROR: Item missing after %s command\n", txt)
os.Exit(1)
@@ -8474,7 +8477,7 @@ func TableConverter(inp io.Reader, args []string) int {
skip := 0
lower := false
upper := false
- indent := false
+ indent := true
var fields []string
numFlds := 0
@@ -8529,6 +8532,9 @@ func TableConverter(inp io.Reader, args []string) int {
case "-indent":
indent = true
args = args[1:]
+ case "-flush":
+ indent = false
+ args = args[1:]
default:
// remaining arguments are names for columns
fields = append(fields, str)