diff options
author | Nilesh Patra <nilesh@debian.org> | 2021-08-15 15:17:57 +0200 |
---|---|---|
committer | Nilesh Patra <nilesh@debian.org> | 2021-08-15 15:17:57 +0200 |
commit | c814969eef5135efcbd686f55b0c7e6ea99af695 (patch) | |
tree | 2c5fdf3cdbb43c3dd0fc5575670cec02d87d3b04 |
Import python-pauvre_0.2.3.orig.tar.gz
[dgit import orig python-pauvre_0.2.3.orig.tar.gz]
34 files changed, 6053 insertions, 0 deletions
diff --git a/MANIFEST.in b/MANIFEST.in new file mode 100644 index 0000000..85feea3 --- /dev/null +++ b/MANIFEST.in @@ -0,0 +1 @@ +include scripts/test.sh diff --git a/PKG-INFO b/PKG-INFO new file mode 100644 index 0000000..b709225 --- /dev/null +++ b/PKG-INFO @@ -0,0 +1,25 @@ +Metadata-Version: 1.2 +Name: pauvre +Version: 0.2.3 +Summary: Tools for plotting Oxford Nanopore and other long-read data. +Home-page: https://github.com/conchoecia/pauvre +Author: Darrin Schultz +Author-email: dts@ucsc.edu +License: UNKNOWN +Description: + 'pauvre' is a package for plotting Oxford Nanopore and other long read data. + The name means 'poor' in French, a play on words to the oft-used 'pore' prefix + for similar packages. This package was designed for python 3, but it might work in + python 2. You can visit the gitub page for more detailed information here: + https://github.com/conchoecia/pauvre + +Platform: UNKNOWN +Classifier: Development Status :: 2 - Pre-Alpha +Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3) +Classifier: Programming Language :: Python :: 3 +Classifier: Programming Language :: Python :: 3.6 +Classifier: Programming Language :: Python :: 3.7 +Classifier: Operating System :: POSIX :: Linux +Classifier: Topic :: Scientific/Engineering :: Bio-Informatics +Classifier: Intended Audience :: Science/Research +Requires-Python: >=3 diff --git a/README.md b/README.md new file mode 100644 index 0000000..fa14729 --- /dev/null +++ b/README.md @@ -0,0 +1,146 @@ +[![travis-ci](https://travis-ci.org/conchoecia/pauvre.svg?branch=master)](https://travis-ci.org/conchoecia/pauvre) [![DOI](https://zenodo.org/badge/112774670.svg)](https://zenodo.org/badge/latestdoi/112774670) + +## <a name="started"></a>Getting Started + +``` +pauvre custommargin -i custom.tsv --ycol length --xcol qual # Custom tsv input +``` + +## Table of Contents + +- [Getting Started](#started) +- [Users' Guide](#uguide) +- [Installation](#installation) + - [Requirements](#reqs) + - [Install Instructions](#install) +- [Usage](#usage) + - [pauvre stats](#stats) + - [pauvre marginplot](#marginplot) + - [Basic Usage](#marginbasic) + - [Plot Adjustments](#marginadjustments) + - [Specialized Options](#marginspecialized) +- [Contributors](#contributors) + +## <a name="uguide"></a>Users' Guide + +Pauvre is a plotting package originally designed to help QC the length and +quality distribution of Oxford Nanopore or PacBio reads. The main outputs +are marginplots. Now, `pauvre` also hosts other additional data plotting +scripts. + +This package currently hosts five scripts for plotting and/or printing stats. + +- `pauvre marginplot` + - takes a fastq file as input and outputs a marginal histogram with a heatmap. +- `pauvre custommargin` + - takes a tsv as input and outputs a marginal histogram with custom columns of your choice. +- `pauvre stats` + - Takes a fastq file as input and prints out a table of stats, including how many basepairs/reads there are for a length/mean quality cutoff. + - This is also automagically called when using `pauvre marginplot` +- `pauvre redwood` + - I am happy to introduce the redwood plot to the world as a method + of representing circular genomes. A redwood plot contains long + reads as "rings" on the inside, a gene annotation + "cambrium/phloem", and a RNAseq "bark". The input is `.bam` files + for the long reads and RNAseq data, and a `.gff` file for the + annotation. More details to follow as we document this program + better... +- `pauvre synteny` + - Makes a synteny plot of circular genomes. Finds the most + parsimonius rotation to display the synteny of all the input + genomes with the fewest crossings-over. Input is one `.gff` file + per circular genome and one directory of gene alignments. + + +## <a name="installation"></a>Installation + +### <a name="reqs"></a>Requirements + +- You must have the following installed on your system to install this software: + - python 3.x + - matplotlib + - biopython + - pandas + - pillow + +### <a name="install">Install Instructions +- Instructions to install on your mac or linux system. Not sure on + Windows! Make sure *python 3* is the active environment before + installing. + - `git clone https://github.com/conchoecia/pauvre.git` + - `cd ./pauvre` + - `pip3 install .` +- Or, install with pip + - `pip3 install pauvre` + +## <a name="usage"><a/>Usage + +### <a name="stats"></a>`stats` + - generate basic statistics about the fastq file. For example, if I + want to know the number of bases and reads with AT LEAST a PHRED + score of 5 and AT LEAST a read length of 500, run the program as below + and look at the cells highlighted with `<braces>`. + - `pauvre stats --fastq miniDSMN15.fastq` + + +``` +numReads: 1000 +numBasepairs: 1029114 +meanLen: 1029.114 +medianLen: 875.5 +minLen: 11 +maxLen: 5337 +N50: 1278 +L50: 296 + + Basepairs >= bin by mean PHRED and length +minLen Q0 Q5 Q10 Q15 Q17.5 Q20 Q21.5 Q25 Q25.5 Q30 + 0 1029114 1010681 935366 429279 143948 25139 3668 2938 2000 0 + 500 984212 <968653> 904787 421307 142003 24417 3668 2938 2000 0 + 1000 659842 649319 616788 300948 103122 17251 2000 2000 2000 0 + et cetera... + Number of reads >= bin by mean Phred+Len +minLen Q0 Q5 Q10 Q15 Q17.5 Q20 Q21.5 Q25 Q25.5 Q30 + 0 1000 969 865 366 118 22 3 2 1 0 + 500 873 <859> 789 347 113 20 3 2 1 0 + 1000 424 418 396 187 62 11 1 1 1 0 + et cetera... +``` + +### <a name="marginplot"></a>`marginplot` + +#### <a name="marginbasic"></a>Basic Usage +- automatically calls `pauvre stats` for each fastq file +- Make the default plot showing the 99th percentile of longest reads + - `pauvre marginplot --fastq miniDSMN15.fastq` + - ![default](files/default_miniDSMN15.png) +- Make a marginal histogram for ONT 2D or 1D^2 cDNA data with a + lower maxlen and higher maxqual. + - `pauvre marginplot --maxlen 4000 --maxqual 25 --lengthbin 50 --fileform pdf png --qualbin 0.5 --fastq miniDSMN15.fastq` + - ![example1](files/miniDSMN15.png) + +#### <a name="marginadjustments"></a>Plot Adjustments + +- Filter out reads with a mean quality less than 5, and a length + less than 800. Zoom in to plot only mean quality of at least 4 and + read length at least 500bp. + - `pauvre marginplot -f miniDSMN15.fastq --filt_minqual 5 --filt_minlen 800 -y --plot_minlen 500 --plot_minqual 4` + - ![test4](files/test4.png) + +#### <a name="marginspecialized"></a>Specialized Options + +- Plot ONT 1D data with a large tail + - `pauvre marginplot --maxlen 100000 --maxqual 15 --lengthbin 500 <myfile>.fastq` +- Get more resolution on lengths + - `pauvre marginplot --maxlen 100000 --lengthbin 5 <myfile>.fastq` + +- Turn off transparency if you just want a white background + - `pauvre marginplot --transparent False <myfile>.fastq` + - Note: transparency is the default behavior + - ![transparency](files/transparency.001.jpeg) + +## <a name="contributors"></a>Contributors + +@conchoecia (Darrin Schultz) +@mebbert (Mark Ebbert) +@wdecoster (Wouter De Coster) diff --git a/pauvre.egg-info/PKG-INFO b/pauvre.egg-info/PKG-INFO new file mode 100644 index 0000000..b709225 --- /dev/null +++ b/pauvre.egg-info/PKG-INFO @@ -0,0 +1,25 @@ +Metadata-Version: 1.2 +Name: pauvre +Version: 0.2.3 +Summary: Tools for plotting Oxford Nanopore and other long-read data. +Home-page: https://github.com/conchoecia/pauvre +Author: Darrin Schultz +Author-email: dts@ucsc.edu +License: UNKNOWN +Description: + 'pauvre' is a package for plotting Oxford Nanopore and other long read data. + The name means 'poor' in French, a play on words to the oft-used 'pore' prefix + for similar packages. This package was designed for python 3, but it might work in + python 2. You can visit the gitub page for more detailed information here: + https://github.com/conchoecia/pauvre + +Platform: UNKNOWN +Classifier: Development Status :: 2 - Pre-Alpha +Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3) +Classifier: Programming Language :: Python :: 3 +Classifier: Programming Language :: Python :: 3.6 +Classifier: Programming Language :: Python :: 3.7 +Classifier: Operating System :: POSIX :: Linux +Classifier: Topic :: Scientific/Engineering :: Bio-Informatics +Classifier: Intended Audience :: Science/Research +Requires-Python: >=3 diff --git a/pauvre.egg-info/SOURCES.txt b/pauvre.egg-info/SOURCES.txt new file mode 100644 index 0000000..f92fd64 --- /dev/null +++ b/pauvre.egg-info/SOURCES.txt @@ -0,0 +1,32 @@ +MANIFEST.in +README.md +setup.py +pauvre/__init__.py +pauvre/bamparse.py +pauvre/browser.py +pauvre/custommargin.py +pauvre/functions.py +pauvre/gfftools.py +pauvre/marginplot.py +pauvre/pauvre_main.py +pauvre/rcparams.py +pauvre/redwood.py +pauvre/stats.py +pauvre/synplot.py +pauvre/version.py +pauvre.egg-info/PKG-INFO +pauvre.egg-info/SOURCES.txt +pauvre.egg-info/dependency_links.txt +pauvre.egg-info/entry_points.txt +pauvre.egg-info/not-zip-safe +pauvre.egg-info/requires.txt +pauvre.egg-info/top_level.txt +pauvre/lsi/Q.py +pauvre/lsi/T.py +pauvre/lsi/__init__.py +pauvre/lsi/helper.py +pauvre/lsi/lsi.py +pauvre/lsi/test.py +pauvre/tests/__init__.py +pauvre/tests/test_synplot.py +scripts/test.sh
\ No newline at end of file diff --git a/pauvre.egg-info/dependency_links.txt b/pauvre.egg-info/dependency_links.txt new file mode 100644 index 0000000..8b13789 --- /dev/null +++ b/pauvre.egg-info/dependency_links.txt @@ -0,0 +1 @@ + diff --git a/pauvre.egg-info/entry_points.txt b/pauvre.egg-info/entry_points.txt new file mode 100644 index 0000000..704e4fa --- /dev/null +++ b/pauvre.egg-info/entry_points.txt @@ -0,0 +1,3 @@ +[console_scripts] +pauvre = pauvre.pauvre_main:main + diff --git a/pauvre.egg-info/not-zip-safe b/pauvre.egg-info/not-zip-safe new file mode 100644 index 0000000..8b13789 --- /dev/null +++ b/pauvre.egg-info/not-zip-safe @@ -0,0 +1 @@ + diff --git a/pauvre.egg-info/requires.txt b/pauvre.egg-info/requires.txt new file mode 100644 index 0000000..2df747c --- /dev/null +++ b/pauvre.egg-info/requires.txt @@ -0,0 +1,6 @@ +matplotlib>=2.0.2 +biopython>=1.68 +pandas>=0.20.1 +numpy>=1.12.1 +scipy +scikit-learn diff --git a/pauvre.egg-info/top_level.txt b/pauvre.egg-info/top_level.txt new file mode 100644 index 0000000..57f5848 --- /dev/null +++ b/pauvre.egg-info/top_level.txt @@ -0,0 +1 @@ +pauvre diff --git a/pauvre/__init__.py b/pauvre/__init__.py new file mode 100644 index 0000000..81f6fbf --- /dev/null +++ b/pauvre/__init__.py @@ -0,0 +1 @@ +from pauvre.version import __version__ diff --git a/pauvre/bamparse.py b/pauvre/bamparse.py new file mode 100644 index 0000000..5791a2c --- /dev/null +++ b/pauvre/bamparse.py @@ -0,0 +1,211 @@ +#!/usr/bin/env python +# -*- coding: utf-8 -*- + +# pauvre - just a pore plotting package +# Copyright (c) 2016-2018 Darrin T. Schultz. All rights reserved. +# twitter @conchoecia +# +# This file is part of pauvre. +# +# pauvre is free software: you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation, either version 3 of the License, or +# (at your option) any later version. +# +# pauvre is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with pauvre. If not, see <http://www.gnu.org/licenses/>. +import pysam +import pandas as pd +import os + +class BAMParse(): + """This class reads in a sam/bam file and constructs a pandas + dataframe of all the relevant information for the reads to pass on + and plot. + """ + def __init__(self, filename, chrid = None, start = None, + stop = None, doubled = None): + self.filename = filename + self.doubled = doubled + #determine if the file is bam or sam + self.filetype = os.path.splitext(self.filename)[1] + #throw an error if the file is not bam or sam + if self.filetype not in ['.bam']: + raise Exception("""You have provided a file with an extension other than + '.bam', please check your command-line arguments""") + #now make sure there is an index file for the bam file + if not os.path.exists("{}.bai".format(self.filename)): + raise Exception("""Your .bam file is there, but it isn't indexed and + there isn't a .bai file to go with it. Use + 'samtools index <yourfile>.bam' to fix it.""") + #now open the file and just call it a sambam file + filetype_dict = {'.sam': '', '.bam': 'b'} + self.sambam = pysam.AlignmentFile(self.filename, "r{}".format(filetype_dict[self.filetype])) + if chrid == None: + self.chrid = self.sambam.references[0] + else: + self.chrid = chrid + self.refindex = self.sambam.references.index(self.chrid) + self.seqlength = self.sambam.lengths[self.refindex] + self.true_seqlength = self.seqlength if not self.doubled else int(self.seqlength/2) + if start == None or stop == None: + self.start = 1 + self.stop = self.true_seqlength + + self.features = self.parse() + self.features.sort_values(by=['POS','MAPLEN'], ascending=[True, False] ,inplace=True) + self.features.reset_index(inplace=True) + self.features.drop('index', 1, inplace=True) + + self.raw_depthmap = self.get_depthmap() + self.features_depthmap = self.get_features_depthmap() + + def get_depthmap(self): + depthmap = [0] * (self.stop - self.start + 1) + for p in self.sambam.pileup(self.chrid, self.start, self.stop): + index = p.reference_pos + if index >= self.true_seqlength: + index -= self.true_seqlength + depthmap[index] += p.nsegments + return depthmap + + def get_features_depthmap(self): + """this method builds a more accurate pileup that is + based on if there is actually a mapped base at any + given position or not. better for long reads and RNA""" + depthmap = [0] * (self.stop - self.start + 1) + print("depthmap is: {} long".format(len(depthmap))) + for index, row in self.features.iterrows(): + thisindex = row["POS"] - self.start + for thistup in row["TUPS"]: + b_type = thistup[1] + b_len = thistup[0] + if b_type == "M": + for j in range(b_len): + #this is necessary to reset the index if we wrap + # around to the beginning + if self.doubled and thisindex == len(depthmap): + thisindex = 0 + depthmap[thisindex] += 1 + thisindex += 1 + elif b_type in ["S", "H", "I"]: + pass + elif b_type in ["D", "N"]: + thisindex += b_len + #this is necessary to reset the index if we wrap + # around to the beginning + if self.doubled and thisindex >= len(depthmap): + thisindex = thisindex - len(depthmap) + + return depthmap + + def parse(self): + data = {'POS': [], 'MAPQ': [], 'TUPS': [] } + for read in self.sambam.fetch(self.chrid, self.start, self.stop): + data['POS'].append(read.reference_start + 1) + data['TUPS'].append(self.cigar_parse(read.cigartuples)) + data['MAPQ'].append(read.mapq) + features = pd.DataFrame.from_dict(data, orient='columns') + features['ALNLEN'] = features['TUPS'].apply(self.aln_len) + features['TRULEN'] = features['TUPS'].apply(self.tru_len) + features['MAPLEN'] = features['TUPS'].apply(self.map_len) + features['POS'] = features['POS'].apply(self.fix_pos) + return features + + def cigar_parse(self, tuples): + """ + arguments: + <tuples> a CIGAR string tuple list in pysam format + + purpose: + This function uses the pysam cigarstring tuples format and returns + a list of tuples in the internal format, [(20, 'M'), (5, "I")], et + cetera. The zeroth element of each tuple is the number of bases for the + CIGAR string feature. The first element of each tuple is the CIGAR + string feature type. + + There are several feature types in SAM/BAM files. See below: + 'M' - match + 'I' - insertion relative to reference + 'D' - deletion relative to reference + 'N' - skipped region from the reference + 'S' - soft clip, not aligned but still in sam file + 'H' - hard clip, not aligned and not in sam file + 'P' - padding (silent deletion from padded reference) + '=' - sequence match + 'X' - sequence mismatch + 'B' - BAM_CBACK (I don't actually know what this is) + + """ + # I used the map values from http://pysam.readthedocs.io/en/latest/api.html#pysam.AlignedSegment + psam_to_char = {0: 'M', 1: 'I', 2: 'D', 3: 'N', 4: 'S', + 5: 'H', 6: 'P', 7: '=', 8: 'X', 9: 'B'} + return [(value, psam_to_char[feature]) for feature, value in tuples] + + def aln_len(self, TUPS): + """ + arguments: + <TUPS> a list of tuples output from the cigar_parse() function. + + purpose: + This returns the alignment length of the read to the reference. + Specifically, it sums the length of all of the matches and deletions. + In effect, this number is length of the region of the reference sequence to + which the read maps. This number is probably the most useful for selecting + reads to visualize in the mapped read plot. + """ + return sum([pair[0] for pair in TUPS if pair[1] not in ['S', 'H', 'I']]) + + def map_len(self, TUPS): + """ + arguments: + <TUPS> a list of tuples output from the cigar_parse() function. + + purpose: + This function returns the map length (all matches and deletions relative to + the reference), plus the unmapped 5' and 3' hard/soft clipped sequences. + This number is useful if you want to visualize how much 5' and 3' sequence + of a read did not map to the reference. For example, poor quality 5' and 3' + tails are common in Nanopore reads. + """ + return sum([pair[0] for pair in TUPS if pair[1] not in ['I']]) + + def tru_len(self, TUPS): + """ + arguments: + <TUPS> a list of tuples output from the cigar_parse() function. + + purpose: + This function returns the total length of the read, including insertions, + deletions, matches, soft clips, and hard clips. This is useful for + comparing to the map length or alignment length to see what percentage of + the read aligned to the reference. + """ + return sum([pair[0] for pair in TUPS]) + + def fix_pos(self, start_index): + """ + arguments: + an int + + purpose: + When using a doubled SAMfile, any reads that start after the first copy + of the reference risk running over the plotting window, causing the program + to crash. This function corrects for this issue by changing the start site + of the read. + + Note: this will probably break the program if not using a double alignment + since no reads would map past half the length of the single reference + """ + if self.doubled: + if start_index > int(self.seqlength/2): + return start_index - int(self.seqlength/2) - 1 + else: + return start_index + else: + return start_index diff --git a/pauvre/browser.py b/pauvre/browser.py new file mode 100644 index 0000000..d6b96ec --- /dev/null +++ b/pauvre/browser.py @@ -0,0 +1,426 @@ +#!/usr/bin/env python +# -*- coding: utf-8 -*- + +# pauvre - a pore plotting package +# Copyright (c) 2016-2018 Darrin T. Schultz. All rights reserved. +# +# This file is part of pauvre. +# +# pauvre is free software: you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation, either version 3 of the License, or +# (at your option) any later version. +# +# pauvre is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with pauvre. If not, see <http://www.gnu.org/licenses/>. + +# following this tutorial to install helvetica +# https://github.com/olgabot/sciencemeetproductivity.tumblr.com/blob/master/posts/2012/11/how-to-set-helvetica-as-the-default-sans-serif-font-in.md +global hfont +hfont = {'fontname':'Helvetica'} + +import matplotlib +matplotlib.use('Agg') +import matplotlib.pyplot as plt +from matplotlib.colors import LinearSegmentedColormap, Normalize +import matplotlib.patches as patches + + +import gffutils +import pandas as pd +pd.set_option('display.max_columns', 500) +pd.set_option('display.width', 1000) +import numpy as np +import os +import pauvre.rcparams as rc +from pauvre.functions import GFFParse, print_images, timestamp +from pauvre import gfftools +from pauvre.lsi.lsi import intersection +from pauvre.bamparse import BAMParse +import progressbar +import platform +import sys +import time + +# Biopython stuff +from Bio import SeqIO +import Bio.SubsMat.MatrixInfo as MI + + +class PlotCommand: + def __init__(self, plotcmd, REF): + self.ref = REF + self.style_choices = [] + self.cmdtype = "" + self.path = "" + self.style = "" + self.options = "" + self._parse_cmd(plotcmd) + + def _parse_cmd(self, plotcmd): + chunks = plotcmd.split(":") + if chunks[0] == "ref": + self.cmdtype = "ref" + if len(chunks) < 2: + self._len_error() + self.path = self.ref + self.style = chunks[1] + self.style_choices = ["normal", "colorful"] + self._check_style_choices() + if len(chunks) > 2: + self.options = chunks[2].split(",") + elif chunks[0] in ["bam", "peptides"]: + if len(chunks) < 3: + self._len_error() + self.cmdtype = chunks[0] + self.path = os.path.abspath(os.path.expanduser(chunks[1])) + self.style = chunks[2] + if self.cmdtype == "bam": + self.style_choices = ["depth", "reads"] + else: + self.style_choices = ["depth"] + self._check_style_choices() + if len(chunks) > 3: + self.options = chunks[3].split(",") + elif chunks[0] in ["gff3"]: + if len(chunks) < 2: + self._len_error() + self.cmdtype = chunks[0] + self.path = os.path.abspath(os.path.expanduser(chunks[1])) + if len(chunks) > 2: + self.options = chunks[2].split(",") + + + def _len_error(self): + raise IOError("""You selected {} to plot, + but need to specify the style at least.""".format(self.cmdtype)) + def _check_style_choices(self): + if self.style not in self.style_choices: + raise IOError("""You selected {} style for + ref. You must select from {}. """.format( + self.style, self.style_choices)) + +global dna_color +dna_color = {"A": (81/255, 87/255, 251/255, 1), + "T": (230/255, 228/255, 49/255, 1), + "G": (28/255, 190/255, 32/255, 1), + "C": (220/255, 10/255, 23/255, 1)} + +#these are the line width for the different cigar string flags. +# usually, only M, I, D, S, and H appear in bwa mem output +global widthDict +widthDict = {'M':0.45, # match + 'I':0.9, # insertion relative to reference + 'D':0.05, # deletion relative to reference + 'N':0.1, # skipped region from the reference + 'S':0.1, # soft clip, not aligned but still in sam file + 'H':0.1, # hard clip, not aligned and not in sam file + 'P':0.1, # padding (silent deletion from padded reference) + '=':0.1, # sequence match + 'X':0.1} # sequence mismatch + + +global richgrey +richgrey = (60/255, 54/255, 69/255, 1) + +def plot_ref(panel, chrid, start, stop, thiscmd): + panel.set_xlim([start, stop]) + panel.set_ylim([-2.5, 2.5]) + panel.set_xticks([int(val) for val in np.linspace(start, stop, 6)]) + if thiscmd.style == "colorful": + thisseq = "" + for record in SeqIO.parse(thiscmd.ref, "fasta"): + if record.id == chrid: + thisseq = record.seq[start-1: stop] + for i in range(len(thisseq)): + left = start + i + bottom = -0.5 + width = 1 + height = 1 + rect = patches.Rectangle((left, bottom), + width, height, + linewidth = 0, + facecolor = dna_color[thisseq[i]] ) + panel.add_patch(rect) + return panel + +def safe_log10(value): + try: + logval = np.log10(value) + except: + logval = 0 + return logval + +def plot_bam(panel, chrid, start, stop, thiscmd): + bam = BAMParse(thiscmd.path) + panel.set_xlim([start, stop]) + if thiscmd.style == "depth": + maxdepth = max(bam.features_depthmap) + maxdepthlog = safe_log10(maxdepth) + if "log" in thiscmd.options: + panel.set_ylim([-maxdepthlog, maxdepthlog]) + panel.set_yticks([int(val) for val in np.linspace(0, maxdepthlog, 2)]) + + else: + panel.set_yticks([int(val) for val in np.linspace(0, maxdepth, 2)]) + if "c" in thiscmd.options: + panel.set_ylim([-maxdepth, maxdepth]) + else: + panel.set_ylim([0, maxdepth]) + + + for i in range(len(bam.features_depthmap)): + left = start + i + width = 1 + if "c" in thiscmd.options and "log" in thiscmd.options: + bottom = -1 * safe_log10(bam.features_depthmap[i]) + height = safe_log10(bam.features_depthmap[i]) * 2 + elif "c" in thiscmd.options and "log" not in thiscmd.options: + bottom = -bam.features_depthmap[i] + height = bam.features_depthmap[i] * 2 + else: + bottom = 0 + height = bam.features_depthmap[i] + if height > 0: + rect = patches.Rectangle((left, bottom), + width, height, + linewidth = 0, + facecolor = richgrey ) + panel.add_patch(rect) + + if thiscmd.style == "reads": + #If we're plotting reads, we don't need y-axis + panel.tick_params(bottom="off", labelbottom="off", + left ="off", labelleft = "off") + reads = bam.features.copy() + panel.set_xlim([start, stop]) + direction = "for" + if direction == 'for': + bav = {"by":['POS','MAPLEN'], "asc": [True, False]} + direction= 'rev' + elif direction == 'rev': + bav = {"by":['POS','MAPLEN'], "asc": [True, False]} + direction = 'for' + reads.sort_values(by=bav["by"], ascending=bav['asc'],inplace=True) + reads.reset_index(drop=True, inplace=True) + + depth_count = -1 + plotind = start + while len(reads) > 0: + #depth_count -= 1 + #print("len of reads is {}".format(len(reads))) + potential = reads.query("POS >= {}".format(plotind)) + if len(potential) == 0: + readsindex = 0 + #print("resetting plot ind from {} to {}".format( + # plotind, reads.loc[readsindex, "POS"])) + depth_count -= 1 + + else: + readsindex = int(potential.index.values[0]) + #print("pos of potential is {}".format(reads.loc[readsindex, "POS"])) + plotind = reads.loc[readsindex, "POS"] + + for TUP in reads.loc[readsindex, "TUPS"]: + b_type = TUP[1] + b_len = TUP[0] + #plotting params + # left same for all. + left = plotind + bottom = depth_count + height = widthDict[b_type] + width = b_len + plot = True + color = richgrey + if b_type in ["H", "S"]: + """We don't plot hard or sort clips - like IGV""" + plot = False + pass + elif b_type == "M": + """just plot matches normally""" + plotind += b_len + elif b_type in ["D", "P", "=", "X"]: + """deletions get an especially thin line""" + plotind += b_len + elif b_type == "I": + """insertions get a special purple bar""" + left = plotind - (b_len/2) + color = (200/255, 41/255, 226/255, 0.5) + elif b_type == "N": + """skips for splice junctions, line in middle""" + bottom += (widthDict["M"]/2) - (widthDict["N"]/2) + plotind += b_len + if plot: + rect = patches.Rectangle((left, bottom), + width, height, + linewidth = 0, + facecolor = color ) + panel.add_patch(rect) + reads.drop([readsindex], inplace=True) + reads.reset_index(drop = True, inplace=True) + panel.set_ylim([depth_count, 0]) + + return panel + +def plot_gff3(panel, chrid, start, stop, thiscmd): + + db = gffutils.create_db(thiscmd.path, ":memory:") + bottom = 0 + genes_to_plot = [thing.id + for thing in db.region( + region=(chrid, start, stop), + completely_within=False) + if thing.featuretype == "gene" ] + #print("genes to plot are: " genes_to_plot) + panel.set_xlim([start, stop]) + # we don't need labels on one of the axes + #panel.tick_params(bottom="off", labelbottom="off", + # left ="off", labelleft = "off") + + + ticklabels = [] + for geneid in genes_to_plot: + plotnow = False + if "id" in thiscmd.options and geneid in thiscmd.options: + plotnow = True + elif "id" not in thiscmd.options: + plotnow = True + if plotnow: + ticklabels.append(geneid) + if db[geneid].strand == "+": + panel = gfftools._plot_left_to_right_introns_top(panel, geneid, db, + bottom, text = None) + bottom += 1 + else: + raise IOError("""Plotting things on the reverse strand is + not yet implemented""") + #print("tick labels are", ticklabels) + panel.set_ylim([0, len(ticklabels)]) + yticks_vals = [val for val in np.linspace(0.5, len(ticklabels) - 0.5, len(ticklabels))] + panel.set_yticks(yticks_vals) + print("bottom is: ", bottom) + print("len tick labels is: ", len(ticklabels)) + print("intervals are: ", yticks_vals) + panel.set_yticklabels(ticklabels) + + return panel + +def browser(args): + rc.update_rcParams() + print(args) + + # if the user forgot to add a reference, they must add one + if args.REF is None: + raise IOError("You must specify the reference fasta file") + + # if the user forgot to add the start and stop, + # Print the id and the start/stop + if args.CHR is None or args.START is None or args.STOP is None: + print("""\n You have forgotten to specify the chromosome, + the start coordinate, or the stop coordinate to plot. + Try something like '-c chr1 --start 20 --stop 2000'. + Here is a list of chromosome ids and their lengths + from the provided reference. The minimum start coordinate + is one and the maximum stop coordinate is the length of + the chromosome.\n\nID\tLength""") + for record in SeqIO.parse(args.REF, "fasta"): + print("{}\t{}".format(record.id, len(record.seq))) + sys.exit(0) + + if args.CMD is None: + raise IOError("You must specify a plotting command.") + + # now we parse each set of commands + commands = [PlotCommand(thiscmd, args.REF) + for thiscmd in reversed(args.CMD)] + + # set the figure dimensions + if args.ratio: + figWidth = args.ratio[0] + 1 + figHeight = args.ratio[1] + 1 + #set the panel dimensions + panelWidth = args.ratio[0] + panelHeight = args.ratio[1] + + else: + figWidth = 7 + figHeight = len(commands) + 2 + #set the panel dimensions + panelWidth = 5 + # panel margin x 2 + panel height = total vertical height + panelHeight = 0.8 + panelMargin = 0.1 + + figure = plt.figure(figsize=(figWidth,figHeight)) + + #find the margins to center the panel in figure + leftMargin = (figWidth - panelWidth)/2 + bottomMargin = ((figHeight - panelHeight)/2) + panelMargin + + plot_dict = {"ref": plot_ref, + "bam": plot_bam, + "gff3": plot_gff3 + #"peptides": plot_peptides + } + + panels = [] + for i in range(len(commands)): + thiscmd = commands[i] + if thiscmd.cmdtype in ["gff3", "ref", "peptides"] \ + or thiscmd.style == "depth" \ + or "narrow" in thiscmd.options: + temp_panelHeight = 0.5 + else: + temp_panelHeight = panelHeight + panels.append( plt.axes([leftMargin/figWidth, #left + bottomMargin/figHeight, #bottom + panelWidth/figWidth, #width + temp_panelHeight/figHeight]) #height + ) + panels[i].tick_params(axis='both',which='both',\ + bottom='off', labelbottom='off',\ + left='on', labelleft='on', \ + right='off', labelright='off',\ + top='off', labeltop='off') + if thiscmd.cmdtype == "ref": + panels[i].tick_params(bottom='on', labelbottom='on') + + + + #turn off some of the axes + panels[i].spines["top"].set_visible(False) + panels[i].spines["bottom"].set_visible(False) + panels[i].spines["right"].set_visible(False) + panels[i].spines["left"].set_visible(False) + + panels[i] = plot_dict[thiscmd.cmdtype](panels[i], args.CHR, + args.START, args.STOP, + thiscmd) + + bottomMargin = bottomMargin + temp_panelHeight + (2 * panelMargin) + + # Print image(s) + if args.BASENAME is None: + file_base = 'browser_{}.png'.format(timestamp()) + else: + file_base = args.BASENAME + path = None + if args.path: + path = args.path + transparent = args.transparent + print_images( + base_output_name=file_base, + image_formats=args.fileform, + dpi=args.dpi, + no_timestamp = kwargs["no_timestamp"], + path = path, + transparent=transparent) + + +def run(args): + browser(args) diff --git a/pauvre/custommargin.py b/pauvre/custommargin.py new file mode 100644 index 0000000..d522e42 --- /dev/null +++ b/pauvre/custommargin.py @@ -0,0 +1,441 @@ +#!/usr/bin/env python +# -*- coding: utf-8 -*- + +# pauvre - just a pore PhD student's plotting package +# Copyright (c) 2016-2017 Darrin T. Schultz. All rights reserved. +# +# This file is part of pauvre. +# +# pauvre is free software: you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation, either version 3 of the License, or +# (at your option) any later version. +# +# pauvre is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with pauvre. If not, see <http://www.gnu.org/licenses/>. + +import ast +import matplotlib +matplotlib.use('Agg') +import matplotlib.pyplot as plt +import matplotlib.patches as mplpatches +from matplotlib.colors import LinearSegmentedColormap +import numpy as np +import pandas as pd +import os.path as opath +from sys import stderr +from pauvre.functions import print_images +from pauvre.marginplot import generate_panel +from pauvre.stats import stats +import pauvre.rcparams as rc +import sys +import logging + +# logging +logger = logging.getLogger('pauvre') + +def _generate_histogram_bin_patches(panel, bins, bin_values, horizontal=True): + """This helper method generates the histogram that is added to the panel. + + In this case, horizontal = True applies to the mean quality histogram. + So, horizontal = False only applies to the length histogram. + """ + l_width = 0.0 + f_color = (0.5, 0.5, 0.5) + e_color = (0, 0, 0) + if horizontal: + for step in np.arange(0, len(bin_values), 1): + left = bins[step] + bottom = 0 + width = bins[step + 1] - bins[step] + height = bin_values[step] + hist_rectangle = mplpatches.Rectangle((left, bottom), width, height, + linewidth=l_width, + facecolor=f_color, + edgecolor=e_color) + panel.add_patch(hist_rectangle) + else: + for step in np.arange(0, len(bin_values), 1): + left = 0 + bottom = bins[step] + width = bin_values[step] + height = bins[step + 1] - bins[step] + + hist_rectangle = mplpatches.Rectangle((left, bottom), width, height, + linewidth=l_width, + facecolor=f_color, + edgecolor=e_color) + panel.add_patch(hist_rectangle) + + +def generate_histogram(panel, data_list, min_plot_val, max_plot_val, + bin_interval, hist_horizontal=True, + left_spine=True, bottom_spine=True, + top_spine=False, right_spine=False, x_label=None, + y_label=None): + + bins = np.arange(0, max_plot_val, bin_interval) + + bin_values, bins2 = np.histogram(data_list, bins) + + # hist_horizontal is used for quality + if hist_horizontal: + panel.set_xlim([min_plot_val, max_plot_val]) + panel.set_ylim([0, max(bin_values * 1.1)]) + # and hist_horizontal == Fale is for read length + else: + panel.set_xlim([0, max(bin_values * 1.1)]) + panel.set_ylim([min_plot_val, max_plot_val]) + + # Generate histogram bin patches, depending on whether we're plotting + # vertically or horizontally + _generate_histogram_bin_patches(panel, bins, bin_values, hist_horizontal) + + panel.spines['left'].set_visible(left_spine) + panel.spines['bottom'].set_visible(bottom_spine) + panel.spines['top'].set_visible(top_spine) + panel.spines['right'].set_visible(right_spine) + + if y_label is not None: + panel.set_ylabel(y_label) + if x_label is not None: + panel.set_xlabel(x_label) + +def generate_square_map(panel, data_frame, plot_min_y, plot_min_x, + plot_max_y, plot_max_x, color, + xcol, ycol, **kwargs): + """This generates the heatmap panels using squares. Everything is + quantized by ints. + """ + panel.set_xlim([plot_min_x, plot_max_x]) + panel.set_ylim([plot_min_y, plot_max_y]) + tempdf = data_frame[[xcol, ycol]] + data_frame = tempdf.astype(int) + + querystring = "{}<={} and {}<={}".format(plot_min_y, ycol, plot_min_x, xcol) + print(" - Filtering squares with {}".format(querystring)) + square_this = data_frame.query(querystring) + + querystring = "{}<{} and {}<{}".format(ycol, plot_max_y, xcol, plot_max_x) + print(" - Filtering squares with {}".format(querystring)) + square_this = square_this.query(querystring) + + counts = square_this.groupby([xcol, ycol]).size().reset_index(name='counts') + for index, row in counts.iterrows(): + x_pos = row[xcol] + y_pos = row[ycol] + thiscolor = color(row["counts"]/(counts["counts"].max())) + rectangle1=mplpatches.Rectangle((x_pos,y_pos),1,1, + linewidth=0,\ + facecolor=thiscolor) + panel.add_patch(rectangle1) + + all_counts = counts["counts"] + return all_counts + +def generate_heat_map(panel, data_frame, plot_min_y, plot_min_x, + plot_max_y, plot_max_x, color, + xcol, ycol, **kwargs): + panel.set_xlim([plot_min_x, plot_max_x]) + panel.set_ylim([plot_min_y, plot_max_y]) + + querystring = "{}<={} and {}<={}".format(plot_min_y, ycol, plot_min_x, xcol) + print(" - Filtering hexmap with {}".format(querystring)) + hex_this = data_frame.query(querystring) + + querystring = "{}<{} and {}<{}".format(ycol, plot_max_y, xcol, plot_max_x) + print(" - Filtering hexmap with {}".format(querystring)) + hex_this = hex_this.query(querystring) + + # This single line controls plotting the hex bins in the panel + hex_vals = panel.hexbin(hex_this[xcol], hex_this[ycol], gridsize=49, + linewidths=0.0, cmap=color) + for each in panel.spines: + panel.spines[each].set_visible(False) + + counts = hex_vals.get_array() + return counts + +def generate_legend(panel, counts, color): + # completely custom for more control + panel.set_xlim([0, 1]) + panel.set_ylim([0, 1000]) + panel.set_yticks([int(x) for x in np.linspace(0, 1000, 6)]) + panel.set_yticklabels([int(x) for x in np.linspace(0, max(counts), 6)]) + for i in np.arange(0, 1001, 1): + rgba = color(i / 1001) + alpha = rgba[-1] + facec = rgba[0:3] + hist_rectangle = mplpatches.Rectangle((0, i), 1, 1, + linewidth=0.0, + facecolor=facec, + edgecolor=(0, 0, 0), + alpha=alpha) + panel.add_patch(hist_rectangle) + panel.spines['top'].set_visible(False) + panel.spines['left'].set_visible(False) + panel.spines['bottom'].set_visible(False) + panel.yaxis.set_label_position("right") + panel.set_ylabel('count') + +def custommargin(df, **kwargs): + rc.update_rcParams() + + # 250, 231, 34 light yellow + # 67, 1, 85 + # R=np.linspace(65/255,1,101) + # G=np.linspace(0/255, 231/255, 101) + # B=np.linspace(85/255, 34/255, 101) + # R=65/255, G=0/255, B=85/255 + Rf = 65 / 255 + Bf = 85 / 255 + pdict = {'red': ((0.0, Rf, Rf), + (1.0, Rf, Rf)), + 'green': ((0.0, 0.0, 0.0), + (1.0, 0.0, 0.0)), + 'blue': ((0.0, Bf, Bf), + (1.0, Bf, Bf)), + 'alpha': ((0.0, 0.0, 0.0), + (1.0, 1.0, 1.0)) + } + # Now we will use this example to illustrate 3 ways of + # handling custom colormaps. + # First, the most direct and explicit: + purple1 = LinearSegmentedColormap('Purple1', pdict) + + # set the figure dimensions + fig_width = 1.61 * 3 + fig_height = 1 * 3 + fig = plt.figure(figsize=(fig_width, fig_height)) + + # set the panel dimensions + heat_map_panel_width = fig_width * 0.5 + heat_map_panel_height = heat_map_panel_width * 0.62 + + # find the margins to center the panel in figure + fig_left_margin = fig_bottom_margin = (1 / 6) + + # lengthPanel + y_panel_width = (1 / 8) + + # the color Bar parameters + legend_panel_width = (1 / 24) + + # define padding + h_padding = 0.02 + v_padding = 0.05 + + # Set whether to include y-axes in histograms + print(" - Setting panel options.", file = sys.stderr) + if kwargs["Y_AXES"]: + y_bottom_spine = True + y_bottom_tick = True + y_bottom_label = True + x_left_spine = True + x_left_tick = True + x_left_label = True + x_y_label = 'Count' + else: + y_bottom_spine = False + y_bottom_tick = False + y_bottom_label = False + x_left_spine = False + x_left_tick = False + x_left_label = False + x_y_label = None + + panels = [] + + # Quality histogram panel + print(" - Generating the x-axis panel.", file = sys.stderr) + x_panel_left = fig_left_margin + y_panel_width + h_padding + x_panel_width = heat_map_panel_width / fig_width + x_panel_height = y_panel_width * fig_width / fig_height + x_panel = generate_panel(x_panel_left, + fig_bottom_margin, + x_panel_width, + x_panel_height, + left_tick_param=x_left_tick, + label_left_tick_param=x_left_label) + panels.append(x_panel) + + # y histogram panel + print(" - Generating the y-axis panel.", file = sys.stderr) + y_panel_bottom = fig_bottom_margin + x_panel_height + v_padding + y_panel_height = heat_map_panel_height / fig_height + y_panel = generate_panel(fig_left_margin, + y_panel_bottom, + y_panel_width, + y_panel_height, + bottom_tick_param=y_bottom_tick, + label_bottom_tick_param=y_bottom_label) + panels.append(y_panel) + + # Heat map panel + heat_map_panel_left = fig_left_margin + y_panel_width + h_padding + heat_map_panel_bottom = fig_bottom_margin + x_panel_height + v_padding + print(" - Generating the heat map panel.", file = sys.stderr) + heat_map_panel = generate_panel(heat_map_panel_left, + heat_map_panel_bottom, + heat_map_panel_width / fig_width, + heat_map_panel_height / fig_height, + bottom_tick_param='off', + label_bottom_tick_param='off', + left_tick_param='off', + label_left_tick_param='off') + panels.append(heat_map_panel) + heat_map_panel.set_title(kwargs["title"]) + + # Legend panel + print(" - Generating the legend panel.", file = sys.stderr) + legend_panel_left = fig_left_margin + y_panel_width + \ + heat_map_panel_width / fig_width + h_padding + legend_panel_bottom = fig_bottom_margin + x_panel_height + v_padding + legend_panel_height = heat_map_panel_height / fig_height + legend_panel = generate_panel(legend_panel_left, legend_panel_bottom, + legend_panel_width, legend_panel_height, + bottom_tick_param=False, + label_bottom_tick_param=False, + left_tick_param=False, + label_left_tick_param=False, + right_tick_param=True, + label_right_tick_param=True) + panels.append(legend_panel) + + # + # Everything above this is just to set up the panels + # + ################################################################## + + # Set max and min viewing window for the xaxis + if kwargs["plot_max_x"]: + plot_max_x = kwargs["plot_max_x"] + else: + if kwargs["square"]: + plot_max_x = df[kwargs["xcol"]].max() + plot_max_x = max(np.ceil(df[kwargs["xcol"]])) + plot_min_x = kwargs["plot_min_x"] + + # Set x bin sizes + if kwargs["xbin"]: + x_bin_interval = kwargs["xbin"] + else: + # again, this is just based on what looks good from experience + x_bin_interval = 1 + + # Generate x histogram + print(" - Generating the x-axis histogram.", file = sys.stderr) + generate_histogram(panel = x_panel, + data_list = df[kwargs['xcol']], + min_plot_val = plot_min_x, + max_plot_val = plot_max_x, + bin_interval = x_bin_interval, + hist_horizontal = True, + x_label=kwargs['xcol'], + y_label=x_y_label, + left_spine=x_left_spine) + + # Set max and min viewing window for the y axis + if kwargs["plot_max_y"]: + plot_max_y = kwargs["plot_max_y"] + else: + if kwargs["square"]: + plot_max_y = df[kwargs["ycol"]].max() + else: + plot_max_y = max(np.ceil(df[kwargs["ycol"]])) + + plot_min_y = kwargs["plot_min_y"] + # Set y bin sizes + if kwargs["ybin"]: + y_bin_interval = kwargs["ybin"] + else: + y_bin_interval = 1 + + # Generate y histogram + print(" - Generating the y-axis histogram.", file = sys.stderr) + generate_histogram(panel = y_panel, + data_list = df[kwargs['ycol']], + min_plot_val = plot_min_y, + max_plot_val = plot_max_y, + bin_interval = y_bin_interval, + hist_horizontal = False, + y_label = kwargs['ycol'], + bottom_spine = y_bottom_spine) + + # Generate heat map + if kwargs["square"]: + print(" - Generating the square heatmap.", file = sys.stderr) + counts = generate_square_map(panel = heat_map_panel, + data_frame = df, + plot_min_y = plot_min_y, + plot_min_x = plot_min_x, + plot_max_y = plot_max_y, + plot_max_x = plot_max_x, + color = purple1, + xcol = kwargs["xcol"], + ycol = kwargs["ycol"]) + else: + print(" - Generating the heatmap.", file = sys.stderr) + counts = generate_heat_map(panel = heat_map_panel, + data_frame = df, + plot_min_y = plot_min_y, + plot_min_x = plot_min_x, + plot_max_y = plot_max_y, + plot_max_x = plot_max_x, + color = purple1, + xcol = kwargs["xcol"], + ycol = kwargs["ycol"]) + + # Generate legend + print(" - Generating the legend.", file = sys.stderr) + generate_legend(legend_panel, counts, purple1) + + # inform the user of the plotting window if not quiet mode + #if not kwargs["QUIET"]: + # print("""plotting in the following window: + # {0} <= Q-score (x-axis) <= {1} + # {2} <= length (y-axis) <= {3}""".format( + # plot_min_x, plot_max_x, min_plot_val, max_plot_val), + # file=stderr) + + # Print image(s) + if kwargs["output_base_name"] is None: + file_base = "custommargin" + else: + file_base = kwargs["output_base_name"] + + print(" - Saving your images", file = sys.stderr) + print_images( + base =file_base, + image_formats=kwargs["fileform"], + dpi=kwargs["dpi"], + no_timestamp = kwargs["no_timestamp"], + transparent= kwargs["no_transparent"]) + +def run(args): + print(args) + if not opath.exists(args.input_file): + raise IOError("The input file does not exist: {}".format( + args.input_file)) + df = pd.read_csv(args.input_file, header='infer', sep='\t') + # make sure that the column names that were specified are actually + # in the dataframe + if args.xcol not in df.columns: + raise IOError("""The x-column name that you specified, {}, is not in the + dataframe column names: {}""".format(args.xcol, df.columns)) + if args.ycol not in df.columns: + raise IOError("""The y-column name that you specified, {}, is not in the + dataframe column names: {}""".format(args.ycol, df.columns)) + print(" - Successfully read csv file. Here are a few lines:", + file = sys.stderr) + print(df.head(), file = sys.stderr) + print(" - Plotting {} on the x-axis".format(args.xcol),file=sys.stderr) + print(df[args.xcol].head(), file = sys.stderr) + print(" - Plotting {} on the y-axis".format(args.ycol),file=sys.stderr) + print(df[args.ycol].head(), file = sys.stderr) + custommargin(df=df.dropna(), **vars(args)) diff --git a/pauvre/functions.py b/pauvre/functions.py new file mode 100644 index 0000000..c6ca2b8 --- /dev/null +++ b/pauvre/functions.py @@ -0,0 +1,347 @@ +#!/usr/bin/env python +# -*- coding: utf-8 -*- + +# pauvre +# Copyright (c) 2016-2020 Darrin T. Schultz. +# +# This file is part of pauvre. +# +# pauvre is free software: you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation, either version 3 of the License, or +# (at your option) any later version. +# +# pauvre is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with pauvre. If not, see <http://www.gnu.org/licenses/>. + +from Bio import SeqIO +import copy +import gzip +import matplotlib.pyplot as plt +import numpy as np +import os +import pandas as pd +from sys import stderr +import time + + +# this makes opening files more robust for different platforms +# currently only used in GFFParse +import codecs + +import warnings + +def print_images(base, image_formats, dpi, + transparent=False, no_timestamp = False): + """ + Save the plot in multiple formats, with or without transparency + and with or without timestamps. + """ + for fmt in image_formats: + if no_timestamp: + out_name = "{0}.{1}".format(base, fmt) + else: + out_name = "{0}_{1}.{2}".format(base, timestamp(), fmt) + try: + if fmt == 'png': + plt.savefig(out_name, dpi=dpi, transparent=transparent) + else: + plt.savefig(out_name, format=fmt, transparent=transparent) + except PermissionError: + # thanks to https://github.com/wdecoster for the suggestion + print("""You don't have permission to save pauvre plots to this + directory. Try changing the directory and running the script again!""") + +class GFFParse(): + def __init__(self, filename, stop_codons=None, species=None): + self.filename = filename + self.samplename = os.path.splitext(os.path.basename(filename))[0] + self.species = species + self.featureDict = {"name": [], + "featType": [], + "start": [], + "stop": [], + "strand": []} + gffnames = ["sequence", "source", "featType", "start", "stop", "dunno1", + "strand", "dunno2", "tags"] + self.features = pd.read_csv(self.filename, comment='#', + sep='\t', names=gffnames) + self.features['name'] = self.features['tags'].apply(self._get_name) + self.features.drop('dunno1', 1, inplace=True) + self.features.drop('dunno2', 1, inplace=True) + self.features.reset_index(inplace=True, drop=True) + # warn the user if there are CDS or gene entries not divisible by three + self._check_triplets() + # sort the database by start + self.features.sort_values(by='start', ascending=True, inplace=True) + if stop_codons: + strip_codons = ['gene', 'CDS'] + # if the direction is forward, subtract three from the stop to bring it closer to the start + self.features.loc[(self.features['featType'].isin(strip_codons)) & (self.features['strand'] == '+'), 'stop'] =\ + self.features.loc[(self.features['featType'].isin(strip_codons)) + & (self.features['strand'] == '+'), 'stop'] - 3 + # if the direction is reverse, add three to the start (since the coords are flip-flopped) + self.features.loc[(self.features['featType'].isin(strip_codons)) & (self.features['strand'] == '-'), 'start'] =\ + self.features.loc[(self.features['featType'].isin(strip_codons)) + & (self.features['strand'] == '-'), 'start'] + 3 + self.features['center'] = self.features['start'] + \ + ((self.features['stop'] - self.features['start']) / 2) + # we need to add one since it doesn't account for the last base otherwise + self.features['width'] = abs(self.features['stop'] - self.features['start']) + 1 + self.features['lmost'] = self.features.apply(self._determine_lmost, axis=1) + self.features['rmost'] = self.features.apply(self._determine_rmost, axis=1) + self.features['track'] = 0 + if len(self.features.loc[self.features['tags'] == "Is_circular=true", 'stop']) < 1: + raise IOError("""The GFF file needs to have a tag ending in "Is_circular=true" + with a region from 1 to the number of bases in the mitogenome + + example: + Bf201311 Geneious region 1 13337 . + 0 Is_circular=true + """) + self.seqlen = int(self.features.loc[self.features['tags'] == "Is_circular=true", 'stop']) + self.features.reset_index(inplace=True, drop=True) + #print("float", self.features.loc[self.features['name'] == 'COX1', 'center']) + #print("float cat", len(self.features.loc[self.features['name'] == 'CAT', 'center'])) + # print(self.features) + # print(self.seqlen) + + def set_features(self, new_features): + """all this does is reset the features pandas dataframe""" + self.features = new_features + + def get_unique_genes(self): + """This returns a series of gene names""" + plottable = self.features.query( + "featType != 'tRNA' and featType != 'region' and featType != 'source'") + return set(plottable['name'].unique()) + + def shuffle(self): + """ + this returns a list of all possible shuffles of features. + A shuffle is when the frontmost bit of coding + noncoding DNA up + until the next bit of coding DNA is removed and tagged on the + end of the sequence. In this case this process is represented by + shifting gff coordinates. + """ + shuffles = [] + # get the index of the first element + # get the index of the next thing + # subtract the indices of everything, then reset the ones that are below + # zero + done = False + shuffle_features = self.features[self.features['featType'].isin( + ['gene', 'rRNA', 'CDS', 'tRNA'])].copy(deep=True) + # we first add the shuffle features without reorganizing + # print("shuffle\n",shuffle_features) + add_first = copy.deepcopy(self) + add_first.set_features(shuffle_features) + shuffles.append(add_first) + # first gene is changed with every iteration + first_gene = list(shuffle_features['name'])[0] + # absolute first is the first gene in the original gff file, used to determine if we are done in this while loop + absolute_first = list(shuffle_features['name'])[0] + while not done: + # We need to prevent the case of shuffling in the middle of + # overlapped genes. Do this by ensuring that the the start of + # end of first gene is less than the start of the next gene. + first_stop = int(shuffle_features.loc[shuffle_features['name'] == first_gene, 'stop']) + next_gene = "" + for next_index in range(1, len(shuffle_features)): + # get the df of the next list, if len == 0, then it is a tRNA and we need to go to the next index + next_gene_df = list( + shuffle_features[shuffle_features['featType'].isin(['gene', 'rRNA', 'CDS'])]['name']) + if len(next_gene_df) != 0: + next_gene = next_gene_df[next_index] + next_start = int(shuffle_features.loc[shuffle_features['name'] == next_gene, 'start']) + #print("looking at {}, prev_stop is {}, start is {}".format( + # next_gene, first_stop, next_start)) + #print(shuffle_features[shuffle_features['featType'].isin(['gene', 'rRNA', 'CDS'])]) + # if the gene we're looking at and the next one don't overlap, move on + if first_stop < next_start: + break + #print("next_gene before checking for first is {}".format(next_gene)) + if next_gene == absolute_first: + done = True + break + # now we can reset the first gene for the next iteration + first_gene = next_gene + shuffle_features = shuffle_features.copy(deep=True) + # figure out where the next start point is going to be + next_start = int(shuffle_features.loc[shuffle_features['name'] == next_gene, 'start']) + #print('next gene: {}'.format(next_gene)) + shuffle_features['start'] = shuffle_features['start'] - next_start + 1 + shuffle_features['stop'] = shuffle_features['stop'] - next_start + 1 + shuffle_features['center'] = shuffle_features['center'] - next_start + 1 + # now correct the values that are less than 0 + shuffle_features.loc[shuffle_features['start'] < 1, + 'start'] = shuffle_features.loc[shuffle_features['start'] < 1, 'start'] + self.seqlen + shuffle_features.loc[shuffle_features['stop'] < 1, 'stop'] = shuffle_features.loc[shuffle_features['stop'] + < 1, 'start'] + shuffle_features.loc[shuffle_features['stop'] < 1, 'width'] + shuffle_features['center'] = shuffle_features['start'] + \ + ((shuffle_features['stop'] - shuffle_features['start']) / 2) + shuffle_features['lmost'] = shuffle_features.apply(self._determine_lmost, axis=1) + shuffle_features['rmost'] = shuffle_features.apply(self._determine_rmost, axis=1) + shuffle_features.sort_values(by='start', ascending=True, inplace=True) + shuffle_features.reset_index(inplace=True, drop=True) + new_copy = copy.deepcopy(self) + new_copy.set_features(shuffle_features) + shuffles.append(new_copy) + #print("len shuffles: {}".format(len(shuffles))) + return shuffles + + def couple(self, other_GFF, this_y=0, other_y=1): + """ + Compares this set of features to another set and generates tuples of + (x,y) coordinate pairs to input into lsi + """ + other_features = other_GFF.features + coordinates = [] + for thisname in self.features['name']: + othermatch = other_features.loc[other_features['name'] == thisname, 'center'] + if len(othermatch) == 1: + this_x = float(self.features.loc[self.features['name'] + == thisname, 'center']) # /self.seqlen + other_x = float(othermatch) # /other_GFF.seqlen + # lsi can't handle vertical or horizontal lines, and we don't + # need them either for our comparison. Don't add if equivalent. + if this_x != other_x: + these_coords = ((this_x, this_y), (other_x, other_y)) + coordinates.append(these_coords) + return coordinates + + def _check_triplets(self): + """This method verifies that all entries of featType gene and CDS are + divisible by three""" + genesCDSs = self.features.query("featType == 'CDS' or featType == 'gene'") + not_trips = genesCDSs.loc[((abs(genesCDSs['stop'] - genesCDSs['start']) + 1) % 3) > 0, ] + if len(not_trips) > 0: + print_string = "" + print_string += "There are CDS and gene entries that are not divisible by three\n" + print_string += str(not_trips) + warnings.warn(print_string, SyntaxWarning) + + def _get_name(self, tag_value): + """This extracts a name from a single row in 'tags' of the pandas + dataframe + """ + try: + if ";" in tag_value: + name = tag_value[5:].split(';')[0] + else: + name = tag_value[5:].split()[0] + except: + name = tag_value + print("Couldn't correctly parse {}".format( + tag_value)) + return name + + def _determine_lmost(self, row): + """Booleans don't work well for pandas dataframes, so I need to use apply + """ + if row['start'] < row['stop']: + return row['start'] + else: + return row['stop'] + + def _determine_rmost(self, row): + """Booleans don't work well for pandas dataframes, so I need to use apply + """ + if row['start'] < row['stop']: + return row['stop'] + else: + return row['start'] + + +def parse_fastq_length_meanqual(fastq): + """ + arguments: + <fastq> the fastq file path. Hopefully it has been verified to exist already + + purpose: + This function parses a fastq and returns a pandas dataframe of read lengths + and read meanQuals. + """ + # First try to open the file with the gzip package. It will crash + # if the file is not gzipped, so this is an easy way to test if + # the fastq file is gzipped or not. + try: + handle = gzip.open(fastq, "rt") + length, meanQual = _fastq_parse_helper(handle) + except: + handle = open(fastq, "r") + length, meanQual = _fastq_parse_helper(handle) + + handle.close() + df = pd.DataFrame(list(zip(length, meanQual)), columns=['length', 'meanQual']) + return df + + +def filter_fastq_length_meanqual(df, min_len, max_len, + min_mqual, max_mqual): + querystring = "length >= {0} and meanQual >= {1}".format(min_len, min_mqual) + if max_len != None: + querystring += " and length <= {}".format(max_len) + if max_mqual != None: + querystring += " and meanQual <= {}".format(max_mqual) + print("Keeping reads that satisfy: {}".format(querystring), file=stderr) + filtdf = df.query(querystring) + #filtdf["length"] = pd.to_numeric(filtdf["length"], errors='coerce') + #filtdf["meanQual"] = pd.to_numeric(filtdf["meanQual"], errors='coerce') + return filtdf + + +def _fastq_parse_helper(handle): + length = [] + meanQual = [] + for record in SeqIO.parse(handle, "fastq"): + if len(record) > 0: + meanQual.append(_arithmetic_mean(record.letter_annotations["phred_quality"])) + length.append(len(record)) + return length, meanQual + + +def _geometric_mean(phred_values): + """in case I want geometric mean in the future, can calculate it like this""" + # np.mean(record.letter_annotations["phred_quality"])) + pass + + +def _arithmetic_mean(phred_values): + """ + Convert Phred to 1-accuracy (error probabilities), calculate the arithmetic mean, + log transform back to Phred. + """ + if not isinstance(phred_values, np.ndarray): + phred_values = np.array(phred_values) + return _erate_to_phred(np.mean(_phred_to_erate(phred_values))) + + +def _phred_to_erate(phred_values): + """ + converts a list or numpy array of phred values to a numpy array + of error rates + """ + if not isinstance(phred_values, np.ndarray): + phred_values = np.array(phred_values) + return np.power(10, (-1 * (phred_values / 10))) + + +def _erate_to_phred(erate_values): + """ + converts a list or numpy array of error rates to a numpy array + of phred values + """ + if not isinstance(erate_values, np.ndarray): + phred_values = np.array(erate_values) + return -10 * np.log10(erate_values) + +def timestamp(): + """ + Returns the current time in :samp:`YYYYMMDD_HHMMSS` format. + """ + return time.strftime("%Y%m%d_%H%M%S") diff --git a/pauvre/gfftools.py b/pauvre/gfftools.py new file mode 100644 index 0000000..09b59bc --- /dev/null +++ b/pauvre/gfftools.py @@ -0,0 +1,581 @@ +#!/usr/bin/env python +# -*- coding: utf-8 -*- + +# pauvre - a pore plotting package +# Copyright (c) 2016-2018 Darrin T. Schultz. All rights reserved. +# +# This file is part of pauvre. +# +# pauvre is free software: you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation, either version 3 of the License, or +# (at your option) any later version. +# +# pauvre is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with pauvre. If not, see <http://www.gnu.org/licenses/>. + +"""This file contains things related to parsing and plotting GFF files""" + +import copy +from matplotlib.path import Path +import matplotlib.patches as patches + +global chevron_width +global arrow_width +global min_text +global text_cutoff + +arrow_width = 80 +chevron_width = 40 +min_text = 550 +text_cutoff = 150 +import sys + +global colorMap +colorMap = {'gene': 'green', 'CDS': 'green', 'tRNA':'pink', 'rRNA':'red', + 'misc_feature':'purple', 'rep_origin':'orange', 'spacebar':'white', + 'ORF':'orange'} + +def _plot_left_to_right_introns(panel, geneid, db, y_pos, text = None): + """ plots a left to right patch with introns when there are no intervening + sequences to consider. Uses a gene id and gffutils database as input. + b + a .-=^=-. c + 1__________2---/ e `---1__________2 + | #lff \f d| #lff \ + | left to \3 | left to \3 + | right / | right / + 5___________/4 5___________/4 + """ + #first we need to determine the number of exons + bar_thickness = 0.75 + #now we can start plotting the exons + exonlist = list(db.children(geneid, featuretype='CDS', order_by="start")) + for i in range(len(exonlist)): + cds_start = exonlist[i].start + cds_stop = exonlist[i].stop + verts = [(cds_start, y_pos + bar_thickness), #1 + (cds_stop - chevron_width, y_pos + bar_thickness), #2 + (cds_stop, y_pos + (bar_thickness/2)), #3 + (cds_stop - chevron_width, y_pos), #4 + (cds_start, y_pos), #5 + (cds_start, y_pos + bar_thickness), #1 + ] + codes = [Path.MOVETO, + Path.LINETO, + Path.LINETO, + Path.LINETO, + Path.LINETO, + Path.CLOSEPOLY, + ] + path = Path(verts, codes) + patch = patches.PathPatch(path, lw = 0, + fc=colorMap['CDS'] ) + panel.add_patch(patch) + + # we must draw the splice junction + if i < len(exonlist) - 1: + next_start = exonlist[i+1].start + next_stop = exonlist[i+1].stop + middle = cds_stop + ((next_start - cds_stop)/2) + + verts = [(cds_stop - chevron_width, y_pos + bar_thickness), #2/a + (middle, y_pos + 0.95), #b + (next_start, y_pos + bar_thickness), #c + (next_start, y_pos + bar_thickness - 0.05), #d + (middle, y_pos + 0.95 - 0.05), #e + (cds_stop - chevron_width, y_pos + bar_thickness -0.05), #f + (cds_stop - chevron_width, y_pos + bar_thickness), #2/a + ] + codes = [Path.MOVETO, + Path.LINETO, + Path.LINETO, + Path.LINETO, + Path.LINETO, + Path.LINETO, + Path.CLOSEPOLY, + ] + path = Path(verts, codes) + patch = patches.PathPatch(path, lw = 0, + fc=colorMap['CDS'] ) + panel.add_patch(patch) + + return panel + +def _plot_left_to_right_introns_top(panel, geneid, db, y_pos, text = None): + """ slightly different from the above version such thatsplice junctions + are more visually explicit. + + plots a left to right patch with introns when there are no intervening + sequences to consider. Uses a gene id and gffutils database as input. + b + a .-=^=-. c + 1_____________2---/ e `---1_____________2 + | #lff /f d| #lff / + | left to / | left to / + | right / | right / + 4_________/3 4_________/3 + """ + #first we need to determine the number of exons + bar_thickness = 0.75 + #now we can start plotting the exons + exonlist = list(db.children(geneid, featuretype='CDS', order_by="start")) + for i in range(len(exonlist)): + cds_start = exonlist[i].start + cds_stop = exonlist[i].stop + verts = [(cds_start, y_pos + bar_thickness), #1 + (cds_stop, y_pos + bar_thickness), #2 + (cds_stop - chevron_width, y_pos), #4 + (cds_start, y_pos), #5 + (cds_start, y_pos + bar_thickness), #1 + ] + codes = [Path.MOVETO, + Path.LINETO, + Path.LINETO, + Path.LINETO, + Path.CLOSEPOLY, + ] + path = Path(verts, codes) + patch = patches.PathPatch(path, lw = 0, + fc=colorMap['CDS'] ) + panel.add_patch(patch) + + # we must draw the splice junction + if i < len(exonlist) - 1: + next_start = exonlist[i+1].start + next_stop = exonlist[i+1].stop + middle = cds_stop + ((next_start - cds_stop)/2) + + verts = [(cds_stop-5, y_pos + bar_thickness), #2/a + (middle, y_pos + 0.95), #b + (next_start, y_pos + bar_thickness), #c + (next_start, y_pos + bar_thickness - 0.05), #d + (middle, y_pos + 0.95 - 0.05), #e + (cds_stop-5, y_pos + bar_thickness -0.05), #f + (cds_stop-5, y_pos + bar_thickness), #2/a + ] + codes = [Path.MOVETO, + Path.LINETO, + Path.LINETO, + Path.LINETO, + Path.LINETO, + Path.LINETO, + Path.CLOSEPOLY, + ] + path = Path(verts, codes) + patch = patches.PathPatch(path, lw = 0, + fc=colorMap['CDS'] ) + panel.add_patch(patch) + + return panel + +def _plot_lff(panel, left_df, right_df, colorMap, y_pos, bar_thickness, text): + """ plots a lff patch + 1__________2 ____________ + | #lff \ \ #rff \ + | left for \3 \ right for \ + | forward / / forward / + 5___________/4 /___________/ + """ + #if there is only one feature to plot, then just plot it + + print("plotting lff") + verts = [(left_df['start'], y_pos + bar_thickness), #1 + (right_df['start'] - chevron_width, y_pos + bar_thickness), #2 + (left_df['stop'], y_pos + (bar_thickness/2)), #3 + (right_df['start'] - chevron_width, y_pos), #4 + (left_df['start'], y_pos), #5 + (left_df['start'], y_pos + bar_thickness), #1 + ] + codes = [Path.MOVETO, + Path.LINETO, + Path.LINETO, + Path.LINETO, + Path.LINETO, + Path.CLOSEPOLY, + ] + path = Path(verts, codes) + patch = patches.PathPatch(path, lw = 0, + fc=colorMap[left_df['featType']] ) + text_width = left_df['width'] + if text and text_width >= min_text: + panel = _plot_label(panel, left_df, y_pos, bar_thickness) + elif text and text_width < min_text and text_width >= text_cutoff: + panel = _plot_label(panel, left_df, + y_pos, bar_thickness, + rotate = True, arrow = True) + + return panel, patch + +def _plot_label(panel, df, y_pos, bar_thickness, rotate = False, arrow = False): + # handles the case where a dataframe was passed + fontsize = 8 + rotation = 0 + if rotate: + fontsize = 5 + rotation = 90 + if len(df) == 1: + x =((df.loc[0, 'stop'] - df.loc[0, 'start'])/2) + df.loc[0, 'start'] + y = y_pos + (bar_thickness/2) + # if we need to center somewhere other than the arrow, need to adjust + # for the direction of the arrow + # it doesn't look good if it shifts by the whole arrow width, so only + # shift by half the arrow width + if arrow: + if df.loc[0, 'strand'] == "+": + shift_start = df.loc[0, 'start'] + else: + shift_start = df.loc[0, 'start'] + (arrow_width/2) + x =((df.loc[0, 'stop'] - (arrow_width/2) - df.loc[0, 'start'])/2) + shift_start + panel.text(x, y, + df.loc[0, 'name'], fontsize = fontsize, + ha='center', va='center', + color = 'white', family = 'monospace', + zorder = 100, rotation = rotation) + # and the case where a series was passed + else: + x = ((df['stop'] - df['start'])/2) + df['start'] + y = y_pos + (bar_thickness/2) + if arrow: + if df['strand'] == "+": + shift_start = df['start'] + else: + shift_start = df['start'] + (arrow_width/2) + x =((df['stop'] - (arrow_width/2) - df['start'])/2) + shift_start + panel.text(x, y, + df['name'], fontsize = fontsize, + ha='center', va='center', + color = 'white', family = 'monospace', + zorder = 100, rotation = rotation) + + return panel + +def _plot_rff(panel, left_df, right_df, colorMap, y_pos, bar_thickness, text): + """ plots a rff patch + ____________ 1__________2 + | #lff \ \ #rff \ + | left for \ 6\ right for \3 + | forward / / forward / + |___________/ /5__________/4 + """ + #if there is only one feature to plot, then just plot it + + print("plotting rff") + verts = [(right_df['start'], y_pos + bar_thickness), #1 + (right_df['stop'] - arrow_width, y_pos + bar_thickness), #2 + (right_df['stop'], y_pos + (bar_thickness/2)), #3 + (right_df['stop'] - arrow_width, y_pos), #4 + (right_df['start'], y_pos), #5 + (left_df['stop'] + chevron_width, y_pos + (bar_thickness/2)), #6 + (right_df['start'], y_pos + bar_thickness), #1 + ] + codes = [Path.MOVETO, + Path.LINETO, + Path.LINETO, + Path.LINETO, + Path.LINETO, + Path.LINETO, + Path.CLOSEPOLY, + ] + path = Path(verts, codes) + patch = patches.PathPatch(path, lw = 0, + fc=colorMap[right_df['featType']] ) + text_width = right_df['width'] + if text and text_width >= min_text: + panel = _plot_label(panel, right_df, y_pos, bar_thickness) + elif text and text_width < min_text and text_width >= text_cutoff: + panel = _plot_label(panel, right_df, + y_pos, bar_thickness, rotate = True) + return panel, patch + +def x_offset_gff(GFFParseobj, x_offset): + """Takes in a gff object (a gff file parsed as a pandas dataframe), + and an x_offset value and shifts the start, stop, center, lmost, and rmost. + + Returns a GFFParse object with the shifted values in GFFParse.features. + """ + for columnname in ['start', 'stop', 'center', 'lmost', 'rmost']: + GFFParseobj.features[columnname] = GFFParseobj.features[columnname] + x_offset + return GFFParseobj + +def gffplot_horizontal(figure, panel, args, gff_object, + track_width=0.2, start_y=0.1, **kwargs): + """ + this plots horizontal things from gff files. it was probably written for synplot, + as the browser does not use this at all. + """ + # Because this size should be relative to the circle that it is plotted next + # to, define the start_radius as the place to work from, and the width of + # each track. + colorMap = {'gene': 'green', 'CDS': 'green', 'tRNA':'pink', 'rRNA':'red', + 'misc_feature':'purple', 'rep_origin':'orange', 'spacebar':'white'} + augment = 0 + bar_thickness = 0.9 * track_width + # return these at the end + myPatches=[] + plot_order = [] + + idone = False + # we need to filter out the tRNAs since those are plotted last + plottable_features = gff_object.features.query("featType != 'tRNA' and featType != 'region' and featType != 'source'") + plottable_features.reset_index(inplace=True, drop=True) + print(plottable_features) + + len_plottable = len(plottable_features) + print('len plottable', len_plottable) + # - this for loop relies on the gff features to already be sorted + # - The algorithm for this loop works by starting at the 0th index of the + # plottable features (i). + # - It then looks to see if the next object (the jth) overlaps with the + # ith element. + i = 0 + j = 1 + while i < len(plottable_features): + if i + j == len(plottable_features): + #we have run off of the df and need to include everything from i to the end + these_features = plottable_features.loc[i::,].copy(deep=True) + these_features = these_features.reset_index() + print(these_features) + plot_order.append(these_features) + i = len(plottable_features) + break + print(" - i,j are currently: {},{}".format(i, j)) + stop = plottable_features.loc[i]["stop"] + start = plottable_features.loc[i+j]["start"] + print("stop: {}. start: {}.".format(stop, start)) + if plottable_features.loc[i]["stop"] <= plottable_features.loc[i+j]["start"]: + print(" - putting elements {} through (including) {} together".format(i, i+j)) + these_features = plottable_features.loc[i:i+j-1,].copy(deep=True) + these_features = these_features.reset_index() + print(these_features) + plot_order.append(these_features) + i += 1 + j = 1 + else: + j += 1 + + #while idone == False: + # print("im in the overlap-pairing while loop i={}".format(i)) + # # look ahead at all of the elements that overlap with the ith element + # jdone = False + # j = 1 + # this_set_minimum_index = i + # this_set_maximum_index = i + # while jdone == False: + # print("new i= {} j={} len={}".format(i, j, len_plottable)) + # print("len plottable in jdone: {}".format(len_plottable)) + # print("plottable features in jdone:\n {}".format(plottable_features)) + # # first make sure that we haven't gone off the end of the dataframe + # # This is an edge case where i has a jth element that overlaps with it, + # # and j is the last element in the plottable features. + # if i+j == len_plottable: + # print("i+j == len_plottable") + # # this checks for the case that i is the last element of the + # # plottable features. + # # In both of the above cases, we are done with both the ith and + # # the jth features. + # if i == len_plottable-1: + # print("i == len_plottable-1") + + # # this is the last analysis, so set idone to true + # # to finish after this + # idone = True + # # the last one can't be in its own group, so just add it solo + # these_features = plottable_features.loc[this_set_minimum_index:this_set_maximum_index,].copy(deep=True) + # plot_order.append(these_features.reset_index(drop=True)) + # break + # jdone = True + # else: + # print("i+j != len_plottable") + # # if the lmost of the next gene overlaps with the rmost of + # # the current one, it overlaps and couple together + # if plottable_features.loc[i+j, 'lmost'] < plottable_features.loc[i, 'rmost']: + # print("lmost < rmost") + # # note that this feature overlaps with the current + # this_set_maximum_index = i+j + # # ... and we need to look at the next in line + # j += 1 + # else: + # print("lmost !< rmost") + # i += 1 + (this_set_maximum_index - this_set_minimum_index) + # #add all of the things that grouped together once we don't find any more groups + # these_features = plottable_features.loc[this_set_minimum_index:this_set_maximum_index,].copy(deep=True) + # plot_order.append(these_features.reset_index(drop=True)) + # jdone = True + # print("plot order is now: {}".format(plot_order)) + # print("jdone: {}".format(str(jdone))) + + for feature_set in plot_order: + # plot_feature_hori handles overlapping cases as well as normal cases + panel, patches = gffplot_feature_hori(figure, panel, feature_set, colorMap, + start_y, bar_thickness, text = True) + for each in patches: + print("there are {} patches after gffplot_feature_hori".format(len(patches))) + print(each) + myPatches.append(each) + print("length of myPatches is: {}".format(len(myPatches))) + + # Now we add all of the tRNAs to this to plot, do it last to overlay + # everything else + tRNAs = gff_object.features.query("featType == 'tRNA'") + tRNAs.reset_index(inplace=True, drop = True) + tRNA_bar_thickness = bar_thickness * (0.8) + tRNA_start_y = start_y + ((bar_thickness - tRNA_bar_thickness)/2) + for i in range(0,len(tRNAs)): + this_feature = tRNAs[i:i+1].copy(deep=True) + this_feature.reset_index(inplace=True, drop = True) + panel, patches = gffplot_feature_hori(figure, panel, this_feature, colorMap, + tRNA_start_y, tRNA_bar_thickness, text = True) + for patch in patches: + myPatches.append(patch) + print("There are {} patches at the end of gffplot_horizontal()".format(len(myPatches))) + return panel, myPatches + +def gffplot_feature_hori(figure, panel, feature_df, + colorMap, y_pos, bar_thickness, text=True): + """This plots the track for a feature, and if there is something for + 'this_feature_overlaps_feature', then there is special processing to + add the white bar and the extra slope for the chevron + """ + myPatches = [] + #if there is only one feature to plot, then just plot it + if len(feature_df) == 1: + #print("plotting a single thing: {} {}".format(str(feature_df['sequence']).split()[1], + # str(feature_df['featType']).split()[1] )) + #print(this_feature['name'], "is not overlapping") + # This plots this shape: 1_________2 2_________1 + # | forward \3 3/ reverse | + # |5__________/4 \4________5| + if feature_df.loc[0,'strand'] == '+': + verts = [(feature_df.loc[0, 'start'], y_pos + bar_thickness), #1 + (feature_df.loc[0, 'stop'] - arrow_width, y_pos + bar_thickness), #2 + (feature_df.loc[0, 'stop'], y_pos + (bar_thickness/2)), #3 + (feature_df.loc[0, 'stop'] - arrow_width, y_pos), #4 + (feature_df.loc[0, 'start'], y_pos), #5 + (feature_df.loc[0, 'start'], y_pos + bar_thickness)] #1 + elif feature_df.loc[0,'strand'] == '-': + verts = [(feature_df.loc[0, 'stop'], y_pos + bar_thickness), #1 + (feature_df.loc[0, 'start'] + arrow_width, y_pos + bar_thickness), #2 + (feature_df.loc[0, 'start'], y_pos + (bar_thickness/2)), #3 + (feature_df.loc[0, 'start'] + arrow_width, y_pos), #4 + (feature_df.loc[0, 'stop'], y_pos), #5 + (feature_df.loc[0, 'stop'], y_pos + bar_thickness)] #1 + feat_width = feature_df.loc[0,'width'] + if text and feat_width >= min_text: + panel = _plot_label(panel, feature_df.loc[0,], + y_pos, bar_thickness) + elif text and feat_width < min_text and feat_width >= text_cutoff: + panel = _plot_label(panel, feature_df.loc[0,], + y_pos, bar_thickness, + rotate = True, arrow = True) + + codes = [Path.MOVETO, + Path.LINETO, + Path.LINETO, + Path.LINETO, + Path.LINETO, + Path.CLOSEPOLY] + path = Path(verts, codes) + print("normal path is: {}".format(path)) + # If the feature itself is smaller than the arrow, we need to take special measures to + if feature_df.loc[0,'width'] <= arrow_width: + path = Path([verts[i] for i in [0,2,4,5]], + [codes[i] for i in [0,2,4,5]]) + patch = patches.PathPatch(path, lw = 0, + fc=colorMap[feature_df.loc[0, 'featType']] ) + myPatches.append(patch) + # there are four possible scenarios if there are two overlapping sequences: + # ___________ ____________ ____________ ___________ + # | #1 \ \ #1 \ / #2 / / #2 | + # | both seqs \ \ both seqs \ / both seqs / / both seqs | + # | forward / / forward / \ reverse \ \ reverse | + # |__________/ /___________/ \___________\ \___________| + # ___________ _____________ ____________ _ _________ + # | #3 \ \ #3 | / #2 _| #2 \ + # | one seq \ \ one seq | / one seq |_ one seq \ + # | forward \ \ reverse | \ reverse _| forward / + # |_____________\ \_________| \__________|_ ___________/ + # + # These different scenarios can be thought of as different left/right + # flanking segment types. + # In the annotation #rff: + # - 'r' refers to the annotation type as being on the right + # - the first 'f' refers to the what element is to the left of this one. + # Since it is forward the 5' end of this annotation must be a chevron + # - the second 'f' refers to the right side of this element. Since it is + # forward it must be a normal arrow. + # being on the right + # + # *LEFT TYPES* *RIGHT TYPES* + # ____________ ____________ + # | #lff \ \ #rff \ + # | left for \ \ right for \ + # | forward / / forward / + # |___________/ /___________/ + # ___________ _____________ + # | #lfr \ \ #rfr | + # | left for \ \ right for | + # | reverse \ \ reverse | + # |_____________\ \_________| + # ____________ ___________ + # / #lrr / / #rrr | + # / left rev / / right rev | + # \ reverse \ \ reverse | + # \___________\ \___________| + # ____________ __________ + # / #lrf _| _| #rrf \ + # / left rev |_ | _ right rev \ + # \ forward _| _| forward / + # \__________| |____________/ + # + # To properly plot these elements, we must go through each element of the + # feature_df to determine which patch type it is. + elif len(feature_df) == 2: + print("im in here feat len=2") + for i in range(len(feature_df)): + # this tests for which left type we're dealing with + if i == 0: + # type could be lff or lfr + if feature_df.loc[i, 'strand'] == '+': + if feature_df.loc[i + 1, 'strand'] == '+': + # plot a lff type + panel, patch = _plot_lff(panel, feature_df.iloc[i,], feature_df.iloc[i+1,], + colorMap, y_pos, bar_thickness, text) + myPatches.append(patch) + elif feature_df.loc[i + 1, 'strand'] == '-': + #plot a lfr type + raise IOError("can't plot {} patches yet".format("lfr")) + # or type could be lrr or lrf + elif feature_df.loc[i, 'strand'] == '-': + if feature_df.loc[i + 1, 'strand'] == '+': + # plot a lrf type + raise IOError("can't plot {} patches yet".format("lrf")) + elif feature_df.loc[i + 1, 'strand'] == '-': + #plot a lrr type + raise IOError("can't plot {} patches yet".format("lrr")) + # in this case we're only dealing with 'right type' patches + elif i == len(feature_df) - 1: + # type could be rff or rfr + if feature_df.loc[i-1, 'strand'] == '+': + if feature_df.loc[i, 'strand'] == '+': + # plot a rff type + panel, patch = _plot_rff(panel, feature_df.iloc[i-1,], feature_df.iloc[i,], + colorMap, y_pos, bar_thickness, text) + myPatches.append(patch) + elif feature_df.loc[i, 'strand'] == '-': + #plot a rfr type + raise IOError("can't plot {} patches yet".format("rfr")) + # or type could be rrr or rrf + elif feature_df.loc[i-1, 'strand'] == '-': + if feature_df.loc[i, 'strand'] == '+': + # plot a rrf type + raise IOError("can't plot {} patches yet".format("rrf")) + elif feature_df.loc[i, 'strand'] == '-': + #plot a rrr type + raise IOError("can't plot {} patches yet".format("rrr")) + return panel, myPatches diff --git a/pauvre/lsi/Q.py b/pauvre/lsi/Q.py new file mode 100755 index 0000000..96563e4 --- /dev/null +++ b/pauvre/lsi/Q.py @@ -0,0 +1,77 @@ +# Binary search tree that holds status of sweep line. Only leaves hold values. +# Operations for finding left and right neighbors of a query point p and finding which segments contain p. +# Author: Sam Lichtenberg +# Email: splichte@princeton.edu +# Date: 09/02/2013 + +from pauvre.lsi.helper import * + +ev = 0.00000001 + +class Q: + def __init__(self, key, value): + self.key = key + self.value = value + self.left = None + self.right = None + + def find(self, key): + if self.key is None: + return False + c = compare_by_y(key, self.key) + if c==0: + return True + elif c==-1: + if self.left: + self.left.find(key) + else: + return False + else: + if self.right: + self.right.find(key) + else: + return False + def insert(self, key, value): + if self.key is None: + self.key = key + self.value = value + c = compare_by_y(key, self.key) + if c==0: + self.value += value + elif c==-1: + if self.left is None: + self.left = Q(key, value) + else: + self.left.insert(key, value) + else: + if self.right is None: + self.right = Q(key, value) + else: + self.right.insert(key, value) + # must return key AND value + def get_and_del_min(self, parent=None): + if self.left is not None: + return self.left.get_and_del_min(self) + else: + k = self.key + v = self.value + if parent: + parent.left = self.right + # i.e. is root node + else: + if self.right: + self.key = self.right.key + self.value = self.right.value + self.left = self.right.left + self.right = self.right.right + else: + self.key = None + return k,v + + def print_tree(self): + if self.left: + self.left.print_tree() + print(self.key) + print(self.value) + if self.right: + self.right.print_tree() diff --git a/pauvre/lsi/T.py b/pauvre/lsi/T.py new file mode 100755 index 0000000..06554d7 --- /dev/null +++ b/pauvre/lsi/T.py @@ -0,0 +1,322 @@ +# Binary search tree that holds status of sweep line. Only leaves hold values. +# Operations for finding left and right neighbors of a query point p and finding which segments contain p. +# Author: Sam Lichtenberg +# Email: splichte@princeton.edu +# Date: 09/02/2013 + +from pauvre.lsi.helper import * + +ev = 0.00000001 + +class T: + def __init__(self): + self.root = Node(None, None, None, None) + def contain_p(self, p): + if self.root.value is None: + return [[], []] + lists = [[], []] + self.root.contain_p(p, lists) + return (lists[0], lists[1]) + def get_left_neighbor(self, p): + if self.root.value is None: + return None + return self.root.get_left_neighbor(p) + def get_right_neighbor(self, p): + if self.root.value is None: + return None + return self.root.get_right_neighbor(p) + def insert(self, key, s): + if self.root.value is None: + self.root.left = Node(s, None, None, self.root) + self.root.value = s + self.root.m = get_slope(s) + else: + (node, path) = self.root.find_insert_pt(key, s) + if path == 'r': + node.right = Node(s, None, None, node) + node.right.adjust() + elif path == 'l': + node.left = Node(s, None, None, node) + else: + # this means matching Node was a leaf + # need to make a new internal Node + if node.compare_to_key(key) < 0 or (node.compare_to_key(key)==0 and node.compare_lower(key, s) < 1): + new_internal = Node(s, None, node, node.parent) + new_leaf = Node(s, None, None, new_internal) + new_internal.left = new_leaf + if node is node.parent.left: + node.parent.left = new_internal + node.adjust() + else: + node.parent.right = new_internal + else: + new_internal = Node(node.value, node, None, node.parent) + new_leaf = Node(s, None, None, new_internal) + new_internal.right = new_leaf + if node is node.parent.left: + node.parent.left = new_internal + new_leaf.adjust() + else: + node.parent.right = new_internal + node.parent = new_internal + + def delete(self, p, s): + key = p + node = self.root.find_delete_pt(key, s) + val = node.value + if node is node.parent.left: + parent = node.parent.parent + if parent is None: + if self.root.right is not None: + if self.root.right.left or self.root.right.right: + self.root = self.root.right + self.root.parent = None + else: + self.root.left = self.root.right + self.root.value = self.root.right.value + self.root.m = self.root.right.m + self.root.right = None + else: + self.root.left = None + self.root.value = None + elif node.parent is parent.left: + parent.left = node.parent.right + node.parent.right.parent = parent + else: + parent.right = node.parent.right + node.parent.right.parent = parent + else: + parent = node.parent.parent + if parent is None: + if self.root.left: + # switch properties + if self.root.left.right or self.root.left.left: + self.root = self.root.left + self.root.parent = None + else: + self.root.right = None + else: + self.root.right = None + self.root.value = None + elif node.parent is parent.left: + parent.left = node.parent.left + node.parent.left.parent = parent + farright = node.parent.left + while farright.right is not None: + farright = farright.right + farright.adjust() + else: + parent.right = node.parent.left + node.parent.left.parent = parent + farright = node.parent.left + while farright.right is not None: + farright = farright.right + farright.adjust() + return val + + def print_tree(self): + self.root.print_tree() +class Node: + def __init__(self, value, left, right, parent): + self.value = value # associated line segment + self.left = left + self.right = right + self.parent = parent + self.m = None + if value is not None: + self.m = get_slope(value) + + # compares line segment at y-val of p to p + # TODO: remove this and replace with get_x_at + def compare_to_key(self, p): + x0 = self.value[0][0] + y0 = self.value[0][1] + y1 = p[1] + if self.m != 0 and self.m is not None: + x1 = x0 - float(y0-y1)/self.m + return compare_by_x(p, (x1, y1)) + else: + x1 = p[0] + return 0 + + def get_left_neighbor(self, p): + neighbor = None + n = self + if n.left is None and n.right is None: + return neighbor + last_right = None + found = False + while not found: + c = n.compare_to_key(p) + if c < 1 and n.left: + n = n.left + elif c==1 and n.right: + n = n.right + last_right = n.parent + else: + found = True + c = n.compare_to_key(p) + if c==0: + if n is n.parent.right: + return n.parent + else: + goright = None + if last_right: + goright =last_right.left + return self.get_lr(None, goright)[0] + # n stores the highest-value in the left subtree + if c==-1: + goright = None + if last_right: + goright = last_right.left + return self.get_lr(None, goright)[0] + if c==1: + neighbor = n + return neighbor + + def get_right_neighbor(self, p): + neighbor = None + n = self + if n.left is None and n.right is None: + return neighbor + last_left = None + found = False + while not found: + c = n.compare_to_key(p) + if c==0 and n.right: + n = n.right + elif c < 0 and n.left: + n = n.left + last_left = n.parent + elif c==1 and n.right: + n = n.right + else: + found = True + c = n.compare_to_key(p) + # can be c==0 and n.left if at root node + if c==0: + if n.parent is None: + return None + if n is n.parent.right: + goleft = None + if last_left: + goleft = last_left.right + return self.get_lr(goleft, None)[1] + else: + return self.get_lr(n.parent.right, None)[1] + if c==1: + goleft = None + if last_left: + goleft = last_left.right + return self.get_lr(goleft, None)[1] + if c==-1: + return n + return neighbor + + # travels down a single direction to get neighbors + def get_lr(self, left, right): + lr = [None, None] + if left: + while left.left: + left = left.left + lr[1] = left + if right: + while right.right: + right = right.right + lr[0] = right + return lr + + def contain_p(self, p, lists): + c = self.compare_to_key(p) + if c==0: + if self.left is None and self.right is None: + if compare_by_x(p, self.value[1])==0: + lists[1].append(self.value) + else: + lists[0].append(self.value) + if self.left: + self.left.contain_p(p, lists) + if self.right: + self.right.contain_p(p, lists) + elif c < 0: + if self.left: + self.left.contain_p(p, lists) + else: + if self.right: + self.right.contain_p(p, lists) + + def find_insert_pt(self, key, seg): + if self.left and self.right: + if self.compare_to_key(key) == 0 and self.compare_lower(key, seg)==1: + return self.right.find_insert_pt(key, seg) + elif self.compare_to_key(key) < 1: + return self.left.find_insert_pt(key, seg) + else: + return self.right.find_insert_pt(key, seg) + # this case only happens at root + elif self.left: + if self.compare_to_key(key) == 0 and self.compare_lower(key, seg)==1: + return (self, 'r') + elif self.compare_to_key(key) < 1: + return self.left.find_insert_pt(key, seg) + else: + return (self, 'r') + else: + return (self, 'n') + + # adjusts stored segments in inner nodes + def adjust(self): + value = self.value + m = self.m + parent = self.parent + node = self + # go up left as much as possible + while parent and node is parent.right: + node = parent + parent = node.parent + # parent to adjust will be on the immediate right + if parent and node is parent.left: + parent.value = value + parent.m = m + + def compare_lower(self, p, s2): + y = p[1] - 10 + key = get_x_at(s2, (p[0], y)) + return self.compare_to_key(key) + + # returns matching leaf node, or None if no match + # when deleting, you don't delete below--you delete above! so compare lower = -1. + def find_delete_pt(self, key, value): + if self.left and self.right: + # if equal at this pt, and this node's value is less than the seg's slightly above this pt + if self.compare_to_key(key) == 0 and self.compare_lower(key, value)==-1: + return self.right.find_delete_pt(key, value) + if self.compare_to_key(key) < 1: + return self.left.find_delete_pt(key, value) + else: + return self.right.find_delete_pt(key, value) + elif self.left: + if self.compare_to_key(key) < 1: + return self.left.find_delete_pt(key, value) + else: + return None + # is leaf + else: + if self.compare_to_key(key)==0 and segs_equal(self.value, value): + return self + else: + return None + + # also prints depth of each node + def print_tree(self, l=0): + l += 1 + if self.left: + self.left.print_tree(l) + if self.left or self.right: + print('INTERNAL: {0}'.format(l)) + else: + print('LEAF: {0}'.format(l)) + print(self) + print(self.value) + if self.right: + self.right.print_tree(l) diff --git a/pauvre/lsi/__init__.py b/pauvre/lsi/__init__.py new file mode 100644 index 0000000..e69de29 --- /dev/null +++ b/pauvre/lsi/__init__.py diff --git a/pauvre/lsi/helper.py b/pauvre/lsi/helper.py new file mode 100755 index 0000000..dac30ec --- /dev/null +++ b/pauvre/lsi/helper.py @@ -0,0 +1,94 @@ +# Helper functions for use in the lsi implementation. + +ev = 0.0000001 +# floating-point comparison +def approx_equal(a, b, tol): + return abs(a - b) < tol + +# compares x-values of two pts +# used for ordering in T +def compare_by_x(k1, k2): + if approx_equal(k1[0], k2[0], ev): + return 0 + elif k1[0] < k2[0]: + return -1 + else: + return 1 + +# higher y value is "less"; if y value equal, lower x value is "less" +# used for ordering in Q +def compare_by_y(k1, k2): + if approx_equal(k1[1], k2[1], ev): + if approx_equal(k1[0], k2[0], ev): + return 0 + elif k1[0] < k2[0]: + return -1 + else: + return 1 + elif k1[1] > k2[1]: + return -1 + else: + return 1 + +# tests if s0 and s1 represent the same segment (i.e. pts can be in 2 different orders) +def segs_equal(s0, s1): + x00 = s0[0][0] + y00 = s0[0][1] + x01 = s0[1][0] + y01 = s0[1][1] + x10 = s1[0][0] + y10 = s1[0][1] + x11 = s1[1][0] + y11 = s1[1][1] + if (approx_equal(x00, x10, ev) and approx_equal(y00, y10, ev)): + if (approx_equal(x01, x11, ev) and approx_equal(y01, y11, ev)): + return True + if (approx_equal(x00, x11, ev) and approx_equal(y00, y11, ev)): + if (approx_equal(x01, x10, ev) and approx_equal(y01, y10, ev)): + return True + return False + +# get m for a given seg in (p1, p2) form +def get_slope(s): + x0 = s[0][0] + y0 = s[0][1] + x1 = s[1][0] + y1 = s[1][1] + if (x1-x0)==0: + return None + else: + return float(y1-y0)/(x1-x0) + +# given a point p, return the point on s that shares p's y-val +def get_x_at(s, p): + m = get_slope(s) + # TODO: this should check if p's x-val is octually on seg; we're assuming + # for now that it would have been deleted already if not + if m == 0: # horizontal segment + return p + # ditto; should check if y-val on seg + if m is None: # vertical segment + return (s[0][0], p[1]) + x1 = s[0][0]-(s[0][1]-p[1])/m + return (x1, p[1]) + +# returns the point at which two line segments intersect, or None if no intersection. +def intersect(seg1, seg2): + p = seg1[0] + r = (seg1[1][0]-seg1[0][0], seg1[1][1]-seg1[0][1]) + q = seg2[0] + s = (seg2[1][0]-seg2[0][0], seg2[1][1]-seg2[0][1]) + denom = r[0]*s[1]-r[1]*s[0] + if denom == 0: + return None + numer = float(q[0]-p[0])*s[1]-(q[1]-p[1])*s[0] + t = numer/denom + numer = float(q[0]-p[0])*r[1]-(q[1]-p[1])*r[0] + u = numer/denom + if (t < 0 or t > 1) or (u < 0 or u > 1): + return None + x = p[0]+t*r[0] + y = p[1]+t*r[1] + return (x, y) + + diff --git a/pauvre/lsi/lsi.py b/pauvre/lsi/lsi.py new file mode 100755 index 0000000..cb2f533 --- /dev/null +++ b/pauvre/lsi/lsi.py @@ -0,0 +1,107 @@ +# Implementation of the Bentley-Ottmann algorithm, described in deBerg et al, ch. 2. +# See README for more information. +# Author: Sam Lichtenberg +# Email: splichte@princeton.edu +# Date: 09/02/2013 + +from pauvre.lsi.Q import Q +from pauvre.lsi.T import T +from pauvre.lsi.helper import * + +# "close enough" for floating point +ev = 0.00000001 + +# how much lower to get the x of a segment, to determine which of a set of segments is the farthest right/left +lower_check = 100 + +# gets the point on a segment at a lower y value. +def getNextPoint(p, seg, y_lower): + p1 = seg[0] + p2 = seg[1] + if (p1[0]-p2[0])==0: + return (p[0]+10, p[1]) + slope = float(p1[1]-p2[1])/(p1[0]-p2[0]) + if slope==0: + return (p1[0], p[1]-y_lower) + y = p[1]-y_lower + x = p1[0]-(p1[1]-y)/slope + return (x, y) + +""" +for each event point: + U_p = segments that have p as an upper endpoint + C_p = segments that contain p + L_p = segments that have p as a lower endpoint +""" +def handle_event_point(p, segs, q, t, intersections): + rightmost = (float("-inf"), 0) + rightmost_seg = None + leftmost = (float("inf"), 0) + leftmost_seg = None + + U_p = segs + (C_p, L_p) = t.contain_p(p) + merge_all = U_p+C_p+L_p + if len(merge_all) > 1: + intersections[p] = [] + for s in merge_all: + intersections[p].append(s) + merge_CL = C_p+L_p + merge_UC = U_p+C_p + for s in merge_CL: + # deletes at a point slightly above (to break ties) - where seg is located in tree + # above intersection point + t.delete(p, s) + # put segments into T based on where they are at y-val just below p[1] + for s in merge_UC: + n = getNextPoint(p, s, lower_check) + if n[0] > rightmost[0]: + rightmost = n + rightmost_seg = s + if n[0] < leftmost[0]: + leftmost = n + leftmost_seg = s + t.insert(p, s) + + # means only L_p -> check newly-neighbored segments + if len(merge_UC) == 0: + neighbors = (t.get_left_neighbor(p), t.get_right_neighbor(p)) + if neighbors[0] and neighbors[1]: + find_new_event(neighbors[0].value, neighbors[1].value, p, q) + + # of newly inserted pts, find possible intersections to left and right + else: + left_neighbor = t.get_left_neighbor(p) + if left_neighbor: + find_new_event(left_neighbor.value, leftmost_seg, p, q) + right_neighbor = t.get_right_neighbor(p) + if right_neighbor: + find_new_event(right_neighbor.value, rightmost_seg, p, q) + +def find_new_event(s1, s2, p, q): + i = intersect(s1, s2) + if i: + if compare_by_y(i, p) == 1: + if not q.find(i): + q.insert(i, []) + +# segment is in ((x, y), (x, y)) form +# first pt in a segment should have higher y-val - this is handled in function +def intersection(S): + s0 = S[0] + if s0[1][1] > s0[0][1]: + s0 = (s0[1], s0[0]) + q = Q(s0[0], [s0]) + q.insert(s0[1], []) + intersections = {} + for s in S[1:]: + if s[1][1] > s[0][1]: + s = (s[1], s[0]) + q.insert(s[0], [s]) + q.insert(s[1], []) + t = T() + while q.key: + p, segs = q.get_and_del_min() + handle_event_point(p, segs, q, t, intersections) + return intersections + diff --git a/pauvre/lsi/test.py b/pauvre/lsi/test.py new file mode 100755 index 0000000..8bae308 --- /dev/null +++ b/pauvre/lsi/test.py @@ -0,0 +1,110 @@ +# Test file for lsi. +# Author: Sam Lichtenberg +# Email: splichte@princeton.edu +# Date: 09/02/2013 + +from lsi import intersection +import random +import time, sys +from helper import * + +ev = 0.00000001 + +def scale(i): + return float(i) + +use_file = None +try: + use_file = sys.argv[2] +except: + pass + +if not use_file: + S = [] + for i in range(int(sys.argv[1])): + p1 = (scale(random.randint(0, 1000)), scale(random.randint(0, 1000))) + p2 = (scale(random.randint(0, 1000)), scale(random.randint(0, 1000))) + s = (p1, p2) + S.append(s) + f = open('input', 'w') + f.write(str(S)) + f.close() + +else: + f = open(sys.argv[2], 'r') + S = eval(f.read()) + +intersections = [] +seen = [] +vs = False +hs = False +es = False +now = time.time() +for seg1 in S: + if approx_equal(seg1[0][0], seg1[1][0], ev): + print 'VERTICAL SEG' + print '' + print '' + vs = True + if approx_equal(seg1[0][1], seg1[1][1], ev): + print 'HORIZONTAL SEG' + print '' + print '' + hs = True + for seg2 in S: + if seg1 is not seg2 and segs_equal(seg1, seg2): + print 'EQUAL SEGS' + print '' + print '' + es = True + if seg1 is not seg2 and (seg2, seg1) not in seen: + i = intersect(seg1, seg2) + if i: + intersections.append((i, [seg1, seg2])) + # xpts = [seg1[0][0], seg1[1][0], seg2[0][0], seg2[1][0]] + # xpts = sorted(xpts) + # if (i[0] <= xpts[2] and i[0] >= xpts[1]: + # intersections.append((i, [seg1, seg2])) + seen.append((seg1, seg2)) +later = time.time() +n2time = later-now +print "Line sweep results:" +now = time.time() +lsinters = intersection(S) +inters = [] +for k, v in lsinters.iteritems(): + #print '{0}: {1}'.format(k, v) + inters.append(k) +# inters.append(v) +later = time.time() +print 'TIME ELAPSED: {0}'.format(later-now) +print "N^2 comparison results:" +pts_seen = [] +highestseen = 0 +for i in intersections: + seen_already = False + seen = 0 + for p in pts_seen: + if approx_equal(i[0][0], p[0], ev) and approx_equal(i[0][1], p[1], ev): + seen += 1 + seen_already = True + if seen > highestseen: + highestseen = seen + if not seen_already: + pts_seen.append(i[0]) + in_k = False + for k in inters: + if approx_equal(k[0], i[0][0], ev) and approx_equal(k[1], i[0][1], ev): + in_k = True + if in_k == False: + print 'Not in K: {0}: {1}'.format(i[0], i[1]) +# print i +print highestseen +print 'TIME ELAPSED: {0}'.format(n2time) +#print 'Missing from line sweep but in N^2:' +#for i in seen: +# matched = False +print len(lsinters) +print len(pts_seen) +if len(lsinters) != len(pts_seen): + print 'uh oh!' diff --git a/pauvre/marginplot.py b/pauvre/marginplot.py new file mode 100644 index 0000000..53da1ae --- /dev/null +++ b/pauvre/marginplot.py @@ -0,0 +1,401 @@ +#!/usr/bin/env python +# -*- coding: utf-8 -*- + +# pauvre +# Copyright (c) 2016-2020 Darrin T. Schultz. +# +# This file is part of pauvre. +# +# pauvre is free software: you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation, either version 3 of the License, or +# (at your option) any later version. +# +# pauvre is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with pauvre. If not, see <http://www.gnu.org/licenses/>. + +import ast +import matplotlib +matplotlib.use('Agg') +import matplotlib.pyplot as plt +import matplotlib.patches as mplpatches +from matplotlib.colors import LinearSegmentedColormap +import numpy as np +import pandas as pd +import os.path as opath +from sys import stderr +from pauvre.functions import parse_fastq_length_meanqual, print_images, filter_fastq_length_meanqual +from pauvre.stats import stats +import pauvre.rcparams as rc +import logging + +# logging +logger = logging.getLogger('pauvre') + + +def generate_panel(panel_left, panel_bottom, panel_width, panel_height, + axis_tick_param='both', which_tick_param='both', + bottom_tick_param=True, label_bottom_tick_param=True, + left_tick_param=True, label_left_tick_param=True, + right_tick_param=False, label_right_tick_param=False, + top_tick_param=False, label_top_tick_param=False): + """ + Setting default panel tick parameters. Some of these are the defaults + for matplotlib anyway, but specifying them for readability. Here are + options and defaults for the parameters used below: + + axis : {'x', 'y', 'both'}; which axis to modify; default = 'both' + which : {'major', 'minor', 'both'}; which ticks to modify; + default = 'major' + bottom, top, left, right : bool or {True, False}; ticks on or off; + labelbottom, labeltop, labelleft, labelright : bool or {True, False} + """ + + # create the panel + panel_rectangle = [panel_left, panel_bottom, panel_width, panel_height] + panel = plt.axes(panel_rectangle) + + # Set tick parameters + panel.tick_params(axis=axis_tick_param, + which=which_tick_param, + bottom=bottom_tick_param, + labelbottom=label_bottom_tick_param, + left=left_tick_param, + labelleft=label_left_tick_param, + right=right_tick_param, + labelright=label_right_tick_param, + top=top_tick_param, + labeltop=label_top_tick_param) + + return panel + + +def _generate_histogram_bin_patches(panel, bins, bin_values, horizontal=True): + """This helper method generates the histogram that is added to the panel. + + In this case, horizontal = True applies to the mean quality histogram. + So, horizontal = False only applies to the length histogram. + """ + l_width = 0.0 + f_color = (0.5, 0.5, 0.5) + e_color = (0, 0, 0) + if horizontal: + for step in np.arange(0, len(bin_values), 1): + left = bins[step] + bottom = 0 + width = bins[step + 1] - bins[step] + height = bin_values[step] + hist_rectangle = mplpatches.Rectangle((left, bottom), width, height, + linewidth=l_width, + facecolor=f_color, + edgecolor=e_color) + panel.add_patch(hist_rectangle) + else: + for step in np.arange(0, len(bin_values), 1): + left = 0 + bottom = bins[step] + width = bin_values[step] + height = bins[step + 1] - bins[step] + + hist_rectangle = mplpatches.Rectangle((left, bottom), width, height, + linewidth=l_width, + facecolor=f_color, + edgecolor=e_color) + panel.add_patch(hist_rectangle) + + +def generate_histogram(panel, data_list, max_plot_length, min_plot_length, + bin_interval, hist_horizontal=True, + left_spine=True, bottom_spine=True, + top_spine=False, right_spine=False, x_label=None, + y_label=None): + + bins = np.arange(0, max_plot_length, bin_interval) + + bin_values, bins2 = np.histogram(data_list, bins) + + # hist_horizontal is used for quality + if hist_horizontal: + panel.set_xlim([min_plot_length, max_plot_length]) + panel.set_ylim([0, max(bin_values * 1.1)]) + # and hist_horizontal == Fale is for read length + else: + panel.set_xlim([0, max(bin_values * 1.1)]) + panel.set_ylim([min_plot_length, max_plot_length]) + + # Generate histogram bin patches, depending on whether we're plotting + # vertically or horizontally + _generate_histogram_bin_patches(panel, bins, bin_values, hist_horizontal) + + panel.spines['left'].set_visible(left_spine) + panel.spines['bottom'].set_visible(bottom_spine) + panel.spines['top'].set_visible(top_spine) + panel.spines['right'].set_visible(right_spine) + + if y_label is not None: + panel.set_ylabel(y_label) + if x_label is not None: + panel.set_xlabel(x_label) + + +def generate_heat_map(panel, data_frame, min_plot_length, min_plot_qual, + max_plot_length, max_plot_qual, color, **kwargs): + panel.set_xlim([min_plot_qual, max_plot_qual]) + panel.set_ylim([min_plot_length, max_plot_length]) + + if kwargs["kmerdf"]: + hex_this = data_frame.query('length<{} and numks<{}'.format( + max_plot_length, max_plot_qual)) + # This single line controls plotting the hex bins in the panel + hex_vals = panel.hexbin(hex_this['numks'], hex_this['length'], gridsize=int(np.ceil(max_plot_qual/2)), + linewidths=0.0, cmap=color) + + else: + hex_this = data_frame.query('length<{} and meanQual<{}'.format( + max_plot_length, max_plot_qual)) + + # This single line controls plotting the hex bins in the panel + hex_vals = panel.hexbin(hex_this['meanQual'], hex_this['length'], gridsize=49, + linewidths=0.0, cmap=color) + for each in panel.spines: + panel.spines[each].set_visible(False) + + counts = hex_vals.get_array() + return counts + + +def generate_legend(panel, counts, color): + + # completely custom for more control + panel.set_xlim([0, 1]) + panel.set_ylim([0, 1000]) + panel.set_yticks([int(x) for x in np.linspace(0, 1000, 6)]) + panel.set_yticklabels([int(x) for x in np.linspace(0, max(counts), 6)]) + for i in np.arange(0, 1001, 1): + rgba = color(i / 1001) + alpha = rgba[-1] + facec = rgba[0:3] + hist_rectangle = mplpatches.Rectangle((0, i), 1, 1, + linewidth=0.0, + facecolor=facec, + edgecolor=(0, 0, 0), + alpha=alpha) + panel.add_patch(hist_rectangle) + panel.spines['top'].set_visible(False) + panel.spines['left'].set_visible(False) + panel.spines['bottom'].set_visible(False) + panel.yaxis.set_label_position("right") + panel.set_ylabel('Number of Reads') + + +def margin_plot(df, **kwargs): + rc.update_rcParams() + + # 250, 231, 34 light yellow + # 67, 1, 85 + # R=np.linspace(65/255,1,101) + # G=np.linspace(0/255, 231/255, 101) + # B=np.linspace(85/255, 34/255, 101) + # R=65/255, G=0/255, B=85/255 + Rf = 65 / 255 + Bf = 85 / 255 + pdict = {'red': ((0.0, Rf, Rf), + (1.0, Rf, Rf)), + 'green': ((0.0, 0.0, 0.0), + (1.0, 0.0, 0.0)), + 'blue': ((0.0, Bf, Bf), + (1.0, Bf, Bf)), + 'alpha': ((0.0, 0.0, 0.0), + (1.0, 1.0, 1.0)) + } + # Now we will use this example to illustrate 3 ways of + # handling custom colormaps. + # First, the most direct and explicit: + purple1 = LinearSegmentedColormap('Purple1', pdict) + + # set the figure dimensions + fig_width = 1.61 * 3 + fig_height = 1 * 3 + fig = plt.figure(figsize=(fig_width, fig_height)) + + # set the panel dimensions + heat_map_panel_width = fig_width * 0.5 + heat_map_panel_height = heat_map_panel_width * 0.62 + + # find the margins to center the panel in figure + fig_left_margin = fig_bottom_margin = (1 / 6) + + # lengthPanel + length_panel_width = (1 / 8) + + # the color Bar parameters + legend_panel_width = (1 / 24) + + # define padding + h_padding = 0.02 + v_padding = 0.05 + + # Set whether to include y-axes in histograms + if kwargs["Y_AXES"]: + length_bottom_spine = True + length_bottom_tick = False + length_bottom_label = True + qual_left_spine = True + qual_left_tick = True + qual_left_label = True + qual_y_label = 'Count' + else: + length_bottom_spine = False + length_bottom_tick = False + length_bottom_label = False + qual_left_spine = False + qual_left_tick = False + qual_left_label = False + qual_y_label = None + + panels = [] + + # Quality histogram panel + qual_panel_left = fig_left_margin + length_panel_width + h_padding + qual_panel_width = heat_map_panel_width / fig_width + qual_panel_height = length_panel_width * fig_width / fig_height + qual_panel = generate_panel(qual_panel_left, + fig_bottom_margin, + qual_panel_width, + qual_panel_height, + left_tick_param=qual_left_tick, + label_left_tick_param=qual_left_label) + panels.append(qual_panel) + + # Length histogram panel + length_panel_bottom = fig_bottom_margin + qual_panel_height + v_padding + length_panel_height = heat_map_panel_height / fig_height + length_panel = generate_panel(fig_left_margin, + length_panel_bottom, + length_panel_width, + length_panel_height, + bottom_tick_param=length_bottom_tick, + label_bottom_tick_param=length_bottom_label) + panels.append(length_panel) + + # Heat map panel + heat_map_panel_left = fig_left_margin + length_panel_width + h_padding + heat_map_panel_bottom = fig_bottom_margin + qual_panel_height + v_padding + + heat_map_panel = generate_panel(heat_map_panel_left, + heat_map_panel_bottom, + heat_map_panel_width / fig_width, + heat_map_panel_height / fig_height, + bottom_tick_param=False, + label_bottom_tick_param=False, + left_tick_param=False, + label_left_tick_param=False) + panels.append(heat_map_panel) + heat_map_panel.set_title(kwargs["title"]) + + # Legend panel + legend_panel_left = fig_left_margin + length_panel_width + \ + (heat_map_panel_width / fig_width) + (h_padding * 2) + legend_panel_bottom = fig_bottom_margin + qual_panel_height + v_padding + legend_panel_height = heat_map_panel_height / fig_height + legend_panel = generate_panel(legend_panel_left, legend_panel_bottom, + legend_panel_width, legend_panel_height, + bottom_tick_param = False, + label_bottom_tick_param = False, + left_tick_param = False, + label_left_tick_param = False, + right_tick_param = True, + label_right_tick_param = True) + panels.append(legend_panel) + + # Set min and max viewing window for length + if kwargs["plot_maxlen"]: + max_plot_length = kwargs["plot_maxlen"] + else: + max_plot_length = int(np.percentile(df['length'], 99)) + min_plot_length = kwargs["plot_minlen"] + + # Set length bin sizes + if kwargs["lengthbin"]: + length_bin_interval = kwargs["lengthbin"] + else: + # Dividing by 80 is based on what looks good from experience + length_bin_interval = int(max_plot_length / 80) + + # length_bins = np.arange(0, max_plot_length, length_bin_interval) + + # Set max and min viewing window for quality + if kwargs["plot_maxqual"]: + max_plot_qual = kwargs["plot_maxqual"] + elif kwargs["kmerdf"]: + max_plot_qual = np.ceil(df["numks"].median() * 2) + else: + max_plot_qual = max(np.ceil(df['meanQual'])) + min_plot_qual = kwargs["plot_minqual"] + + # Set qual bin sizes + if kwargs["qualbin"]: + qual_bin_interval = kwargs["qualbin"] + elif kwargs["kmerdf"]: + qual_bin_interval = 1 + else: + # again, this is just based on what looks good from experience + qual_bin_interval = max_plot_qual / 85 + qual_bins = np.arange(0, max_plot_qual, qual_bin_interval) + + # Generate length histogram + generate_histogram(length_panel, df['length'], max_plot_length, min_plot_length, + length_bin_interval, hist_horizontal=False, + y_label='Read Length', bottom_spine=length_bottom_spine) + + # Generate quality histogram + if kwargs["kmerdf"]: + generate_histogram(qual_panel, df['numks'], max_plot_qual, min_plot_qual, + qual_bin_interval, x_label='number of kmers', + y_label=qual_y_label, left_spine=qual_left_spine) + else: + generate_histogram(qual_panel, df['meanQual'], max_plot_qual, min_plot_qual, + qual_bin_interval, x_label='Phred Quality', + y_label=qual_y_label, left_spine=qual_left_spine) + + # Generate heat map + counts = generate_heat_map(heat_map_panel, df, min_plot_length, min_plot_qual, + max_plot_length, max_plot_qual, purple1, kmerdf = kwargs["kmerdf"]) + + # Generate legend + generate_legend(legend_panel, counts, purple1) + + # inform the user of the plotting window if not quiet mode + if not kwargs["QUIET"]: + print("""plotting in the following window: + {0} <= Q-score (x-axis) <= {1} + {2} <= length (y-axis) <= {3}""".format( + min_plot_qual, max_plot_qual, min_plot_length, max_plot_length), + file=stderr) + # Print image(s) + if kwargs["BASENAME"] is None: + file_base = opath.splitext(opath.basename(kwargs["fastq"]))[0] + else: + file_base = kwargs["BASENAME"] + print_images( + file_base, + image_formats=kwargs["fileform"], + dpi=kwargs["dpi"], + no_timestamp = kwargs["no_timestamp"], + transparent=kwargs["TRANSPARENT"]) + +def run(args): + if args.kmerdf: + df = pd.read_csv(args.kmerdf, header='infer', sep='\t') + df["kmers"] = df["kmers"].apply(ast.literal_eval) + else: + df = parse_fastq_length_meanqual(args.fastq) + df = filter_fastq_length_meanqual(df, args.filt_minlen, args.filt_maxlen, + args.filt_minqual, args.filt_maxqual) + stats(df, args.fastq, False) + margin_plot(df=df.dropna(), **vars(args)) diff --git a/pauvre/pauvre_main.py b/pauvre/pauvre_main.py new file mode 100644 index 0000000..2a5fb91 --- /dev/null +++ b/pauvre/pauvre_main.py @@ -0,0 +1,636 @@ +#!/usr/bin/env python +# -*- coding: utf-8 -*- + +# pauvre - just a pore plotting package +# Copyright (c) 2016-2017 Darrin T. Schultz. All rights reserved. +# +# This file is part of pauvre. +# +# pauvre is free software: you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation, either version 3 of the License, or +# (at your option) any later version. +# +# pauvre is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with pauvre. If not, see <http://www.gnu.org/licenses/>. + +# I modeled this code on https://github.com/arq5x/poretools/. Check it out. - DTS + +import sys +import os.path +import argparse + +# pauvre imports +import pauvre.version +# This class is used in argparse to expand the ~. This avoids errors caused on +# some systems. + + +class FullPaths(argparse.Action): + """Expand user- and relative-paths""" + + def __call__(self, parser, namespace, values, option_string=None): + setattr(namespace, self.dest, + os.path.abspath(os.path.expanduser(values))) + + +class FullPathsList(argparse.Action): + """Expand user- and relative-paths when a list of paths is passed to the + program""" + + def __call__(self, parser, namespace, values, option_string=None): + setattr(namespace, self.dest, + [os.path.abspath(os.path.expanduser(value)) for value in values]) + +def run_subtool(parser, args): + if args.command == 'browser': + import pauvre.browser as submodule + elif args.command == 'custommargin': + import pauvre.custommargin as submodule + elif args.command == 'marginplot': + import pauvre.marginplot as submodule + elif args.command == 'redwood': + import pauvre.redwood as submodule + elif args.command == 'stats': + import pauvre.stats as submodule + elif args.command == 'synplot': + import pauvre.synplot as submodule + # run the chosen submodule. + submodule.run(args) + + +class ArgumentParserWithDefaults(argparse.ArgumentParser): + def __init__(self, *args, **kwargs): + super(ArgumentParserWithDefaults, self).__init__(*args, **kwargs) + self.add_argument("-q", "--quiet", help="Do not output warnings to stderr", + action="store_true", + dest="QUIET") + +def main(): + + ######################################### + # create the top-level parser + ######################################### + parser = argparse.ArgumentParser( + prog='pauvre', formatter_class=argparse.ArgumentDefaultsHelpFormatter) + parser.add_argument("-v", "--version", help="Installed pauvre version", + action="version", + version="%(prog)s " + str(pauvre.version.__version__)) + subparsers = parser.add_subparsers( + title='[sub-commands]', dest='command', parser_class=ArgumentParserWithDefaults) + + ######################################### + # create the individual tool parsers + ######################################### + + ############# + # browser + ############# + parser_browser = subparsers.add_parser('browser', + help="""an adaptable genome + browser with various track types""") + parser_browser.add_argument('-c', '--chromosomeid', + metavar = "Chr", + dest = 'CHR', + type = str, + help = """The fasta sequence to observe. + Use the header name of the fasta file + without the '>' character""") + parser_browser.add_argument('--dpi', + metavar='dpi', + default=600, + type=int, + help="""Change the dpi from the default 600 + if you need it higher""") + parser_browser.add_argument('--fileform', + dest='fileform', + metavar='STRING', + choices=['png', 'pdf', 'eps', 'jpeg', 'jpg', + 'pdf', 'pgf', 'ps', 'raw', 'rgba', + 'svg', 'svgz', 'tif', 'tiff'], + default=['png'], + nargs='+', + help='Which output format would you like? Def.=png') + parser_browser.add_argument("--no_timestamp", + action = 'store_true', + help="""Turn off time stamps in the filename + output.""") + parser_browser.add_argument('-o', '--output-base-name', + dest='BASENAME', + help="""Specify a base name for the output file( + s). The input file base name is the + default.""") + parser_browser.add_argument('--path', + type=str, + help="""Set an explicit filepath for the output. + Only do this if you have selected one output type.""") + parser_browser.add_argument('-p', '--plot_commands', + dest='CMD', + nargs = '+', + help="""Write strings here to select what + to plot. The format for each track is: + <type>:<path>:<style>:<options> + + To plot the reference, the format is: + ref:<style>:<options> + + Surround each track string with double + quotes and a space between subsequent strings. + "bam:mybam.bam:depth" "ref:colorful" + """) + parser_browser.add_argument('--ratio', + nargs = '+', + type = float, + default=None, + help="""Enter the dimensions (arbitrary units) + to plot the figure. For example a figure that is + seven times wider than tall is: + --ratio 7 1""") + parser_browser.add_argument('-r', '--reference', + metavar='REF', + dest = 'REF', + action=FullPaths, + help='The reference fasta file.') + parser_browser.add_argument('--start', + metavar='START', + dest = 'START', + type = int, + help="""The start position to observe on the + fasta file. Uses 1-based indexing.""") + parser_browser.add_argument('--stop', + metavar='STOP', + dest = 'STOP', + type = int, + help="""The stop position to observe on the + fasta file. Uses 1-based indexing.""") + parser_browser.add_argument('-T', '--transparent', + action='store_false', + help="""Specify this option if you DON'T want a + transparent background. Default is on.""") + parser_browser.set_defaults(func=run_subtool) + + ################ + # custommargin + ################ + parser_custmar = subparsers.add_parser('custommargin', + help='plot custom marginal histograms of tab-delimited files') + parser_custmar.add_argument('--dpi', + metavar='dpi', + default=600, + type=int, + help="""Change the dpi from the default 600 + if you need it higher""") + parser_custmar.add_argument('--fileform', + dest='fileform', + metavar='STRING', + choices=['png', 'pdf', 'eps', 'jpeg', 'jpg', + 'pdf', 'pgf', 'ps', 'raw', 'rgba', + 'svg', 'svgz', 'tif', 'tiff'], + default=['png'], + nargs='+', + help='Which output format would you like? Def.=png') + parser_custmar.add_argument('-i', '--input_file', + action=FullPaths, + help="""A tab-separated file with a header row + of column names.""") + parser_custmar.add_argument('--xcol', + type=str, + help="""The column name of the data to plot on + the x-axis""") + parser_custmar.add_argument('--ycol', + type=str, + help="""The column name of the data to plot on + the y-axis""") + parser_custmar.add_argument('-n', '--no_transparent', + action='store_false', + help="""Specify this option if + you don't want a transparent background. Default + is on.""") + parser_custmar.add_argument("--no_timestamp", + action = 'store_true', + help="""Turn off time stamps in the filename + output.""") + parser_custmar.add_argument('-o', '--output_base_name', + default = None, + help='Specify a base name for the output file(' + 's). The input file base name is the ' + 'default.') + parser_custmar.add_argument('--plot_max_y', + type=int, + help="""Sets the maximum viewing area in the + length dimension.""") + parser_custmar.add_argument('--plot_max_x', + type=float, + help="""Sets the maximum viewing area in the + quality dimension.""") + parser_custmar.add_argument('--plot_min_y', + type=int, + default=0, + help="""Sets the minimum viewing area in the + length dimension. Default value = 0""") + parser_custmar.add_argument('--plot_min_x', + type=float, + default=0, + help="""Sets the minimum viewing area in the + quality dimension. Default value = 0""") + parser_custmar.add_argument('--square', + action='store_true', + help="""changes the hexmap into a square + map quantized by ints.""") + parser_custmar.add_argument('-t', '--title', + type = str, + help="""This sets the title for the whole plot. + Use --title "Crustacean's DNA read quality" + if you need single quote or apostrophe + inside title.""") + parser_custmar.add_argument('--ybin', + type=int, + help="""This sets the bin size to use for length.""") + parser_custmar.add_argument('--xbin', + type=float, + help="""This sets the bin size to use for quality""") + parser_custmar.add_argument('--add-yaxes', + dest='Y_AXES', + action='store_true', + help='Add Y-axes to both marginal histograms.') + parser_custmar.set_defaults(func=run_subtool) + + + ############# + # marginplot + ############# + parser_mnplot = subparsers.add_parser('marginplot', + help='plot a marginal histogram of a fastq file') + parser_mnplot.add_argument('--dpi', + metavar='dpi', + default=600, + type=int, + help="""Change the dpi from the default 600 + if you need it higher""") + parser_mnplot.add_argument('-f', '--fastq', + metavar='FASTQ', + action=FullPaths, + help='The input FASTQ file.') + parser_mnplot.add_argument('--fileform', + dest='fileform', + metavar='STRING', + choices=['png', 'pdf', 'eps', 'jpeg', 'jpg', + 'pdf', 'pgf', 'ps', 'raw', 'rgba', + 'svg', 'svgz', 'tif', 'tiff'], + default=['png'], + nargs='+', + help='Which output format would you like? Def.=png') + parser_mnplot.add_argument('--filt_maxlen', + type=int, + help="""This sets the max read length filter reads.""") + parser_mnplot.add_argument('--filt_maxqual', + type=float, + help="""This sets the max mean read quality + to filter reads.""") + parser_mnplot.add_argument('--filt_minlen', + type=int, + default=0, + help="""This sets the min read length to + filter reads.""") + parser_mnplot.add_argument('--filt_minqual', + type=float, + default=0, + help="""This sets the min mean read quality + to filter reads.""") + parser_mnplot.add_argument('--kmerdf', + type = str, + default = None, + help = """Pass the filename of a data matrix if + you would like to plot read length + versus number of kmers in that read. The data matrix + is a tab-separated text file with columns + "id length numks and kmers", where: + <id> = read id + <length> = the length of the read + <numks> = the number of canonical kmers in the read + <kmers> = a list representation of kmers ie ['GAT', 'GTA']""") + parser_mnplot.add_argument('-n', '--no_transparent', + dest='TRANSPARENT', + action='store_false', + help="""Specify this option if + you don't want a transparent background. Default + is on.""") + parser_mnplot.add_argument("--no_timestamp", + action = 'store_true', + help="""Turn off time stamps in the filename + output.""") + parser_mnplot.add_argument('-o', '--output_base_name', + dest='BASENAME', + help='Specify a base name for the output file(' + 's). The input file base name is the ' + 'default.') + parser_mnplot.add_argument('--plot_maxlen', + type=int, + help="""Sets the maximum viewing area in the + length dimension.""") + parser_mnplot.add_argument('--plot_maxqual', + type=float, + help="""Sets the maximum viewing area in the + quality dimension.""") + parser_mnplot.add_argument('--plot_minlen', + type=int, + default=0, + help="""Sets the minimum viewing area in the + length dimension.""") + parser_mnplot.add_argument('--plot_minqual', + type=float, + default=0, + help="""Sets the minimum viewing area in the + quality dimension.""") + parser_mnplot.add_argument('--lengthbin', + metavar='LENGTHBIN', + type=int, + help="""This sets the bin size to use for length.""") + parser_mnplot.add_argument('--qualbin', + metavar='QUALBIN', + type=float, + help="""This sets the bin size to use for quality""") + parser_mnplot.add_argument('-t', '--title', + metavar='TITLE', + default='Read length vs mean quality', + help="""This sets the title for the whole plot. + Use --title "Crustacean's DNA read quality" + if you need single quote or apostrophe + inside title.""") + parser_mnplot.add_argument('-y', '--add-yaxes', + dest='Y_AXES', + action='store_true', + help='Add Y-axes to both marginal histograms.') + parser_mnplot.set_defaults(func=run_subtool) + + ############# + # redwood + ############# + parser_redwood = subparsers.add_parser('redwood', + help='make a redwood plot from a bam file') + parser_redwood.add_argument('-d', '--doubled', + dest='doubled', + choices=['main', 'rnaseq'], + default=[], + nargs='+', + help="""If your fasta file was doubled to map + circularly, use this flag or you may encounter + plotting errors. Accepts multiple arguments. + 'main' is for the sam file passed with --sam, + 'rnaseq' is for the sam file passed with --rnaseq""") + parser_redwood.add_argument('--dpi', + metavar='dpi', + default=600, + type=int, + help="""Change the dpi from the default 600 + if you need it higher""") + parser_redwood.add_argument('--fileform', + dest='fileform', + metavar='STRING', + choices=['png', 'pdf', 'eps', 'jpeg', 'jpg', + 'pdf', 'pgf', 'ps', 'raw', 'rgba', + 'svg', 'svgz', 'tif', 'tiff'], + default=['png'], + nargs='+', + help='Which output format would you like? Def.=png') + parser_redwood.add_argument('--gff', + metavar='gff', + action=FullPaths, + help="""The input filepath for the gff annotation + to plot""") + parser_redwood.add_argument('-I', '--interlace', + action='store_true', + default=False, + help="""Interlace the reads so the pileup plot + looks better. Doesn't work currently""") + parser_redwood.add_argument('-i', '--invert', + action='store_true', + default=False, + help="""invert the image so that it looks better + on a dark background. DOESN'T DO ANYTHING.""") + parser_redwood.add_argument('-L', '--log', + action='store_true', + default=False, + help="""Plot the RNAseq track with a log scale""") + parser_redwood.add_argument('-M', '--main_bam', + metavar='mainbam', + action=FullPaths, + help="""The input filepath for the bam file to + plot. Ideally was plotted with a fasta file that + is two copies of the mitochondrial genome + concatenated. This should be long reads (ONT, PB) + and will be displayed in the interior of the + redwood plot.""") + parser_redwood.add_argument("--no_timestamp", + action = 'store_true', + help="""Turn off time stamps in the filename + output.""") + parser_redwood.add_argument('-o', '--output-base-name', + dest='BASENAME', + help='Specify a base name for the output file(' + 's). The input file base name is the ' + 'default.') + parser_redwood.add_argument('--query', + dest='query', + default=['ALNLEN >= 10000', 'MAPLEN < reflength'], + nargs='+', + help='Query your reads to change plotting options') + parser_redwood.add_argument('-R', '--rnaseq_bam', + metavar='rnabam', + action=FullPaths, + help='The input filepath for the rnaseq bam file to plot') + parser_redwood.add_argument('--small_start', + dest='small_start', + choices=['inside', 'outside'], + default='inside', + help="""This determines where the shortest of the + filtered reads will appear on the redwood plot: + on the outside or on the inside? The default + option puts the longest reads on the outside and + the shortest reads on the inside.""") + parser_redwood.add_argument('--sort', + dest='sort', + choices=['ALNLEN', 'TRULEN', 'MAPLEN', 'POS'], + default='ALNLEN', + help="""What value to use to sort the order in + which the reads are plotted?""") + parser_redwood.add_argument('--ticks', + type = int, + nargs = '+', + default = [0, 10, 100, 1000], + help="""Specify control for the number of ticks.""") + parser_redwood.add_argument('-T', '--transparent', + action='store_false', + help="""Specify this option if you DON'T want a + transparent background. Default is on.""") + parser_redwood.set_defaults(func=run_subtool) + + ############# + # stats + ############# + parser_stats = subparsers.add_parser('stats', + help='outputs stats from a fastq file') + parser_stats.add_argument('-f', '--fastq', + metavar='FASTQ', + action=FullPaths, + help='The input FASTQ file.') + parser_stats.add_argument('-H', '--histogram', + action='store_true', + help="""Make a histogram of the read lengths and + saves it to a new file""") + parser_stats.add_argument('--filt_maxlen', + type=int, + help="""This sets the max read length filter reads.""") + parser_stats.add_argument('--filt_maxqual', + type=float, + help="""This sets the max mean read quality + to filter reads.""") + parser_stats.add_argument('--filt_minlen', + type=int, + default=0, + help="""This sets the min read length to + filter reads.""") + parser_stats.add_argument('--filt_minqual', + type=float, + default=0, + help="""This sets the min mean read quality + to filter reads.""") + + parser_stats.set_defaults(func=run_subtool) + + ############# + # synplot + ############# + parser_synplot = subparsers.add_parser('synplot', + help="""make a synteny plot from a gff + file, protein alignment, and partition + file""") + parser_synplot.add_argument('--aln_dir', + metavar='aln_dir', + action=FullPaths, + help="""The directory where all the fasta + alignments are contained.""") + parser_synplot.add_argument('--center_on', + type=str, + default=None, + help="""Centers the plot around the gene that + you pass as an argument. For example, if there + is a locus called 'COI' in the gff file and in + the alignments directory, center using: + --center_on COI""") + parser_synplot.add_argument('--dpi', + metavar='dpi', + default=600, + type=int, + help="""Change the dpi from the default 600 + if you need it higher""") + parser_synplot.add_argument('--fileform', + dest='fileform', + choices=['png', 'pdf', 'eps', 'jpeg', 'jpg', + 'pdf', 'pgf', 'ps', 'raw', 'rgba', + 'svg', 'svgz', 'tif', 'tiff'], + default=['png'], + nargs='+', + help='Which output format would you like? Def.=png') + parser_synplot.add_argument('--gff_paths', + metavar='gff_paths', + action=FullPathsList, + nargs='+', + help="""The input filepath. for the gff annotation + to plot. Individual filepaths separated by spaces. + For example, --gff_paths sp1.gff sp2.gff sp3.gff""") + parser_synplot.add_argument('--gff_labels', + metavar='gff_labels', + type=str, + nargs='+', + help="""In case the gff names and sequence names + don't match, change the labels that will appear + over the text.""") + parser_synplot.add_argument("--no_timestamp", + action = 'store_true', + help="""Turn off time stamps in the filename + output.""") + parser_synplot.add_argument('--optimum_order', + action='store_true', + help="""If selected, this doesn't plot the + optimum arrangement of things as they are input + into gff_paths. Instead, it uses the first gff + file as the top-most sequence in the plot, and + reorganizes the remaining gff files to minimize + the number of intersections.""") + parser_synplot.add_argument('-o', '--output-basename', + dest='BASENAME', + help='Specify a base name for the output file(' + 's). The input file base name is the ' + 'default.') + parser_synplot.add_argument('--ratio', + nargs = '+', + type = float, + default=None, + help="""Enter the dimensions (arbitrary units) + to plot the figure. For example a figure that is + seven times wider than tall is: + --ratio 7 1""") + parser_synplot.add_argument('--sandwich', + action='store_true', + default=False, + help="""Put an additional copy of the first gff + file on the bottom of the plot for comparison.""") + parser_synplot.add_argument('--start_with_aligned_genes', + action='store_true', + default=False, + help="""Minimizes the number of intersections + but only selects combos where the first gene in + each sequence is aligned.""") + parser_synplot.add_argument('--stop_codons', + action='store_true', + default=True, + help="""Performs some internal corrections if + the gff annotation includes the stop + codons in the coding sequences.""") + parser_synplot.add_argument('-T', '--transparent', + action='store_false', + help="""Specify this option if you DON'T want a + transparent background. Default is on.""") + parser_synplot.set_defaults(func=run_subtool) + + ####################################################### + # parse the args and call the selected function + ####################################################### + + args = parser.parse_args() + + # If there were no args, print the help function + if len(sys.argv) == 1: + parser.print_help() + sys.exit(1) + + # If there were no args, but someone selected a program, + # print the program's help. + commandDict = {'browser': parser_browser.print_help, + 'custommargin': parser_custmar.print_help, + 'marginplot': parser_mnplot.print_help, + 'redwood': parser_redwood.print_help, + 'stats': parser_stats.print_help, + 'synplot': parser_synplot.print_help} + + if len(sys.argv) == 2: + commandDict[args.command]() + sys.exit(1) + + if args.QUIET: + logger.setLevel(logging.ERROR) + + try: + args.func(parser, args) + except IOError as e: + if e.errno != 32: # ignore SIGPIPE + raise + +if __name__ == "__main__": + main() diff --git a/pauvre/rcparams.py b/pauvre/rcparams.py new file mode 100644 index 0000000..a8791b8 --- /dev/null +++ b/pauvre/rcparams.py @@ -0,0 +1,25 @@ +# -*- coding: utf-8 -*- + +from matplotlib import rcParams + +def update_rcParams(): + # This mpl style is from the UCSC BME163 class. + rcParams.update({ + 'font.size' : 8.0 , + 'xtick.major.size' : 2 , + 'xtick.major.width' : 0.75 , + 'xtick.labelsize' : 8.0 , + 'xtick.direction' : 'out' , + 'ytick.major.size' : 2 , + 'ytick.major.width' : 0.75 , + 'ytick.labelsize' : 8.0 , + 'ytick.direction' : 'out' , + 'xtick.major.pad' : 2 , + 'xtick.minor.pad' : 2 , + 'ytick.major.pad' : 2 , + 'ytick.minor.pad' : 2 , + 'savefig.dpi' : 601 , + 'axes.linewidth' : 0.75 , + "image.composite_image": False , + "pdf.fonttype" : 42, + "ps.fonttype" : 42 }) diff --git a/pauvre/redwood.py b/pauvre/redwood.py new file mode 100644 index 0000000..059440d --- /dev/null +++ b/pauvre/redwood.py @@ -0,0 +1,1047 @@ +#!/usr/bin/env python +# -*- coding: utf-8 -*- + +# pauvre - just a pore plotting package +# Copyright (c) 2016-2018 Darrin T. Schultz. All rights reserved. +# twitter @conchoecia +# +# This file is part of pauvre. +# +# pauvre is free software: you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation, either version 3 of the License, or +# (at your option) any later version. +# +# pauvre is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with pauvre. If not, see <http://www.gnu.org/licenses/>. + +# SAM/BAM todo +# i/ for loop took 269 seconds ~ 4.48 minutes + +#things to do + +# This doesn't make much sense. it is perfectly possible that it could +# double and logbe both (in reference to how histograms are plotted + +# - make each layer operate independently +# - for each of these, make the program figure out the total length either from the GFF or from the bam file +# - make the error "Your query was too stringent and no reads resulted..." not give a traceback. +# - raise the proper error if the sam file has no header. +# - figure out how to update rcParams every time we run a program +# - Write a better docstring for how plotArc works. +# - Write a docstring for seqOrder method. I don't remember what it does +# - write a better function to get the alignment length of the sam/bam file +# right now it opens the file twice and only gets the length the first time +# - drop columns by name, not by column number (samFile.drop...) +# - Here's another that needs to be fixed: samFile.drop(samFile.columns[3], axis=1, inplace=True) +# - args +# - get the filename from args +# - set up a double-mapped mode to wrap reads around the 0th coordinate for +# circular assemblies +# - Make the r-dist something that the user can change. +# Getting Everything on the Same Track +# - Make a list of features to plot `plot_order = []` or something like that +# - First, go through the GFF features and come up with all of the things that +# AREN'T tRNAs that overlap. Store each set of overlaps as its own set of +# features. For things with no overlap, add those to the `plot_order` alone +# - Iterate through all elements of `plot_order`, if all elements are forward +# (start < stop), then draw the element at the end first with no modification, +# then for every subsequent element draw a white border around the arrow. +# The element same element should have the start of its arrow drawn to 1 +# degree before the start of the next one. This will make a chevron pattern +# That will show both the start and stop of both reads accurately. +# - If both elements are reverse, then do the same thing, but in reverse +# - If one element is in reverse, but another is forward, then make all the +# elements in the set half-width since there are too many possible +# combinations to code reliably. + +import matplotlib +matplotlib.use('Agg') +import matplotlib.pyplot as plt +import matplotlib.patches as patches +from matplotlib import rcParams +import platform +import itertools +import pandas as pd +import numpy as np +import scipy as sp +import time +import os, sys +import warnings + +# import the pauvre rcParams +import pauvre.rcparams as rc +from pauvre.functions import GFFParse, print_images +from pauvre.bamparse import BAMParse + +# following this tutorial to install helvetica +# https://github.com/olgabot/sciencemeetproductivity.tumblr.com/blob/master/posts/2012/11/how-to-set-helvetica-as-the-default-sans-serif-font-in.md +global hfont +hfont = {'fontname':'Helvetica'} + +def plotArc(start_angle, stop_angle, radius, width, **kwargs): + """ write a docstring for this function""" + numsegments = 100 + theta = np.radians(np.linspace(start_angle+90, stop_angle+90, numsegments)) + centerx = 0 + centery = 0 + x1 = -np.cos(theta) * (radius) + y1 = np.sin(theta) * (radius) + stack1 = np.column_stack([x1, y1]) + x2 = -np.cos(theta) * (radius + width) + y2 = np.sin(theta) * (radius + width) + stack2 = np.column_stack([np.flip(x2, axis=0), np.flip(y2,axis=0)]) + #add the first values from the first set to close the polygon + np.append(stack2, [[x1[0],y1[0]]], axis=0) + arcArray = np.concatenate((stack1,stack2), axis=0) + return patches.Polygon(arcArray, True, **kwargs), ((x1, y1), (x2, y2)) + +def fix_query_reflength(sequence_length, queries, doubled): + """ + arguments: + <sequence_length> This is the reference fasta length. It should be 2x the actual + length of the reference since this program takes a sam file from + a concatenated reference. + <queries> This is a list of SQL-type query strings. This is generated + from argparse. + + purpose: + This function takes in a list of queries to use for read filtering + for the redwood plot. It is often not advisable to plot all mapped reads + since many of them are too small relative to the reference length. Also, + the point of a death star plot is to show continuity of a circular + reference, so short reads aren't very helpful there either. + + Currently, this function only recognizes the keyword argument 'reflength'. + """ + if not doubled: + sequence_length = int(sequence_length * 2) + for i in range(len(queries)): + if 'reflength' in queries[i].split(): + queries[i] = queries[i].replace('reflength', str(int(sequence_length/2))) + +def cust_log(base, val): + try: + #val = np.log(val)/np.log(base) + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + val = np.log(val) + except: + val = 0 + if np.isinf(val): + val = 0 + return val + +def plot_histo(panel, args, angleMap, width_map, start_radius, track_radius, + thisLog = False, double = False, ticks = False): + """Plots a histogram given a width map. Width map must be the true length + of the circular genome""" + myPatches = [] + if double: + #plot a line in the middle if doubled to distinguish the center + mid_radius = start_radius+((track_radius - (track_radius/100))/2) + arc, arcArray = plotArc(start_angle=0, stop_angle=360, + radius=mid_radius, + width=track_radius/100, fc='yellow') + myPatches.append(arc) + #this is only 1/100 the width of the band + + for i in range(len(width_map)): + iStartIndex = i + iStopIndex = i + 1 + iStartAngle = angleMap[iStartIndex] + iStopAngle = angleMap[iStopIndex] + # This doesn't make much sense. it is perfectly possible that it could + # double and logbe both + if double: + logwidth = -1 * track_radius * (np.log(width_map[i]-9)/np.log(max(width_map))) * 0.495 + + log_start_radius = start_radius+((track_radius - (track_radius/100))/2) + width = track_radius * (width_map[i]/max(width_map)) * 0.495 + normal_start_radius = start_radius+((track_radius - (track_radius/100))/2) + (track_radius/100) + #first plot the log inside + arc, arcArray = plotArc(start_angle=iStartAngle, stop_angle=iStopAngle, + radius=log_start_radius, + width=logwidth, fc='black') + myPatches.append(arc) + # now plot the normal outside + arc, arcArray = plotArc(start_angle=iStartAngle, stop_angle=iStopAngle, + radius=normal_start_radius, + width=width, fc='black') + else: + if thisLog: + base = 10 + width = track_radius * (cust_log(base, width_map[i])/cust_log(base, max(width_map))) + + else: + width = track_radius * (width_map[i]/max(width_map)) + arc, arcArray = plotArc(start_angle=iStartAngle, stop_angle=iStopAngle, + radius=start_radius, + width=width, fc='black') + myPatches.append(arc) + + maxx = start_radius + maxy = 0 + kerning = 12 + if ticks != None and len(ticks) > 0: + # plot the scalebar + maxval=max(width_map) + xs = [] + ys = [] + xend = start_radius + value_list = ticks + for value in value_list: + centerAngle = 45 + this_radius = start_radius + \ + (track_radius * (cust_log(10, value)/cust_log(10, maxval))) + print("this_radius", this_radius) + arc, arcArray = plotArc(start_angle=centerAngle-1, stop_angle=centerAngle+1, + radius=this_radius, + width=track_radius/25, fc='red') + xs.append(arcArray[0][0][-1]) + ys.append(arcArray[0][1][-1]) + myPatches.append(arc) + + middle = np.mean(ys) - (kerning/2) + # The tick marks should start slightly below the lowest tick + # mark on the track itself and extend upward evenly with text. + new_ys = [middle-kerning + (kerning * i) + for i in range(len(value_list))] + for i in range(len(value_list)): + xlist = [xs[i], xend] + ylist = [ys[i], new_ys[i]] + # plots the lines between the markers and the labels + panel.plot(xlist, ylist, ls='--', color='red', lw=0.75) + # plots the text labels + panel.text(xend, new_ys[i], + value_list[i], fontsize = 12, + ha='left', va='center', + color = 'black', **hfont) + #plots the title + panel.text(start_radius, new_ys[-1]+kerning, + "Read Depth ", fontsize = 10, + ha='center', va='center', + color = 'black', **hfont) + + + return myPatches, start_radius + track_radius, panel + +def plot_reads(args, angleMap, widthDict, samFiledf, start_radius, + doubled = False, collapse = False, track_width=False, + track_depth = False, thisLog = False): + """this function plots the reads in a sam file. + Outputs the patches, and the final_radius + + Author: + Darrin T. Schultz (github@conchoecia) + """ + + # there should be a different width dict with log scale + #if args.log and thisLog: + # widthDict = {'M':0.8, # match + # 'I':0.95, # insertion relative to reference + # 'D':0.05, # deletion relative to reference + # 'N':0.1, # skipped region from the reference + # 'S':0.1, # soft clip, not aligned but still in sam file + # 'H':0.1, # hard clip, not aligned and not in sam file + # 'P':0.1, # padding (silent deletion from padded reference) + # '=':0.1, # sequence match + # 'X':0.1} # sequence mismatch + + # clean up the df to reset the number of rows, otherwise there might be + # errors in the while loop below + samFiledf = samFiledf.reset_index() + + append_radius = 0 + myPatches = [] + depth_map = [] + # if we need to collapse the reads so they aren't 1 read per line, keep track + # of plotted depth at a position with depth_map + if collapse == True: + if doubled: + plotted_depth_map = [0] * int((len(angleMap)/2)) + else: + plotted_depth_map = [0] * len(angleMap) + + # If we define a track width, then use the max read depth + # of the track to figure out the read_width + read_width = 1 + if track_width: + read_width = track_width/track_depth + + # This while loops cycles through the the list until everything is plotted + # We start at the first index of the file and count down. If we plot it, then + # we remove that read from the dataframe. Once we get to the last index, + # we attempt to plot and afterward reset the index of the dataframe and + # start from the beginning again. + i = 0 + plotted = False + #this is used to dtermine where we are during collapse + current_row = 1 + if args.log and thisLog: + current_row = 10 + skip = False + # for printing out the progress of plotting, + # get the original number of rows to make placeholders + original_rownum_charcount = len(str(len(samFiledf))) + direction = 'for' + print(samFiledf) + while len(samFiledf) > 0: + stringTuples = samFiledf.loc[i, 'TUPS'] + mapLen = samFiledf.loc[i, 'MAPLEN'] + #print("i: {} and maplen: {}".format(i, mapLen)) + #Subtract one because BAM uses 1-based indexing but plotting uses 0. + # I think I could avoid this in the future by changing the parse + start_index = samFiledf.loc[i, 'POS'] - 1 + start_angle= angleMap[start_index] + stop_index = 0 + stop_angle = angleMap[stop_index] + if args.log and thisLog: + log = 100 + read_width = track_width * ((cust_log(log, current_row + 1)/cust_log(log, track_depth + 1))\ + - (cust_log(log, current_row)/cust_log(log, track_depth + 1))) + #print("log track depth: {}".format(np.log10(track_depth + 1))) + #print("log current row: {}".format(np.log10(current_row + 1))) + #now look in the plotted_depth_map to see if there is already a read + # that overlaps with the current read we are considering. + if collapse: + for collapse_i in range(start_index, start_index+mapLen): + #print("looking at {} and found {}".format(collapse_i, plotted_depth_map[collapse_i])) + if plotted_depth_map[collapse_i] >= current_row: + skip = True + break + #if we don't skip it, we're gonna plot it! + if not skip: + # Note that we are plotting something here + if collapse: + for collapse_i in range(start_index, start_index+mapLen): + plotted_depth_map[collapse_i] = plotted_depth_map[collapse_i] + 1 + for tup in stringTuples: + if tup[1] == 'I': + #If there is an insertion, back up halfway and make plot the + # insertion to visually show a "bulge" with too much sequence. + # do not advance the start index to resume normal plotting + # after the insertion. + iStartIndex = start_index-int(tup[0]/2) + iStopIndex = iStartIndex + tup[0] + iStartAngle = angleMap[iStartIndex] + iStopAngle = angleMap[iStopIndex] + arc, arcArray = plotArc(start_angle=iStartAngle, stop_angle=iStopAngle, + radius=start_radius + append_radius, + width=widthDict[tup[1]] * read_width, fc='black') + else: + stop_index = start_index + tup[0] + stop_angle = angleMap[stop_index] + arc, arcArray = plotArc(start_angle=start_angle, stop_angle=stop_angle, + radius=start_radius + append_radius, + width=widthDict[tup[1]] * read_width, fc='black') + start_index = stop_index + start_angle = angleMap[start_index] + myPatches.append(arc) + # If we're not collapsing the reads, we just advance one every row + if not collapse: + append_radius += read_width + plotted = True + #if we've ploted something, remove that read and reset the indices + # since we will stay at the current index + samFiledf = samFiledf.drop(i) + samFiledf = samFiledf.reset_index() + samFiledf = samFiledf.drop('index', 1) + print("(countdown until done) rows: {0:0>{num}} dir: {1}\r".format( + len(samFiledf), direction, num= original_rownum_charcount), end="") + else: + #if we weren't able to plot the current read, we will advance the + # index and look for another read to plot + # I tried to speed up the algorithm here by making the software 'smart' + # about looking forward for the next read, but it was ~50% slower + i += 1 + skip = False + # we will only reach this condition if we aren't collapsing, since all + # the reads will be removed before getting to this point + if i >= len(samFiledf): + i = 0 + #Once we've gone around a complete cycle, we can jump up to start + # plotting the next row + append_radius += read_width + current_row += 1 + #every time we reset, reorganize so that we're now going in the + # opposite direction to avoid skewing all the reads in the forward direction + if args.interlace: + if direction == 'for': + bav = {"by":['POS','MAPLEN'], "asc": [False, False]} + direction= 'rev' + elif direction == 'rev': + bav = {"by":['POS','MAPLEN'], "asc": [True, False]} + direction = 'for' + samFiledf.sort_values(by=bav["by"], ascending=bav['asc'],inplace=True) + samFiledf.reset_index(inplace=True) + samFiledf.drop('index', 1, inplace=True) + print("\nfinal row is {}".format(current_row)) + return myPatches, start_radius + append_radius + +def redwood(args): + """This function controls all the plotting features for the redwood plot. + 1) determine the length of the circular fasta sequence length + """ + # get the plotting options specific to this program. + rc.update_rcParams() + # This stops numpy from printing numbers in scientific notation. + np.set_printoptions(suppress=True) + + + start = time.time() + print(args) + + #--------------------------------------------------------------- + # 1) determine the length of the circular fasta sequence length + #_______________________________________________________________ + main_doubled = True if 'main' in args.doubled else False + # It is fine that we are using a global since this will never be manipulated + global sequence_length + if args.main_bam: + samFile = BAMParse(args.main_bam, doubled = main_doubled) + sequence_length = samFile.seqlength + filename = samFile.filename + else: + sequence_length = GFFParse(args.gff).seqlen + if sequence_length == 0: + raise OSError("""You have used a SAM/BAM file with no header. Please add a header to + the file.""") + + # this also needs to be changed depending on if it was a concatenated SAM + # if doubled = true, then use linspace between 0,720 + # if doubled = false, then use linspace between 0, 360 + # on second thought, it might not be necessary to change this value even + # for doubled sequences + global angleMap + if 'main' in args.doubled: + angleMap = np.linspace(0,720,sequence_length) + else: + angleMap = np.linspace(0,360,sequence_length) + + #these are the line width for the different cigar string flags. + # usually, only M, I, D, S, and H appear in bwa mem output + widthDict = {'M':0.45, # match + 'I':0.9, # insertion relative to reference + 'D':0.05, # deletion relative to reference + 'N':0.1, # skipped region from the reference + 'S':0.1, # soft clip, not aligned but still in sam file + 'H':0.1, # hard clip, not aligned and not in sam file + 'P':0.1, # padding (silent deletion from padded reference) + '=':0.1, # sequence match + 'X':0.1} # sequence mismatch + + ################## + # make the redwood plot + ################## + + figWidth = 5 + figHeight = 5 + + # fig1 is an arbitrary name, only one figure currently + fig_1 = plt.figure(figsize=(figWidth, figHeight)) + + # This is the width and height of the plot in absolute + # values relative to the figWidth and figHeight. + circleDiameter = 5.0 + + # Center to plot + leftMargin = (figWidth - circleDiameter)/2 + bottomMargin = leftMargin + + # There is one panel in this figure that contains + # the concentric circles that represent reads. In addition, + # the latest version of this program adds the annotation to the exterior + panelCircle = plt.axes([leftMargin/figWidth, #left + bottomMargin/figHeight, #bottom + circleDiameter/figWidth, #width + circleDiameter/figHeight #height + ],frameon=False) + # Some of these are defaults and redundant, but are included + # for readability and convenience. + panelCircle.tick_params(axis='both',which='both',\ + bottom='off', labelbottom='off',\ + left='off', labelleft='off', \ + right='off', labelright='off',\ + top='off', labeltop='off') + + # Simply the list in which patches will be stored. + myPatches = [] + + # Each read occupies a width of radius = 1. The max width occupied by any + # type of match to the read is 0.9 (Insertion). The line width of a match + # is 0.45. Look at the `widthDict` dictionary above to see the defined + # widths. I chose radius = 15 to start with because it looks decent in most + # scenarios, but maybe could use some tweaking in future iterations. + start_radius_dict = {1: 10, + 20: 15, + 30: 20, + 35: 90} + if args.main_bam: + print("The number of rows is: {}".format(len(samFile.features))) + for key in sorted(start_radius_dict): + if len(samFile.features) >= key: + radius = start_radius_dict[key] + print("Chose radius: {}".format(radius)) + else: + radius = start_radius_dict[30] + + circle_fontsize = 14 + panelCircle.text(0, 0, "Position\n(bp)", fontsize = circle_fontsize, + ha='center', va='center', + color = 'black', **hfont) + + # now plot ticks around the interior, as well as the text + if 'main' in args.doubled: + real_length = int(sequence_length/2) + else: + real_length = sequence_length + this_len = int(real_length / 1000) * 1000 + for i in range(0, this_len + 1000, 1000): + startAngle = angleMap[i] - 0.75 + stopAngle = angleMap[i] + 0.75 + this_radius = radius * 0.93 + this_width = radius * 0.04 + arc, arcArray = plotArc(start_angle=startAngle, stop_angle=stopAngle, + radius=this_radius, + width=this_width, fc='black') + myPatches.append(arc) + # now plot text if (val/1000) % 2 == 0 + if (i/1000) % 2 == 0: + #the 0.98 gives some float + x_pos = -np.cos(np.radians(angleMap[i]+90)) * this_radius + y_pos = np.sin(np.radians(angleMap[i]+90)) * this_radius + rotation = 0 + if i == 0: + y_pos = np.sin(np.radians(angleMap[i]+90)) * this_radius * 0.98 + rotation = 0 + ha = 'center' + va = 'top' + if (angleMap[i] > 0 and angleMap[i] < 45): + x_pos = -np.cos(np.radians(angleMap[i]+90 + 1.5)) * this_radius + y_pos = np.sin(np.radians(angleMap[i]+90 + 1.5)) * this_radius + rotation = (90 - angleMap[i]) * 0.95 + ha = 'right' + va = 'top' + elif (angleMap[i] >= 45 and angleMap[i] < 67.5): + rotation = 90 - angleMap[i] + ha = 'right' + va = 'top' + elif (angleMap[i] >= 67.5 and angleMap[i] < 90): + # subtracted an addtl degree because it looked bad otherwise + x_pos = -np.cos(np.radians(angleMap[i]+90 - 1.75)) * this_radius + y_pos = np.sin(np.radians(angleMap[i]+90 - 1.75)) * this_radius + rotation = 90 - angleMap[i] + ha = 'right' + va = 'top' + elif (angleMap[i] >= 90 and angleMap[i] < 112.5): + # added an addtl degree because it looked bad otherwise + x_pos = -np.cos(np.radians(angleMap[i]+90 - 1.75)) * this_radius + y_pos = np.sin(np.radians(angleMap[i]+90 - 1.75)) * this_radius + rotation = 90 - angleMap[i] + ha = 'right' + #bottom is not good + va = 'center' + elif (angleMap[i] >= 112.5 and angleMap[i] < 135): + # added an addtl degree because it looked bad otherwise + x_pos = -np.cos(np.radians(angleMap[i]+90 + 1.25)) * this_radius + y_pos = np.sin(np.radians(angleMap[i]+90 + 1.25)) * this_radius + rotation = 90 - angleMap[i] + ha = 'right' + #bottom not good + #center really not good + va = 'bottom' + elif (angleMap[i] >= 135 and angleMap[i] < 157.5): + rotation = 90 - angleMap[i] + ha = 'right' + va = 'bottom' + elif (angleMap[i] >= 157.5 and angleMap[i] < 180): + x_pos = -np.cos(np.radians(angleMap[i]+90 - 1.0)) * this_radius + y_pos = np.sin(np.radians(angleMap[i]+90 - 1.0)) * this_radius + rotation = 90 - angleMap[i] + ha = 'right' + va = 'bottom' + elif (angleMap[i] >= 180 and angleMap[i] < 202.5): + # added an addtl degree because it looked bad otherwise + x_pos = -np.cos(np.radians(angleMap[i]+90 + 2.0)) * this_radius + y_pos = np.sin(np.radians(angleMap[i]+90 + 2.0)) * this_radius + rotation = 270 - angleMap[i] + ha = 'left' + va = 'bottom' + elif (angleMap[i] >= 202.5 and angleMap[i] < 225): + x_pos = -np.cos(np.radians(angleMap[i]+90 + 2.25)) * this_radius + y_pos = np.sin(np.radians(angleMap[i]+90 + 2.25)) * this_radius + rotation = 270 - angleMap[i] + ha = 'left' + va = 'bottom' + elif (angleMap[i] >= 225 and angleMap[i] < 247.5): + x_pos = -np.cos(np.radians(angleMap[i]+90 - 1.0)) * this_radius + y_pos = np.sin(np.radians(angleMap[i]+90 - 1.0)) * this_radius + rotation = 270 - angleMap[i] + ha = 'left' + va = 'bottom' + elif (angleMap[i] >= 247.5 and angleMap[i] < 270): + x_pos = -np.cos(np.radians(angleMap[i]+90 - 3.0)) * this_radius + y_pos = np.sin(np.radians(angleMap[i]+90 - 3.0)) * this_radius + rotation = 270 - angleMap[i] + ha = 'left' + va = 'bottom' + elif (angleMap[i] >= 270 and angleMap[i] < 292.5): + rotation = 270 - angleMap[i] + ha = 'left' + va = 'bottom' + elif (angleMap[i] >= 292.5 and angleMap[i] < 315): + x_pos = -np.cos(np.radians(angleMap[i]+90 + 0.25)) * this_radius + y_pos = np.sin(np.radians(angleMap[i]+90 + 0.25)) * this_radius + rotation = 270 - angleMap[i] + ha = 'left' + va = 'top' + elif (angleMap[i] >= 315 and angleMap[i] < 337.5): + x_pos = -np.cos(np.radians(angleMap[i]+90 - 1.6)) * this_radius + y_pos = np.sin(np.radians(angleMap[i]+90 - 1.6)) * this_radius + rotation = 270 - angleMap[i] + ha = 'left' + va = 'top' + elif (angleMap[i] >= 337.5 and angleMap[i] < 360): + x_pos = -np.cos(np.radians(angleMap[i]+90 - 5.0)) * this_radius + y_pos = np.sin(np.radians(angleMap[i]+90 - 5.0)) * this_radius + rotation = 270 - angleMap[i] + ha = 'left' + va = 'top' + print(" angleMap: {} value: {} rotation: {}".format(angleMap[i], i, rotation)) + text = "{}".format(i) + if i == 0: + text = "1/\n{}".format(real_length) + panelCircle.text(x_pos, y_pos, text, fontsize = circle_fontsize, + ha=ha, va=va, color = 'black', rotation=rotation, **hfont) + # Now add a legend + + # If the user wants to plot long reads, plot them + if args.main_bam: + #turn the query string into something usable, + # get rid of variables from argparse + # Make sure we haven't told the program to not query + if 'False' not in args.query: + doubled = 'main' in args.doubled + fix_query_reflength(sequence_length, args.query, doubled) + # string all the queries together + queryString = " and ".join(args.query) + print("You are using this query string to filter reads:\n'{}'".format(queryString)) + samFile = samFile.features.query(queryString) + if len(samFile) == 0: + raise IOError("""Your query was too stringent and no reads resulted. Please try + again with a less stringent test. Redwood plotter + exiting""") + + # now determine how to sort the reads in the order they will be plotted + if args.small_start == 'inside': + ascend = True + elif args.small_start == 'outside': + ascend = False + samFile = samFile.sort_values(by=args.sort, ascending=ascend) + samFile = samFile.reset_index() + # It is necessary to drop the column called index, since reset_index() + # adds this. + samFile = samFile.drop('index', 1) + + #this plots the central rings from the sam file + read_patches, radius = plot_reads(args, angleMap, widthDict, samFile, + radius, doubled = True, collapse = False) + myPatches = myPatches + read_patches + + # if the user would like to plot the annotation, plot it now. In the future, + # allow the user to select the order in which the individual elements are + # plotted. Since the annotation should have a fixed proportional size to + # the circle independent of the number of plotted reads, define a new + # radius context. + if args.gff: + print("the radius at the end of annotation is: {}".format(radius)) + panelCircle, gff_patches, gff_radius = plot_gff(args, panelCircle, args.gff, radius) + myPatches = myPatches + gff_patches + radius = gff_radius + + # it is helpful to be able to plot the RNAseq data along with the annotation. + # plot that directly outside the annotation + if args.rnaseq_bam: + print("in RNAseq") + rna_doubled = True if 'rnaseq' in args.doubled else False + bamobject = BAMParse(args.rnaseq_bam, doubled = rna_doubled) + samFile = bamobject.features + track_width = radius * 0.15 + #read_patches, radius = plot_reads(args, angleMap, + # widthDict, samFile, radius, + # doubled = rna_doubled, + # collapse = True, track_width = track_width, + # track_depth = max(bamobject.raw_depthmap), + # thisLog = True) + radius_orig=radius + read_patches, radius, panelCircle = plot_histo(panelCircle, args, angleMap, + bamobject.get_depthmap(), radius, + track_width, + thisLog = True, + ticks = args.ticks) + myPatches = myPatches + read_patches + myPatches.append(arc) + + # The numseqs value is used to determine the viewing dimensions + # of the circle we will plot. It is scaled with the number of sequences + # that will be plotted. This value should probably be set last to + # accommodate other tracks, like annotation. + min_radius = int(-5 - np.ceil(radius)) + max_radius = int(5 + np.ceil(radius)) + panelCircle.set_xlim([min_radius, max_radius]) + panelCircle.set_ylim([min_radius, max_radius]) + + for patch in myPatches: + panelCircle.add_patch(patch) + + end = time.time() + print(end - start) + # Print image(s) + if args.BASENAME is None: + file_base = "redwood" + else: + file_base = args.BASENAME + print_images( + base_output_name=file_base, + image_formats=args.fileform, + no_timestamp = args.no_timestamp, + dpi=args.dpi, + transparent=args.transparent) + +def feature_set_direction(feature_set_df): + """This function determines if the features in the dataframe passed here are + all forward, all reverse, or mixed""" + all_pos = all(feature_set_df['strand'] == '+') + all_neg = all(feature_set_df['strand'] == '-') + if all_pos: + return '+' + elif all_neg: + return '-' + else: + return 'mixed' + +def plot_feature(this_feature, colorMap, start_radius, + bar_thickness, direction, this_feature_overlaps_feature): + """This plots the track for a feature, and if there is something for + 'this_feature_overlaps_feature', then there is special processing to + add the white bar and the extra slope for the chevron + """ + myPatches = [] + if this_feature_overlaps_feature.empty: + iStartAngle = angleMap[this_feature['start']] + iStopAngle = angleMap[this_feature['stop']] - 2 + arc, arcArray = plotArc(start_angle=iStartAngle, + stop_angle=iStopAngle, + radius = start_radius, width=bar_thickness, + fc=colorMap[this_feature['featType']]) + myPatches.append(arc) + #this bit plots the arrow triangles for the genes. + # Right now it makes each arrow only 1 degree in width and uses 100 segments to plot it. + # This resolution hasn't given me any artifacts thus far + tStartAngle = angleMap[this_feature['stop']]-2 + tStopAngle = angleMap[this_feature['stop']] + angles = np.linspace(tStartAngle,tStopAngle,100) + widths = np.linspace(bar_thickness,0,100) + for j in range(len(angles)-1): + arc, arcArray = plotArc(start_angle=tStartAngle, + stop_angle=angles[j+1], + radius=start_radius+(bar_thickness-widths[j+1])/2, + width=widths[j+1], + fc=colorMap[this_feature['featType']]) + myPatches.append(arc) + else: + # First, make a solid white bar + iStartAngle = angleMap[this_feature['start']] + iStopAngle = angleMap[this_feature_overlaps_feature['start']] + arc, arcArray = plotArc(start_angle=iStartAngle, + stop_angle=iStopAngle, + radius = start_radius, width=bar_thickness, + fc=colorMap['spacebar']) + myPatches.append(arc) + # Now, make the actual color bar + iStartAngle = angleMap[this_feature['start']] + iStopAngle = angleMap[this_feature_overlaps_feature['start']] - 1 + arc, arcArray = plotArc(start_angle=iStartAngle, + stop_angle=iStopAngle, + radius = start_radius, width=bar_thickness, + fc=colorMap[this_feature['featType']]) + myPatches.append(arc) + #first plot a little pink bar for the outline + tStartAngle = angleMap[this_feature_overlaps_feature['start']] + tStopAngle = angleMap[this_feature['stop']]+1 + angles = np.linspace(tStartAngle,tStopAngle,100) + widths = np.linspace(bar_thickness,0,100) + for j in range(len(angles)-1): + arc, arcArray = plotArc(start_angle=tStartAngle, + stop_angle=angles[j+1], + radius=start_radius+(bar_thickness-widths[j+1])/2, + width=widths[j+1], + fc=colorMap['spacebar']) + myPatches.append(arc) + #this bit plots the arrow triangles for the genes. + # Right now it makes each arrow only 1 degree in width and uses 100 segments to plot it. + # This resolution hasn't given me any artifacts thus far + tStartAngle = angleMap[this_feature_overlaps_feature['start']]-1 + tStopAngle = angleMap[this_feature['stop']] + angles = np.linspace(tStartAngle,tStopAngle,100) + widths = np.linspace(bar_thickness,0,100) + for j in range(len(angles)-1): + arc, arcArray = plotArc(start_angle=tStartAngle, + stop_angle=angles[j+1], + radius=start_radius+(bar_thickness-widths[j+1])/2, + width=widths[j+1], + fc=colorMap[this_feature['featType']]) + myPatches.append(arc) + return myPatches + +def get_angles(name, center_angle, kerning_angle): + num_chars = len(name.strip()) + if num_chars == 2: + start_pos = center_angle + stop_pos = center_angle + kerning_angle + return (start_pos, stop_pos) + if num_chars % 2 == 0: + start_pos = center_angle - (kerning_angle/2) - ((num_chars - 1)/2) + stop_pos = center_angle + (kerning_angle/2) + ((num_chars - 1)/2) + else: + start_pos = center_angle - (num_chars/2) + stop_pos = center_angle + (num_chars/2) + angles = np.arange(start_pos, stop_pos + kerning_angle, kerning_angle) + return angles + +def plot_gff(args, panelCircle, gff_path, radius): + #parse the gff file + gffParser = GFFParse(gff_path) + + # Because this size should be relative to the circle that it is plotted next + # to, define the start_radius as the place to work from, and the width of + # each track. + start_radius = radius + track_width = radius * 0.15 + + colorMap = {'gene': 'green', 'CDS': 'green', 'tRNA':'pink', 'rRNA':'red', + 'misc_feature':'purple', 'rep_origin':'orange', 'spacebar':'white', + 'ORF':'orange'} + augment = 0 + bar_thickness = 0.9 * track_width + # return these at the end + myPatches=[] + plot_order = [] + # this for loop relies on the gff features to already be sorted + i = 0 + idone = False + # we need to filter out the tRNAs since those are plotted last + plottable_features = gffParser.features.query("featType != 'tRNA' and featType != 'region' and featType != 'source'") + plottable_features = plottable_features.reset_index(drop=True) + while idone == False: + print("im in the overlap-pairing while loop i={}".format(i)) + # look ahead at all of the elements that overlap with the ith element + jdone = False + j = 1 + this_set_minimum_index = i + this_set_maximum_index = i + while jdone == False: + print("new i= {} j={} len={}".format(i, j, len(plottable_features))) + # first make sure that we haven't gone off the end of the dataframe + if i+j == len(plottable_features): + if i == len(plottable_features)-1: + # this is the last analysis, so set idone to true + # to finish after this + idone = True + # the last one can't be in its own group, so just add it solo + these_features = plottable_features.loc[this_set_minimum_index:this_set_maximum_index,] + plot_order.append(these_features.reset_index(drop=True)) + break + jdone == True + else: + # if the lmost of the next gene overlaps with the rmost of + # the current one, it overlaps and couple together + if plottable_features.loc[i+j, 'lmost'] < plottable_features.loc[i, 'rmost']: + # note that this feature overlaps with the current + this_set_maximum_index = i+j + # ... and we need to look at the next in line + j += 1 + else: + i += 1 + (this_set_maximum_index - this_set_minimum_index) + #add all of the things that grouped together once we don't find any more groups + these_features = plottable_features.loc[this_set_minimum_index:this_set_maximum_index,] + plot_order.append(these_features.reset_index(drop=True)) + jdone = True + + for feature_set in plot_order: + print(feature_set) + direction = feature_set_direction(feature_set) + print("direction = {}".format(direction)) + if direction == '+': + for i in range(len(feature_set)-1, -1,-1): + print("inside the plot for loop i = {}".format(i)) + this_feature = feature_set.loc[i,] + print("got this single feature") + #For the first element, just plot it normally + if i == len(feature_set) - 1: + #plot the annotation + print("Im in the first plot thing") + patches = plot_feature(this_feature, colorMap, start_radius, + bar_thickness, direction, + pd.Series([])) + for each in patches: + myPatches.append(each) + else: + print("now I'm plotting the other one") + overlapped_feature = feature_set.loc[i+1,] + print("this is the overlapped feature") + print(overlapped_feature) + patches = plot_feature(this_feature, colorMap, start_radius, + bar_thickness, direction, + overlapped_feature) + for each in patches: + myPatches.append(each) + + final_radius = start_radius + track_width + # Now we add all of the tRNAs to this to plot, do it last to overlay + # everything else + tRNAs = gffParser.features.query("featType == 'tRNA'") + tRNAs.reset_index(inplace=True, drop = True) + print(tRNAs) + tRNA_bar_thickness = bar_thickness * (0.8) + tRNA_start_radius = start_radius + ((tRNA_bar_thickness * (0.2))/2) + #tRNA_bar_thickness = bar_thickness + #tRNA_start_radius = start_radius + print("sequence_length = {}".format(sequence_length)) + angle_ranges = [] + for i in range(0,len(tRNAs)): + this_feature = tRNAs.loc[i,] + min_angle = angleMap[min(this_feature.loc['start'],this_feature.loc['stop'])] + max_angle = angleMap[max(this_feature.loc['start'],this_feature.loc['stop'])] + angle_ranges.append((min_angle, max_angle)) + patches = plot_feature(this_feature, colorMap, tRNA_start_radius, + tRNA_bar_thickness, direction, + pd.Series([])) + for each in patches: + myPatches.append(each) + print("angle ranges of tRNAs") + print(angle_ranges) + + # now plot the text of the all the things. do this last + # to cover the previous things we plotted + for i in range(len(gffParser.features.index)): + if gffParser.features.loc[i, 'featType'] not in ['region', 'source', 'tRNA']: + angles = [] + # to plot the centers, first figure out the center of the annotated + # gene by subtracting the end from the beginning + name = gffParser.features.loc[i,'name'] + middle_position = int(gffParser.features.loc[i,'center']) + # calculate the radius, because it changes depending on the track. + # Remember that the radius is the position at the bottom of the track, + # so we must use va=bottom. + center_radius = radius + (gffParser.features.loc[i,'track'] * track_width) + (bar_thickness/2) + # then, figure out the center angle of that position. + # I subtract one since I want to center this for the non-arrow + # part of the bar. + center_angle = angleMap[middle_position]-2 + count = 0 + while count < 1: + print() + print(name) + #figure out how many characters to plot + kerning_angle = 2.5 + print("putting in {} into get angles with center angle = {}, kerning_angle = {}".format(name, center_angle, kerning_angle)) + angles = get_angles(name, center_angle, kerning_angle) + char_min_angle = angles[0]-(kerning_angle / 2) + char_max_angle = angles[-1] - (kerning_angle / 2) + overlapping_angles = [] + # for every possible tRNA position, see if the minimum or + # max angle lies in the range of the text + for tRNA_overlap_range in angle_ranges: + for tRNA_position in tRNA_overlap_range: + if (char_min_angle < tRNA_position and + tRNA_position < char_max_angle): + #overlapping_angles contains the tRNA overlapping angles now + overlapping_angles.append(tRNA_position) + print("character angles {}".format(angles)) + print("overlapping angles {}".format(overlapping_angles)) + print("in while loop, center = {}".format(center_angle)) + if len(overlapping_angles) == 0: + count += 1 + else: + min_overlapping_angle = min(overlapping_angles) + print("min_overlapping_angle = {}".format(min_overlapping_angle)) + max_overlapping_angle = max(overlapping_angles) + print("max_overlapping_angle = {}".format(max_overlapping_angle)) + gene_start_angle = angleMap[gffParser.features.loc[i,'start']] + gene_stop_angle = angleMap[gffParser.features.loc[i,'stop']] + start_dif = abs(min_overlapping_angle - gene_start_angle) + stop_dif = abs(gene_stop_angle - max_overlapping_angle) + print("start_dif = {}, stop_dif = {}".format(start_dif, stop_dif)) + if start_dif > stop_dif: + center_angle = min_overlapping_angle - ((min_overlapping_angle - gene_start_angle)/2) + else: + # the extra -2 is to compensate for the taper. This only + # applies to the end of a gene stop. This code won't + # work properly for - strand things currently + center_angle = gene_stop_angle - 2 - ((gene_stop_angle - max_overlapping_angle)/2) + count += 1 + print("center angle after assignment = {}".format(center_angle)) + #There is some overlap in the text and we need to + # figure out the min overlap and max overlap. Then figure + # out if the angle between min and start or max and end is greater + # use whichever is greater as the new place to put the text, then calculate the center + angles = get_angles(name, center_angle, kerning_angle) + text_angle = angles[-1] - angles[0] + # this prints all the characters in their positions and rotates + print_count = 0 + for j in range(len(angles)): + this_width = gffParser.features.loc[i,'width'] + if not text_angle > (angleMap[this_width] - 1): + if center_angle > 95 and center_angle < 265: + rotation = (-1 * center_angle) - 180 + this_char = name[len(angles) - 1 - j] + else: + rotation = -1 * center_angle + print("name: {} j: {}".format(name, j)) + this_char = name[j] + this_angle = angles[j] + # now calculate the absolute x position + x_pos = -np.cos(np.radians(this_angle+90)) * center_radius + # now calculate the absolute y position + y_pos = np.sin(np.radians(this_angle+90)) * center_radius + #rotate the text if necessary + panelCircle.text(x_pos, y_pos, this_char, fontsize=10, + ha='center', va='center', + rotation = rotation, + family = 'monospace', + color = 'white') + # this block handles the case where the text is too small to put + # parallel with the mitochondrial circle, so it is perpindicular + else: + if print_count == 0: + if gffParser.features.loc[i,'strand'] == '+': + gene_start_angle = angleMap[gffParser.features.loc[i,'start']] + gene_stop_angle = angleMap[gffParser.features.loc[i,'stop']] + center_angle = gene_stop_angle - ((gene_stop_angle-gene_start_angle)/2) + # now calculate the absolute x position + x_pos = -np.cos(np.radians(center_angle+90)) * center_radius + # now calculate the absolute y position + y_pos = np.sin(np.radians(center_angle+90)) * center_radius + if (center_angle > 0 and center_angle < 180): + rotation = -1 * center_angle + 90 + else: + rotation = -1 * center_angle - 90 + panelCircle.text(x_pos, y_pos, name, fontsize=6, + ha='center', va='center', + rotation = rotation, + family = 'monospace', + color = 'white') + + print_count += 1 + + return panelCircle, myPatches, final_radius + +def run(args): + redwood(args) diff --git a/pauvre/stats.py b/pauvre/stats.py new file mode 100644 index 0000000..d1cdbdb --- /dev/null +++ b/pauvre/stats.py @@ -0,0 +1,250 @@ +#!/usr/bin/env python +# -*- coding: utf-8 -*- + +# pauvre - just a pore PhD student's plotting package +# Copyright (c) 2016-2017 Darrin T. Schultz. All rights reserved. +# +# This file is part of pauvre. +# +# pauvre is free software: you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation, either version 3 of the License, or +# (at your option) any later version. +# +# pauvre is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with pauvre. If not, see <http://www.gnu.org/licenses/>. + +# TODO: make a function to nicely print out the pandas dataframes + +""" +Program: pauvre stats + +example usage/output: + - For example, if I want to know the number of bases and reads with AT LEAST + PHRED score 5 and AT LEAST length of 500, I can run the program as you + see below and look at the cells highlighted with <braces>. + - We find that there are 968,653 basepairs contained in 859 reads that + fit meanReadQuality >= Q5, readLen >= 500. + > pauvre marginplot --fastq miniDSMN15.fastq + + >numReads: 1000 + >numBasepairs: 1029114 + >meanLen: 1029.114 + >medianLen: 875.5 + >minLen: 11 + >maxLen: 5337 + >N50: 1278 + >L50: 296 + + > Basepairs >= bin by mean PHRED and length + >minLen Q0 Q5 Q10 Q15 Q17.5 Q20 Q21.5 Q25 Q25.5 Q30 + > 0 1029114 1010681 935366 429279 143948 25139 3668 2938 2000 0 + > 500 984212 <968653> 904787 421307 142003 24417 3668 2938 2000 0 + > 1000 659842 649319 616788 300948 103122 17251 2000 2000 2000 0 + > et cetera... + > Number of reads >= bin by mean Phred+Len + >minLen Q0 Q5 Q10 Q15 Q17.5 Q20 Q21.5 Q25 Q25.5 Q30 + > 0 1000 969 865 366 118 22 3 2 1 0 + > 500 873 <859> 789 347 113 20 3 2 1 0 + > 1000 424 418 396 187 62 11 1 1 1 0 + > et cetera... +""" + + +from pauvre.functions import parse_fastq_length_meanqual +import os +import pandas as pd +import numpy as np + + +def stats(df, fastqName, histogram=False): + """ + arguments: + <df> + - a pandas dataframe with cols ['length', 'meanQual'] that contains read + length and quality information + <fastqName> + - just the name of the fastq file. This is used for printing out headers + for the data tables. + + purpose: + Calculate and output various stats about this fastq file. Currently + this program reports: + - Total number of reads + - Total number of basepairs + - mean + - median + - min + - max + - N50 + - A table of num basepairs filtered by min mean PHRED and length + - A table of num reads filtered by the same as above ^ + + example usage/output: + - For example, if I want to know the number of bases and reads with AT LEAST + PHRED score 5 and AT LEAST length of 500, I can run the program as you + see below and look at the cells highlighted with <braces>. + - We find that there are 968,653 basepairs contained in 859 reads that + fit meanReadQuality >= Q5, readLen >= 500. + > pauvre marginplot --fastq miniDSMN15.fastq + + >numReads: 1000 + >numBasepairs: 1029114 + >meanLen: 1029.114 + >medianLen: 875.5 + >minLen: 11 + >maxLen: 5337 + >N50: 1278 + >L50: 296 + + > Basepairs >= bin by mean PHRED and length + >minLen Q0 Q5 Q10 Q15 Q17.5 Q20 Q21.5 Q25 Q25.5 Q30 + > 0 1029114 1010681 935366 429279 143948 25139 3668 2938 2000 0 + > 500 984212 <968653> 904787 421307 142003 24417 3668 2938 2000 0 + > 1000 659842 649319 616788 300948 103122 17251 2000 2000 2000 0 + > et cetera... + > Number of reads >= bin by mean Phred+Len + >minLen Q0 Q5 Q10 Q15 Q17.5 Q20 Q21.5 Q25 Q25.5 Q30 + > 0 1000 969 865 366 118 22 3 2 1 0 + > 500 873 <859> 789 347 113 20 3 2 1 0 + > 1000 424 418 396 187 62 11 1 1 1 0 + > et cetera... + """ + fastqBase = os.path.basename(fastqName) + + analysis_sizes = [0, 1000, 5000, 10000] + print_string = "" + for this_size in analysis_sizes: + these_lengths = df.loc[df["length"] >= this_size, "length"] + if len(these_lengths) > 0: + print_string += "# Fastq stats for {}, reads >= {}bp\n".format(fastqBase, this_size) + print_string += "numReads: {}\n".format(len(these_lengths)) + print_string += "%totalNumReads: {0:.2f}\n".format((len(these_lengths) / len(df)) * 100) + print_string += "numBasepairs: {}\n".format(sum(these_lengths)) + print_string += "%totalBasepairs: {0:.2f}\n".format( + (sum(these_lengths) / sum(df["length"])) * 100) + print_string += "meanLen: {}\n".format(np.mean(these_lengths)) + print_string += "medianLen: {}\n".format(np.median(these_lengths)) + print_string += "minLen: {}\n".format(min(these_lengths)) + maxLen = max(these_lengths) + print_string += "maxLen: {}\n".format(maxLen) + + # calculate the N50 + fiftypercent = 0.5 * sum(these_lengths) + lenSum = 0 + count = 0 + for val in sorted(these_lengths, reverse=True): + lenSum += val + count += 1 + if lenSum >= fiftypercent: + print_string += "N50: {}\n".format(int(val)) + print_string += "L50: {}\n".format(count) + break + print_string += "\n" + + print_string += lengthQual_table(df) + + if histogram: # now make a histogram with read lengths + histogram_lengths(df["length"], fastqBase.split('.')[0]) + print(print_string) + + +def lengthQual_table(df): + """Create a table with lengths/basepairs on columns and qualities on rows""" + # This block calculates the number of length bins for this data set + lengthBinList = [] + size_map = [(1000, 250), + (10000, 500), + (40000, 1000), + (100000, 5000), + (500000, 20000), + (1000000, 50000), + (10000000000, 100000)] + # first, figure out where we will start the table + minlen = min(df["length"]) + current_val = 0 + firstDone = False + for this_max_size, this_size_step in size_map: + for this_bin in range(current_val, this_max_size, this_size_step): + if minlen < this_bin: + if not firstDone: + lengthBinList.append(prev) + firstDone = True + lengthBinList.append(this_bin) + prev = this_bin + current_val = this_max_size + # now figure out the largest bin + maxLen = df["length"].max() + first_index_gt_maxLen = next(i for i, v in enumerate(lengthBinList) if v > maxLen) + 1 + lengthBinList = lengthBinList[0:first_index_gt_maxLen] + + qualBinList = [] + increment_by = 1 + while len(qualBinList) == 0 or len(qualBinList) > 15: + # now set up the bins for mean PHRED + minQual = int(np.floor(min(df["meanQual"]))) + maxQual = int(np.ceil(max(df["meanQual"]))) + qualBinList = list(np.arange(minQual, maxQual + increment_by, increment_by)) + increment_by += 0.25 + + # now make a table of read lengths + bpTots = [] + readnumTots = [] + for row in range(len(lengthBinList)): + dataNums = [] + readNums = [] + for column in range(len(qualBinList)): + thisQuery = df.query("length >= {} and meanQual >= {}".format( + lengthBinList[row], qualBinList[column])) + dataNums.append(sum(thisQuery['length'])) + readNums.append(len(thisQuery.index)) + bpTots.append(dataNums) + readnumTots.append(readNums) + + tables = {"Basepairs >= bin by mean PHRED and length": bpTots, + "Number of reads >= bin by mean Phred+Len": readnumTots} + print_table = "" + for key in sorted(tables): + # make a dataframe of our basepair distribution table + dataDf = pd.DataFrame(tables[key], columns=["Q{}".format(x) for x in qualBinList]) + # add the min lengths as a column + dataDf.insert(0, 'minLen', lengthBinList) + print_table += pretty_print_table(dataDf, key) + return print_table + + +def histogram_lengths(length, name_prefix): + """Create a histogram of read counts per length.""" + counts = length.value_counts().to_frame(name="readCount") + counts.index.rename('readLen', inplace=True) + counts.sort_index(inplace=True) + counts.to_csv("{}.hist.csv".format(name_prefix), index=True) + + +def pretty_print_table(df, title): + print_string = "" + dataframeStr = df.to_string(index=False) + # this is the char width of the whole table printed + lendataframeStr = len(dataframeStr.split('\n')[0]) + # this is the char width of the minlen portion of the printed table + minLenLen = len(dataframeStr.split()[0]) + blank = " " * minLenLen + # center the text on this offset as the table title + txtoffset = lendataframeStr - minLenLen + print_string += "\n{}{:^{offset}}\n".format( + blank, title, offset=txtoffset) + print_string += dataframeStr + "\n" + return print_string + + +def run(args): + """This just opens the fastq file and passes the info to the stats() function. + This is a wrapper function that is accessed by pauvre_main. + Useful since we can call stats() independently from other pauvre programs.""" + df = parse_fastq_length_meanqual(args.fastq) + stats(df, args.fastq, histogram=args.histogram) diff --git a/pauvre/synplot.py b/pauvre/synplot.py new file mode 100644 index 0000000..724a187 --- /dev/null +++ b/pauvre/synplot.py @@ -0,0 +1,578 @@ +#!/usr/bin/env python +# -*- coding: utf-8 -*- + +# pauvre - just a pore plotting package +# Copyright (c) 2016-2018 Darrin T. Schultz. All rights reserved. +# +# This file is part of pauvre. +# +# pauvre is free software: you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation, either version 3 of the License, or +# (at your option) any later version. +# +# pauvre is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with pauvre. If not, see <http://www.gnu.org/licenses/>. + +# TODO +# import the pauvre rcParams +# Cleanup everything + +import pandas as pd +pd.set_option('display.max_columns', 500) +pd.set_option('display.width', 1000) +import numpy as np +import os +import pauvre.rcparams as rc +from pauvre.functions import GFFParse, print_images, timestamp +from pauvre import gfftools +from pauvre.lsi.lsi import intersection +import progressbar +import platform +import sys +import time +import warnings + +# for the shuffling algorithm +from itertools import product + +# Biopython stuff +from Bio import SeqIO +import Bio.SubsMat.MatrixInfo as MI + +# following this tutorial to install helvetica +# https://github.com/olgabot/sciencemeetproductivity.tumblr.com/blob/master/posts/2012/11/how-to-set-helvetica-as-the-default-sans-serif-font-in.md +global hfont +hfont = {'fontname':'Helvetica'} + +import matplotlib +matplotlib.use('agg') +import matplotlib.pyplot as plt +from matplotlib.colors import LinearSegmentedColormap, Normalize +import matplotlib.patches as patches + +def shuffle_optimize_gffs(args, GFFs): + """This function takes in a list of GFF objects and reshuffles the + individual files such that the resulting sequence of GFF files has + the minimum number of intersections when plotting synteny + + if args.optimum_order, then the program will find the global minimum + arrangement using the first GFF file as the anchor. + + if not args.optimum_order, then the program will find the local minimum + shuffle between every input pair of GFF files to plot in the best way possible + given the input order. + + returns a list of GFF files from which the user can calculate plotting coordinates + """ + # we use the first-input gff as the topmost sequence, + # and then find the best synteny match for the remaining sequences + shuffled_gffs = [] + if args.optimum_order: + firstgff = GFFs[0] + # save the first gff file unadultered + shuffled_gffs.append(firstgff) + nextgffs = GFFs[1:] + while len(nextgffs) > 0: + obs_list = [] + for i in range(len(nextgffs)): + # every observation will be stored here as a tuple. + # zeroth element is the num intersections with the current gff + # first element is the index of nextgffs + # second element is the GFF object + shuffles = nextgffs[i].shuffle() + for k in range(len(shuffles)): + shuf = shuffles[k] + coords = firstgff.couple(shuf, this_y = 0, other_y = 1) + num_inters = len(intersection(coords)) + obs_list.append((num_inters, i, shuf)) + #print(obs_list[-1]) + intersections, gffixs, shufs = zip(*obs_list) + # get the index of the shuffled gff with the least number of + # intersections to the current one against which we are comparing + #print("intersections", intersections) + selected_ix = intersections.index(min(intersections)) + # save this gff to shuffled gffs to use later for plotting + shuffled_gffs.append(shufs[selected_ix]) + # remove the origin of the shuffled gff from nextgffs + del nextgffs[gffixs[selected_ix]] + #print("global minimum was {} intersections".format(min(intersections))) + # now update the firstgff to the latest shuffled one we collected + firstgff = shufs[selected_ix] + # plot the gff files in the order in which you input them, + # but shuffle them to the order with least intersections + else: + # first we need to find the best arrangement by finding the combinations + # that share the most unique genes + genes_series = [GFFs[i].get_unique_genes() for i in range(len(GFFs))] + combinations_indices = [0] + remaining_indices = list(range(1, len(GFFs))) + done = False + biggest_intersection_index = -1 + biggest_intersection_value = 0 + current_remaining_indices_index = 0 + while not done: + #get the len of the intersection + #print("combinations_indices: {}".format(combinations_indices)) + #print("current_remaining_indices_index: {}".format(current_remaining_indices_index)) + #print("remaining_indices[current_remaining_indices_index]: {}".format(remaining_indices[current_remaining_indices_index])) + #print("genes_series[remaining_indices[current_remaining_indices_index]]: {}".format(genes_series[remaining_indices[current_remaining_indices_index]])) + this_intersection_value = len(genes_series[combinations_indices[-1]] &\ + genes_series[remaining_indices[current_remaining_indices_index]]) + if this_intersection_value > biggest_intersection_value: + biggest_intersection_value = this_intersection_value + biggest_intersection_index = current_remaining_indices_index + if current_remaining_indices_index < len(remaining_indices)-1: + current_remaining_indices_index += 1 + else: + combinations_indices.append(remaining_indices[biggest_intersection_index]) + del remaining_indices[biggest_intersection_index] + biggest_intersection_value = 0 + current_remaining_indices_index = 0 + biggest_intersection_index = -1 + if len(remaining_indices) == 0: + done = True + # The best order of genes with the most shared genes + #I don't know if this is really that useful though since many species will overlap. + # In a future implementation of this program it might be necessary to do sub-sorting of this list to get the lest number of line intersections + print("The best gene combination is {}".format(combinations_indices)) + # now we rearrange the GFFs to the best order + new_GFFs = [GFFs[i] for i in combinations_indices] + # If we're adding another copy of the top one, add it here before shuffling + if args.sandwich: + new_GFFs.append(new_GFFs[0]) + shuffles = [new_GFFs[i].shuffle() for i in range(len(new_GFFs))] + #print([len(shuffles[i]) for i in range(len(shuffles))]) + cumulative_least_shuffled_value = 999999999999999999999999999999999999 + bar = progressbar.ProgressBar() + for combination in bar(list(product(*shuffles))): + num_intersections = [] + #I have no idea what this list comprehension does anymore. + first_genes = [str(combination[i].features[combination[i].features['featType'].isin(['gene', 'rRNA', 'CDS', 'tRNA'])]['name'].head(n=1)).split()[1] for i in range(len(combination))] + # skip to the next iteration if all the genes aren't the same + if args.start_with_aligned_genes and len(set(first_genes)) != 1: + continue + for i in range(len(new_GFFs) - 1): + j = i + 1 + #figure out the best shuffle the next sequence + coords = combination[i].couple(combination[j], this_y = i, other_y = j) + num_intersections.append(len(intersection(coords))) + if sum(num_intersections) < cumulative_least_shuffled_value: + shuffled_gffs = combination + cumulative_least_shuffled_value = sum(num_intersections) + print("\nnew fewest global intersections: {}".format(sum(num_intersections))) + return shuffled_gffs + +def black_colormap(): + zeroone = np.linspace(0, 1, 100) + colorrange = [(0,0,0,x) for x in zeroone] + minblosum = min(MI.blosum62.values()) + maxblosum = max(MI.blosum62.values()) + colormap = {i: colorrange[int(translate(i, minblosum, maxblosum, 0, 99))] + for i in range(minblosum, maxblosum + 1, 1)} + return colormap + +def translate(value, left_min, left_max, right_min, right_max): + """This code maps values from the left range and interpolates to the + corresponding range on the right. This is used to translate the amino acid + substition matrix scores to a scale between 0 and 1 for making alphamaps. + + I don't know if this works if the directionality of the ranges are swapped. + IE [5, -10] mapped to [0, 1] + + args: + <value> - the value in [<left_min>:<left_max>] to scale between + [<right_min>:<right_max>] + <left_min> - the 'min' of the left (source) range + <left_max> - the 'max' of the left (source) range + <right_min> - the 'min' of the right (target) range + <right_max> the 'max' of the right (target) range + + output: + the <value>(float) scaled between <right_min> and <right_max> + """ + # Figure out how 'wide' each range is + left_span = left_max - left_min + right_span = right_max - right_min + + # Convert the left range into a 0-1 range (float) + value_scaled = float(value - left_min) / float(left_span) + + # Convert the 0-1 range into a value in the right range. + return right_min + (value_scaled * right_span) + +def _samplename_warning(samplename, filename): + warnings.warn(""" + There is a sample in your fasta alignments that + does not match the samplenames from the gff filenames. Please + rename this samplename to not contain any spaces or underscores. + IE for sample 'NC016', '>NC_016_-_ND6' will not work but + '>NC016_-_ND6' will work. + + Erroneous name: {} + File: {}""".format(samplename, os.path.basename(filename))) + +def _samplelength_warning(samplename, genename, featType, gfflen, alnlen): + raise Warning("""The length of the protein alignment isn't the same as the + length in the GFF file for the sample. Maybe you used a sequence in the + alignment that is different from the annotation source? Check if the + stop codons are deleted/inserted from either the GFF or alignment. The + protein alignment length should be 3 less than the gff length if the + stop codons were included in the gff annotation. + + Another possibility is that the RNA that is generating this error has + post-transcriptional modifications that complete the stop codon. In + this case, you can fudge the stop position in the gff file (increase + the value by one or two) to make the plotting script run. + + Sample name: {} + feat type: {} + gene name: {} + gff length: {} + aln length: {}""".format(samplename, featType, genename, gfflen, alnlen)) + +def _nosample_warning(samplename, alngenename, gffnames): + raise Warning("""One of the gff files doesn't contain a sequence that the + alignment file indicates should be present. Either the alignment file + is misnamed or the sequence name in the GFF file is not what you + intended. + + Sample name: {} + aln gene name: {} + gff names: {}""".format(samplename, alngenename, gffnames)) + +def get_alignments(args): + """ + this reads in all the alignments from the fasta directory. + """ + # This is a dict object with key as + filelist = {os.path.splitext(x)[0]:os.path.join(os.path.abspath(args.aln_dir), x) + for x in os.listdir(args.aln_dir) + if os.path.splitext(x)[1]} + print("file list is:") + print(filelist) + # one entry in seq_dict is: + # {seqname: {"featType": featType, + # "seqs": {samplename: seq}, + # "indices": {samplename: indices}} + seqs_dict = {} + # go through every gene in the genelist + for genename in filelist: + thisFeatType = "" + seqs_list = {} + indices_list = {} + #print("We found the following samplenames: {}".format(args.samplenames), file = sys.stderr) + # this block handles reading in the fasta files to interpret the alignments + for record in SeqIO.parse(filelist[genename], "fasta"): + # get the sample name and make sure that the sample names match + samplename = record.id.replace("_", " ").split()[0] + #print("Looking at sample: {}".format(samplename), file=sys.stderr) + if samplename not in args.samplenames: + #if there's a sequence in the fasta that we did not specify + # in the command, ignore that sequence + _samplename_warning(samplename, filelist[genename]) + else: + # first, get the sample features + samplegff = args.samplenames[samplename].features + featType = samplegff.loc[samplegff['name'] == genename, 'featType'].to_string().split()[1] + # now we determine if this is a prot alignment or a nucleotide aln + if featType in ['gene', 'CDS']: + final_seq = "".join([x*3 for x in record.seq]) + elif featType == 'rRNA': + final_seq = str(record.seq) + # we now need to verify that the protein sequence is + # the length of the gene in the gff file. Do this by removing + # gaps in the alignment + gfffilt = samplegff.loc[samplegff['name'] == genename, 'width'] + if len(gfffilt) == 0: + _nosample_warning(samplename, genename, list(samplegff['name'])) + gfflen = int(gfffilt) + aln = final_seq.replace("-", "") + alnlen = len(aln) + if gfflen != alnlen: + _samplelength_warning(samplename, genename, featType, gfflen, alnlen) + # If we've made it this far without any errors, then incorporate the + # indices for each index + #print("start_index", start_index) + final_indices = [-1] * len(final_seq) + # up until the next for loop, here we are determining which + # direction to move in. Reverse sequences decrease from the start + strand = samplegff.loc[samplegff['name'] == genename, 'strand'].to_string().split()[1] + if strand == '+': + direction = 1 + start_index = int(samplegff.loc[samplegff['name'] == genename, 'start']) + elif strand == '-': + direction = -1 + start_index = int(samplegff.loc[samplegff['name'] == genename, 'stop']) + for i in range(len(final_indices)): + if final_seq[i] != '-': + final_indices[i] = start_index + start_index = start_index + (1 * direction) + seqs_list[samplename] = final_seq + if args.center_on: + center_coord = int(args.samplenames[samplename].features.loc[args.samplenames[samplename].features['name'] == args.center_on, 'center']) + indices_list[samplename] = np.array(final_indices) - center_coord + else: + indices_list[samplename] = final_indices + thisFeatType = featType + seqs_dict[genename] = {"featType": thisFeatType, + "seqs": seqs_list, + "indices": indices_list} + return seqs_dict + + +def plot_synteny(seq1, ind1, seq2, ind2, y1, y2, + featType, matrix, cm, seqname): + """This function plots all the lines for each""" + print("PLOTTING SYNTENY") + myPatches = [] + colormap = {"COX1": '#c0d9ef', + "L": '#e8f1df', + "I": '#f7dedc', + "16S": '#ff2e00', + "12S": '#ffc239', + "cal": '#ffff54', + "COX2": "#7fce66", + "ND2": "#00ae60", + "COX3": "#00aeec", + "ND1": "#006fbb", + "*": "#ffffff", + "(": "#ded9c5", + "Q": "#ffc294", + "?": "#b5a2c4", + "ND4": "#968b5a", + "ND3": "#00fc65", + "ND4L": "#00dcf0", + "ND6": "#ff994e", + "ND5": "#dc31e6", + "X": "#d8d8d8", + "G": "#abdce7", + "CYTB": "#ff0059"} + + for i in range(len(seq1)): + feat1 = seq1[i] + feat2 = seq2[i] + if feat1 != '-' and feat2 != '-': + xs = [] + ys = [] + xs.append(ind1[i]) # top left + ys.append(y1) + xs.append(ind1[i] + 1) # top right + ys.append(y1) + xs.append(ind2[i] + 1) # bottom right + ys.append(y2) + xs.append(ind2[i]) #bottom left + ys.append(y2) + xs.append(ind1[i]) #top left + ys.append(y1) + alpha = 0.5 + if featType in ['CDS', 'gene']: + try: + val = matrix[(feat1, feat2)] + except: + val = matrix[(feat2, feat1)] + color = cm[val] + alpha = color[-1] + elif featType == 'rRNA': + if feat1 != feat2: + alpha=0 + color = colormap[seqname] + stack1 = np.column_stack([xs, ys]) + myPatches.append(patches.Polygon(stack1, closed=True, + color = color, + alpha = alpha, + lw=0)) + return myPatches + +def synplot(args): + rc.update_rcParams() + print(args) + GFFs = [] + for i in range(len(args.gff_paths)): + gffpath = args.gff_paths[i] + species = "" + if args.gff_labels: + species = args.gff_labels[i] + GFFs.append(GFFParse(gffpath, args.stop_codons, species)) + + # find the optimum shuffling pattern + # and add a list of samplenames to the args + optGFFs = shuffle_optimize_gffs(args, GFFs) + # Make a sandwich for a circular comparison + setattr(args, 'samplenames', {gff.samplename:gff for gff in optGFFs}) + + # now get the cms and normalize + #cms, normalize = gen_colormaps() + cm = black_colormap() + + ## and we get the protein alignment objects + # {seqname: {"featType": featType, + # "seqs": {samplename: seq}, + # "indices": {samplename: indices}} + print("getting alignments") + seqs_dict = get_alignments(args) + print("done getting alignments") + print("seqs_dict is:") + print(seqs_dict) + + # set the figure dimensions + if args.ratio: + figWidth = args.ratio[0] + 1 + figHeight = args.ratio[1] + 1 + #set the panel dimensions + panelWidth = args.ratio[0] + panelHeight = args.ratio[1] + + else: + figWidth = 2.5*4 + figHeight = 5 + #set the panel dimensions + panelWidth = 2.5 * 3 + panelHeight = 1.75 + + figure = plt.figure(figsize=(figWidth,figHeight)) + + + #find the margins to center the panel in figure + leftMargin = (figWidth - panelWidth)/2 + bottomMargin = ((figHeight - panelHeight)/2) + 0.25 + + panel0=plt.axes([leftMargin/figWidth, #left + bottomMargin/figHeight, #bottom + panelWidth/figWidth, #width + panelHeight/figHeight]) #height + panel0.tick_params(axis='both',which='both',\ + bottom='on', labelbottom='off',\ + left='off', labelleft='off', \ + right='off', labelright='off',\ + top='off', labeltop='off') + #turn off some of the axes + panel0.spines['top'].set_visible(False) + panel0.spines['right'].set_visible(False) + panel0.spines['left'].set_visible(False) + + # {seqname: {"featType": featType, + # "seqs": {samplename: seq}, + # "indices": {samplename: indices}} + allPatches = [] + for seqname in seqs_dict: + #go through in order + print(seqname) + for i in range(0, len(optGFFs) - 1): + samplei = optGFFs[i].samplename + samplej = optGFFs[i+1].samplename + if samplei in seqs_dict[seqname]["seqs"].keys() and\ + samplej in seqs_dict[seqname]["seqs"].keys(): + featType = seqs_dict[seqname]["featType"] + seq1 = seqs_dict[seqname]["seqs"][samplei] + ind1 = seqs_dict[seqname]["indices"][samplei] + seq2 = seqs_dict[seqname]["seqs"][samplej] + ind2 = seqs_dict[seqname]["indices"][samplej] + # this is the top one, just leave it at the actual value since + # the base of the annotations start on the integer + y1 = len(optGFFs) - 1 - i + # this needs to be increased by the bar_thickness (0.9 * track_width in this case, or 0.09) + y2 = len(optGFFs) - 2 - i + myPatches = plot_synteny(seq1, ind1, seq2, ind2, y1, y2, + featType, MI.blosum62, cm, seqname) + for patch in myPatches: + allPatches.append(patch) + + #print("len allPatches", len(allPatches)) + # this bit plots the simplified lines in the centers + ## first we plot all the lines from the centers of matching genes. + ## This is temporary. Or maybe it should be a feature + #verts = [] + #for i in range(len(optGFFs) - 1): + # j = i + 1 + # coords = optGFFs[i].couple(optGFFs[j], this_y = len(optGFFs) - i, other_y = len(optGFFs) - i - 1) + # for coord in coords: + # verts.append(coord) + + #for vert in verts: + # xxyy = list(zip(*vert)) + # panel0.plot(xxyy[0], xxyy[1]) + # now we plot horizontal lines showing the length of the mitochondrial sequence + maxseqlen = 0 + # this is a heuristic for trackwidth of what looks good in my experience + track_multiplier = 0.08 + if args.ratio: + track_width = track_multiplier * panelWidth + else: + #0.032 if only 3 + #0.062 if 6 + track_width = track_multiplier * panelWidth + for i in range(len(optGFFs)): + gff = optGFFs[i] + #print(" - Plotting panels of {}".format(gff), file = sys.stderr) + x_offset = 0 + #print(" - Detecting if centering is on.".format(gff), file = sys.stderr) + if args.center_on: + x_offset = -1 * int(gff.features.loc[gff.features['name'] == args.center_on, 'center']) + gff = gfftools.x_offset_gff(gff, x_offset) + #print(" - Centering is on.".format(gff), file = sys.stderr) + #print(" - Plotting horizontal portions with gffplot_horizontal.".format(gff), file = sys.stderr) + panel0, patches = gfftools.gffplot_horizontal( + figure, panel0, args, gff, + track_width = track_width, + start_y = len(optGFFs) - i - 1 - ((0.9 * track_width)/2), + x_offset = x_offset) + #print("{} patches came out of gffplot_horizontal()".format(len(patches))) + seq_name = gff.features['sequence'].unique()[0] + if args.gff_labels: + seq_name = "$\it{{{0}}}$".format(gff.species) + panel0.text(0 + x_offset, len(optGFFs) - i - 1 + (0.18/2), + seq_name, fontsize = 12, + ha='left', va='bottom', + color = 'black', + zorder = 100) + + if gff.seqlen > maxseqlen: + maxseqlen = gff.seqlen + xs = (1 + x_offset, gff.seqlen + x_offset) + #ys = [len(optGFFs) - i - 1 + (0.09/2)]*2 + ys = [len(optGFFs) - i - 1]*2 + + #print(" - Plotting lines.".format(gff), file = sys.stderr) + panel0.plot(xs, ys, color='black', zorder = -9) + #print(" - Adding patches.".format(gff), file = sys.stderr) + #print("Right before adding patches there are {} patches.".format(len(patches))) + for i in range(len(patches)): + patch = patches[i] + allPatches.append(patch) + for patch in allPatches: + panel0.add_patch(patch) + panel0.set_xlabel("position (bp)") + + #panel0.set_xlim([-15000, int(np.ceil(maxseqlen/1000)*1000)]) + panel0.set_ylim([ 0 - ( (track_width/2) * 1.1 ), + len(optGFFs) - 1 + ( (track_width/2) * 1.1 )]) + + # This removes the text labels from the plot + labels = [item.get_text() for item in panel0.get_xticklabels()] + empty_string_labels = ['']*len(labels) + print(" - Setting tick labels.".format(gff), file = sys.stderr) + + panel0.set_xticklabels(empty_string_labels) + + # Print image(s) + print(" - Running print_images.".format(gff), file = sys.stderr) + if args.BASENAME is None: + file_base = "synteny" + else: + file_base = args.BASENAME + print_images( + base=file_base, + image_formats=args.fileform, + no_timestamp = args.no_timestamp, + dpi=args.dpi, + transparent=args.transparent) + + +def run(args): + synplot(args) diff --git a/pauvre/tests/__init__.py b/pauvre/tests/__init__.py new file mode 100644 index 0000000..e69de29 --- /dev/null +++ b/pauvre/tests/__init__.py diff --git a/pauvre/tests/test_synplot.py b/pauvre/tests/test_synplot.py new file mode 100644 index 0000000..186ef0c --- /dev/null +++ b/pauvre/tests/test_synplot.py @@ -0,0 +1,66 @@ +#!/usr/bin/env python +# -*- coding: utf-8 -*- + +# pauvre - just a pore plotting package +# Copyright (c) 2016-2018 Darrin T. Schultz. All rights reserved. +# +# This file is part of pauvre. +# +# pauvre is free software: you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation, either version 3 of the License, or +# (at your option) any later version. +# +# pauvre is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with pauvre. If not, see <http://www.gnu.org/licenses/>. + +import os +import subprocess as sp +import unittest + +class libSeq_test_case(unittest.TestCase): + """Tests that different features of synplot work correctly + """ + def setUp(self): + # Here we must safely make the test directory if it does + # not exist + scriptdir = os.path.dirname(os.path.realpath(__file__)) + testoutputdir = os.path.join(scriptdir, "testresults") + if not os.path.exists(testoutputdir): + os.makedirs(testoutputdir) + # now we see if the specific subdirectory for the program + # we are testing exists + self.thisoutdir = os.path.join(testoutputdir, "synplot") + if not os.path.exists(self.thisoutdir): + os.makedirs(self.thisoutdir) + # now we change our working directory to the + # specific subdirectory for this file + + self.aln_dir = os.path.join(scriptdir, "testdata/alignments/") + self.gff1 = os.path.join(scriptdir, "testdata/gff_files/Bf201706.gff") + self.gff2 = os.path.join(scriptdir, "testdata/gff_files/JN392469.gff") + self.gff3 = os.path.join(scriptdir, "testdata/gff_files/NC016117.gff") + + + def test_normal_plotting_scenario(self): + """This verifies that the LibSeq class is constructed with all of the + parameters that are present in the meraculous config files""" + os.chdir(self.thisoutdir) + thiscommand = """pauvre synplot --aln_dir {0} \ + --fileform pdf \ + --gff_paths {1} {2} {3} \ + --center_on COX1 \ + --gff_labels "B. forskalii" "P. bachei" "M. leidyi" """.format( + self.aln_dir, self.gff1, + self.gff2, self.gff3) + + data = sp.Popen(thiscommand, shell = True, + stdout=sp.PIPE) + print("The data is:", data.communicate()[0].decode()) + print("The return code is: ", data.returncode) + self.assertEqual(0, int(data.returncode)) diff --git a/pauvre/version.py b/pauvre/version.py new file mode 100644 index 0000000..3a6a9f4 --- /dev/null +++ b/pauvre/version.py @@ -0,0 +1,4 @@ +# -*- coding: utf-8 -*- + +__version__ = "0.2.3" + diff --git a/scripts/test.sh b/scripts/test.sh new file mode 100644 index 0000000..d419aea --- /dev/null +++ b/scripts/test.sh @@ -0,0 +1,14 @@ +set -ev + +git clone https://github.com/wdecoster/nanotest.git + +pauvre -h + +pauvre marginplot -h +pauvre marginplot -f nanotest/reads.fastq.gz +pauvre marginplot -f nanotest/reads.fastq.gz -t "my title" --filt_maxlen 30000 +pauvre marginplot -f nanotest/reads.fastq.gz -y --filt_minqual 7 --fileform svg + +pauvre stats -h +pauvre stats -f nanotest/reads.fastq.gz +pauvre stats -f nanotest/reads.fastq.gz -H --filt_maxqual 12 diff --git a/setup.cfg b/setup.cfg new file mode 100644 index 0000000..8bfd5a1 --- /dev/null +++ b/setup.cfg @@ -0,0 +1,4 @@ +[egg_info] +tag_build = +tag_date = 0 + diff --git a/setup.py b/setup.py new file mode 100755 index 0000000..7e38014 --- /dev/null +++ b/setup.py @@ -0,0 +1,70 @@ +#!/usr/bin/env python3 +# -*- coding: utf-8 -*- + +# pauvre +# Copyright (c) 2016-2020 Darrin T. Schultz. +# +# This file is part of pauvre. +# +# pauvre is free software: you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation, either version 3 of the License, or +# (at your option) any later version. +# +# pauvre is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with pauvre. If not, see <http://www.gnu.org/licenses/>. + +# Tutorials on how to setup python package here: +# - http://python-packaging.readthedocs.io/en/latest/testing.html +# - https://jeffknupp.com/blog/2013/08/16/open-sourcing-a-python-project-the-right-way/ + +import os +from setuptools import setup, find_packages + +version_py = os.path.join(os.path.dirname(__file__), 'pauvre', 'version.py') +version = open(version_py).read().strip().split('=')[-1].replace('"', '').strip() + + +setup(name='pauvre', + version=version, + description='Tools for plotting Oxford Nanopore and other long-read data.', + long_description=""" + 'pauvre' is a package for plotting Oxford Nanopore and other long read data. + The name means 'poor' in French, a play on words to the oft-used 'pore' prefix + for similar packages. This package was designed for python 3, but it might work in + python 2. You can visit the gitub page for more detailed information here: + https://github.com/conchoecia/pauvre + """, + url='https://github.com/conchoecia/pauvre', + author='Darrin Schultz', + author_email='dts@ucsc.edu', + classifiers=[ + 'Development Status :: 2 - Pre-Alpha', + 'License :: OSI Approved :: GNU General Public License v3 (GPLv3)', + 'Programming Language :: Python :: 3', + 'Programming Language :: Python :: 3.6', + 'Programming Language :: Python :: 3.7', + 'Operating System :: POSIX :: Linux', + 'Topic :: Scientific/Engineering :: Bio-Informatics', + 'Intended Audience :: Science/Research', + ], + packages=find_packages(), + python_requires='>=3', + install_requires=[ + "matplotlib >= 2.0.2", + "biopython >= 1.68", + "pandas >= 0.20.1", + "numpy >= 1.12.1", + "scipy", + "scikit-learn", + ], + entry_points={ + 'console_scripts': ['pauvre=pauvre.pauvre_main:main'], + }, + zip_safe=False, + include_package_data=True) |