summaryrefslogtreecommitdiff
diff options
context:
space:
mode:
authorNilesh Patra <nilesh@debian.org>2021-08-15 15:17:57 +0200
committerNilesh Patra <nilesh@debian.org>2021-08-15 15:17:57 +0200
commitc814969eef5135efcbd686f55b0c7e6ea99af695 (patch)
tree2c5fdf3cdbb43c3dd0fc5575670cec02d87d3b04
Import python-pauvre_0.2.3.orig.tar.gz
[dgit import orig python-pauvre_0.2.3.orig.tar.gz]
-rw-r--r--MANIFEST.in1
-rw-r--r--PKG-INFO25
-rw-r--r--README.md146
-rw-r--r--pauvre.egg-info/PKG-INFO25
-rw-r--r--pauvre.egg-info/SOURCES.txt32
-rw-r--r--pauvre.egg-info/dependency_links.txt1
-rw-r--r--pauvre.egg-info/entry_points.txt3
-rw-r--r--pauvre.egg-info/not-zip-safe1
-rw-r--r--pauvre.egg-info/requires.txt6
-rw-r--r--pauvre.egg-info/top_level.txt1
-rw-r--r--pauvre/__init__.py1
-rw-r--r--pauvre/bamparse.py211
-rw-r--r--pauvre/browser.py426
-rw-r--r--pauvre/custommargin.py441
-rw-r--r--pauvre/functions.py347
-rw-r--r--pauvre/gfftools.py581
-rwxr-xr-xpauvre/lsi/Q.py77
-rwxr-xr-xpauvre/lsi/T.py322
-rw-r--r--pauvre/lsi/__init__.py0
-rwxr-xr-xpauvre/lsi/helper.py94
-rwxr-xr-xpauvre/lsi/lsi.py107
-rwxr-xr-xpauvre/lsi/test.py110
-rw-r--r--pauvre/marginplot.py401
-rw-r--r--pauvre/pauvre_main.py636
-rw-r--r--pauvre/rcparams.py25
-rw-r--r--pauvre/redwood.py1047
-rw-r--r--pauvre/stats.py250
-rw-r--r--pauvre/synplot.py578
-rw-r--r--pauvre/tests/__init__.py0
-rw-r--r--pauvre/tests/test_synplot.py66
-rw-r--r--pauvre/version.py4
-rw-r--r--scripts/test.sh14
-rw-r--r--setup.cfg4
-rwxr-xr-xsetup.py70
34 files changed, 6053 insertions, 0 deletions
diff --git a/MANIFEST.in b/MANIFEST.in
new file mode 100644
index 0000000..85feea3
--- /dev/null
+++ b/MANIFEST.in
@@ -0,0 +1 @@
+include scripts/test.sh
diff --git a/PKG-INFO b/PKG-INFO
new file mode 100644
index 0000000..b709225
--- /dev/null
+++ b/PKG-INFO
@@ -0,0 +1,25 @@
+Metadata-Version: 1.2
+Name: pauvre
+Version: 0.2.3
+Summary: Tools for plotting Oxford Nanopore and other long-read data.
+Home-page: https://github.com/conchoecia/pauvre
+Author: Darrin Schultz
+Author-email: dts@ucsc.edu
+License: UNKNOWN
+Description:
+ 'pauvre' is a package for plotting Oxford Nanopore and other long read data.
+ The name means 'poor' in French, a play on words to the oft-used 'pore' prefix
+ for similar packages. This package was designed for python 3, but it might work in
+ python 2. You can visit the gitub page for more detailed information here:
+ https://github.com/conchoecia/pauvre
+
+Platform: UNKNOWN
+Classifier: Development Status :: 2 - Pre-Alpha
+Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.6
+Classifier: Programming Language :: Python :: 3.7
+Classifier: Operating System :: POSIX :: Linux
+Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
+Classifier: Intended Audience :: Science/Research
+Requires-Python: >=3
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..fa14729
--- /dev/null
+++ b/README.md
@@ -0,0 +1,146 @@
+[![travis-ci](https://travis-ci.org/conchoecia/pauvre.svg?branch=master)](https://travis-ci.org/conchoecia/pauvre) [![DOI](https://zenodo.org/badge/112774670.svg)](https://zenodo.org/badge/latestdoi/112774670)
+
+## <a name="started"></a>Getting Started
+
+```
+pauvre custommargin -i custom.tsv --ycol length --xcol qual # Custom tsv input
+```
+
+## Table of Contents
+
+- [Getting Started](#started)
+- [Users' Guide](#uguide)
+- [Installation](#installation)
+ - [Requirements](#reqs)
+ - [Install Instructions](#install)
+- [Usage](#usage)
+ - [pauvre stats](#stats)
+ - [pauvre marginplot](#marginplot)
+ - [Basic Usage](#marginbasic)
+ - [Plot Adjustments](#marginadjustments)
+ - [Specialized Options](#marginspecialized)
+- [Contributors](#contributors)
+
+## <a name="uguide"></a>Users' Guide
+
+Pauvre is a plotting package originally designed to help QC the length and
+quality distribution of Oxford Nanopore or PacBio reads. The main outputs
+are marginplots. Now, `pauvre` also hosts other additional data plotting
+scripts.
+
+This package currently hosts five scripts for plotting and/or printing stats.
+
+- `pauvre marginplot`
+ - takes a fastq file as input and outputs a marginal histogram with a heatmap.
+- `pauvre custommargin`
+ - takes a tsv as input and outputs a marginal histogram with custom columns of your choice.
+- `pauvre stats`
+ - Takes a fastq file as input and prints out a table of stats, including how many basepairs/reads there are for a length/mean quality cutoff.
+ - This is also automagically called when using `pauvre marginplot`
+- `pauvre redwood`
+ - I am happy to introduce the redwood plot to the world as a method
+ of representing circular genomes. A redwood plot contains long
+ reads as "rings" on the inside, a gene annotation
+ "cambrium/phloem", and a RNAseq "bark". The input is `.bam` files
+ for the long reads and RNAseq data, and a `.gff` file for the
+ annotation. More details to follow as we document this program
+ better...
+- `pauvre synteny`
+ - Makes a synteny plot of circular genomes. Finds the most
+ parsimonius rotation to display the synteny of all the input
+ genomes with the fewest crossings-over. Input is one `.gff` file
+ per circular genome and one directory of gene alignments.
+
+
+## <a name="installation"></a>Installation
+
+### <a name="reqs"></a>Requirements
+
+- You must have the following installed on your system to install this software:
+ - python 3.x
+ - matplotlib
+ - biopython
+ - pandas
+ - pillow
+
+### <a name="install">Install Instructions
+- Instructions to install on your mac or linux system. Not sure on
+ Windows! Make sure *python 3* is the active environment before
+ installing.
+ - `git clone https://github.com/conchoecia/pauvre.git`
+ - `cd ./pauvre`
+ - `pip3 install .`
+- Or, install with pip
+ - `pip3 install pauvre`
+
+## <a name="usage"><a/>Usage
+
+### <a name="stats"></a>`stats`
+ - generate basic statistics about the fastq file. For example, if I
+ want to know the number of bases and reads with AT LEAST a PHRED
+ score of 5 and AT LEAST a read length of 500, run the program as below
+ and look at the cells highlighted with `<braces>`.
+ - `pauvre stats --fastq miniDSMN15.fastq`
+
+
+```
+numReads: 1000
+numBasepairs: 1029114
+meanLen: 1029.114
+medianLen: 875.5
+minLen: 11
+maxLen: 5337
+N50: 1278
+L50: 296
+
+ Basepairs >= bin by mean PHRED and length
+minLen Q0 Q5 Q10 Q15 Q17.5 Q20 Q21.5 Q25 Q25.5 Q30
+ 0 1029114 1010681 935366 429279 143948 25139 3668 2938 2000 0
+ 500 984212 <968653> 904787 421307 142003 24417 3668 2938 2000 0
+ 1000 659842 649319 616788 300948 103122 17251 2000 2000 2000 0
+ et cetera...
+ Number of reads >= bin by mean Phred+Len
+minLen Q0 Q5 Q10 Q15 Q17.5 Q20 Q21.5 Q25 Q25.5 Q30
+ 0 1000 969 865 366 118 22 3 2 1 0
+ 500 873 <859> 789 347 113 20 3 2 1 0
+ 1000 424 418 396 187 62 11 1 1 1 0
+ et cetera...
+```
+
+### <a name="marginplot"></a>`marginplot`
+
+#### <a name="marginbasic"></a>Basic Usage
+- automatically calls `pauvre stats` for each fastq file
+- Make the default plot showing the 99th percentile of longest reads
+ - `pauvre marginplot --fastq miniDSMN15.fastq`
+ - ![default](files/default_miniDSMN15.png)
+- Make a marginal histogram for ONT 2D or 1D^2 cDNA data with a
+ lower maxlen and higher maxqual.
+ - `pauvre marginplot --maxlen 4000 --maxqual 25 --lengthbin 50 --fileform pdf png --qualbin 0.5 --fastq miniDSMN15.fastq`
+ - ![example1](files/miniDSMN15.png)
+
+#### <a name="marginadjustments"></a>Plot Adjustments
+
+- Filter out reads with a mean quality less than 5, and a length
+ less than 800. Zoom in to plot only mean quality of at least 4 and
+ read length at least 500bp.
+ - `pauvre marginplot -f miniDSMN15.fastq --filt_minqual 5 --filt_minlen 800 -y --plot_minlen 500 --plot_minqual 4`
+ - ![test4](files/test4.png)
+
+#### <a name="marginspecialized"></a>Specialized Options
+
+- Plot ONT 1D data with a large tail
+ - `pauvre marginplot --maxlen 100000 --maxqual 15 --lengthbin 500 <myfile>.fastq`
+- Get more resolution on lengths
+ - `pauvre marginplot --maxlen 100000 --lengthbin 5 <myfile>.fastq`
+
+- Turn off transparency if you just want a white background
+ - `pauvre marginplot --transparent False <myfile>.fastq`
+ - Note: transparency is the default behavior
+ - ![transparency](files/transparency.001.jpeg)
+
+## <a name="contributors"></a>Contributors
+
+@conchoecia (Darrin Schultz)
+@mebbert (Mark Ebbert)
+@wdecoster (Wouter De Coster)
diff --git a/pauvre.egg-info/PKG-INFO b/pauvre.egg-info/PKG-INFO
new file mode 100644
index 0000000..b709225
--- /dev/null
+++ b/pauvre.egg-info/PKG-INFO
@@ -0,0 +1,25 @@
+Metadata-Version: 1.2
+Name: pauvre
+Version: 0.2.3
+Summary: Tools for plotting Oxford Nanopore and other long-read data.
+Home-page: https://github.com/conchoecia/pauvre
+Author: Darrin Schultz
+Author-email: dts@ucsc.edu
+License: UNKNOWN
+Description:
+ 'pauvre' is a package for plotting Oxford Nanopore and other long read data.
+ The name means 'poor' in French, a play on words to the oft-used 'pore' prefix
+ for similar packages. This package was designed for python 3, but it might work in
+ python 2. You can visit the gitub page for more detailed information here:
+ https://github.com/conchoecia/pauvre
+
+Platform: UNKNOWN
+Classifier: Development Status :: 2 - Pre-Alpha
+Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.6
+Classifier: Programming Language :: Python :: 3.7
+Classifier: Operating System :: POSIX :: Linux
+Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
+Classifier: Intended Audience :: Science/Research
+Requires-Python: >=3
diff --git a/pauvre.egg-info/SOURCES.txt b/pauvre.egg-info/SOURCES.txt
new file mode 100644
index 0000000..f92fd64
--- /dev/null
+++ b/pauvre.egg-info/SOURCES.txt
@@ -0,0 +1,32 @@
+MANIFEST.in
+README.md
+setup.py
+pauvre/__init__.py
+pauvre/bamparse.py
+pauvre/browser.py
+pauvre/custommargin.py
+pauvre/functions.py
+pauvre/gfftools.py
+pauvre/marginplot.py
+pauvre/pauvre_main.py
+pauvre/rcparams.py
+pauvre/redwood.py
+pauvre/stats.py
+pauvre/synplot.py
+pauvre/version.py
+pauvre.egg-info/PKG-INFO
+pauvre.egg-info/SOURCES.txt
+pauvre.egg-info/dependency_links.txt
+pauvre.egg-info/entry_points.txt
+pauvre.egg-info/not-zip-safe
+pauvre.egg-info/requires.txt
+pauvre.egg-info/top_level.txt
+pauvre/lsi/Q.py
+pauvre/lsi/T.py
+pauvre/lsi/__init__.py
+pauvre/lsi/helper.py
+pauvre/lsi/lsi.py
+pauvre/lsi/test.py
+pauvre/tests/__init__.py
+pauvre/tests/test_synplot.py
+scripts/test.sh \ No newline at end of file
diff --git a/pauvre.egg-info/dependency_links.txt b/pauvre.egg-info/dependency_links.txt
new file mode 100644
index 0000000..8b13789
--- /dev/null
+++ b/pauvre.egg-info/dependency_links.txt
@@ -0,0 +1 @@
+
diff --git a/pauvre.egg-info/entry_points.txt b/pauvre.egg-info/entry_points.txt
new file mode 100644
index 0000000..704e4fa
--- /dev/null
+++ b/pauvre.egg-info/entry_points.txt
@@ -0,0 +1,3 @@
+[console_scripts]
+pauvre = pauvre.pauvre_main:main
+
diff --git a/pauvre.egg-info/not-zip-safe b/pauvre.egg-info/not-zip-safe
new file mode 100644
index 0000000..8b13789
--- /dev/null
+++ b/pauvre.egg-info/not-zip-safe
@@ -0,0 +1 @@
+
diff --git a/pauvre.egg-info/requires.txt b/pauvre.egg-info/requires.txt
new file mode 100644
index 0000000..2df747c
--- /dev/null
+++ b/pauvre.egg-info/requires.txt
@@ -0,0 +1,6 @@
+matplotlib>=2.0.2
+biopython>=1.68
+pandas>=0.20.1
+numpy>=1.12.1
+scipy
+scikit-learn
diff --git a/pauvre.egg-info/top_level.txt b/pauvre.egg-info/top_level.txt
new file mode 100644
index 0000000..57f5848
--- /dev/null
+++ b/pauvre.egg-info/top_level.txt
@@ -0,0 +1 @@
+pauvre
diff --git a/pauvre/__init__.py b/pauvre/__init__.py
new file mode 100644
index 0000000..81f6fbf
--- /dev/null
+++ b/pauvre/__init__.py
@@ -0,0 +1 @@
+from pauvre.version import __version__
diff --git a/pauvre/bamparse.py b/pauvre/bamparse.py
new file mode 100644
index 0000000..5791a2c
--- /dev/null
+++ b/pauvre/bamparse.py
@@ -0,0 +1,211 @@
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-
+
+# pauvre - just a pore plotting package
+# Copyright (c) 2016-2018 Darrin T. Schultz. All rights reserved.
+# twitter @conchoecia
+#
+# This file is part of pauvre.
+#
+# pauvre is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+#
+# pauvre is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with pauvre. If not, see <http://www.gnu.org/licenses/>.
+import pysam
+import pandas as pd
+import os
+
+class BAMParse():
+ """This class reads in a sam/bam file and constructs a pandas
+ dataframe of all the relevant information for the reads to pass on
+ and plot.
+ """
+ def __init__(self, filename, chrid = None, start = None,
+ stop = None, doubled = None):
+ self.filename = filename
+ self.doubled = doubled
+ #determine if the file is bam or sam
+ self.filetype = os.path.splitext(self.filename)[1]
+ #throw an error if the file is not bam or sam
+ if self.filetype not in ['.bam']:
+ raise Exception("""You have provided a file with an extension other than
+ '.bam', please check your command-line arguments""")
+ #now make sure there is an index file for the bam file
+ if not os.path.exists("{}.bai".format(self.filename)):
+ raise Exception("""Your .bam file is there, but it isn't indexed and
+ there isn't a .bai file to go with it. Use
+ 'samtools index <yourfile>.bam' to fix it.""")
+ #now open the file and just call it a sambam file
+ filetype_dict = {'.sam': '', '.bam': 'b'}
+ self.sambam = pysam.AlignmentFile(self.filename, "r{}".format(filetype_dict[self.filetype]))
+ if chrid == None:
+ self.chrid = self.sambam.references[0]
+ else:
+ self.chrid = chrid
+ self.refindex = self.sambam.references.index(self.chrid)
+ self.seqlength = self.sambam.lengths[self.refindex]
+ self.true_seqlength = self.seqlength if not self.doubled else int(self.seqlength/2)
+ if start == None or stop == None:
+ self.start = 1
+ self.stop = self.true_seqlength
+
+ self.features = self.parse()
+ self.features.sort_values(by=['POS','MAPLEN'], ascending=[True, False] ,inplace=True)
+ self.features.reset_index(inplace=True)
+ self.features.drop('index', 1, inplace=True)
+
+ self.raw_depthmap = self.get_depthmap()
+ self.features_depthmap = self.get_features_depthmap()
+
+ def get_depthmap(self):
+ depthmap = [0] * (self.stop - self.start + 1)
+ for p in self.sambam.pileup(self.chrid, self.start, self.stop):
+ index = p.reference_pos
+ if index >= self.true_seqlength:
+ index -= self.true_seqlength
+ depthmap[index] += p.nsegments
+ return depthmap
+
+ def get_features_depthmap(self):
+ """this method builds a more accurate pileup that is
+ based on if there is actually a mapped base at any
+ given position or not. better for long reads and RNA"""
+ depthmap = [0] * (self.stop - self.start + 1)
+ print("depthmap is: {} long".format(len(depthmap)))
+ for index, row in self.features.iterrows():
+ thisindex = row["POS"] - self.start
+ for thistup in row["TUPS"]:
+ b_type = thistup[1]
+ b_len = thistup[0]
+ if b_type == "M":
+ for j in range(b_len):
+ #this is necessary to reset the index if we wrap
+ # around to the beginning
+ if self.doubled and thisindex == len(depthmap):
+ thisindex = 0
+ depthmap[thisindex] += 1
+ thisindex += 1
+ elif b_type in ["S", "H", "I"]:
+ pass
+ elif b_type in ["D", "N"]:
+ thisindex += b_len
+ #this is necessary to reset the index if we wrap
+ # around to the beginning
+ if self.doubled and thisindex >= len(depthmap):
+ thisindex = thisindex - len(depthmap)
+
+ return depthmap
+
+ def parse(self):
+ data = {'POS': [], 'MAPQ': [], 'TUPS': [] }
+ for read in self.sambam.fetch(self.chrid, self.start, self.stop):
+ data['POS'].append(read.reference_start + 1)
+ data['TUPS'].append(self.cigar_parse(read.cigartuples))
+ data['MAPQ'].append(read.mapq)
+ features = pd.DataFrame.from_dict(data, orient='columns')
+ features['ALNLEN'] = features['TUPS'].apply(self.aln_len)
+ features['TRULEN'] = features['TUPS'].apply(self.tru_len)
+ features['MAPLEN'] = features['TUPS'].apply(self.map_len)
+ features['POS'] = features['POS'].apply(self.fix_pos)
+ return features
+
+ def cigar_parse(self, tuples):
+ """
+ arguments:
+ <tuples> a CIGAR string tuple list in pysam format
+
+ purpose:
+ This function uses the pysam cigarstring tuples format and returns
+ a list of tuples in the internal format, [(20, 'M'), (5, "I")], et
+ cetera. The zeroth element of each tuple is the number of bases for the
+ CIGAR string feature. The first element of each tuple is the CIGAR
+ string feature type.
+
+ There are several feature types in SAM/BAM files. See below:
+ 'M' - match
+ 'I' - insertion relative to reference
+ 'D' - deletion relative to reference
+ 'N' - skipped region from the reference
+ 'S' - soft clip, not aligned but still in sam file
+ 'H' - hard clip, not aligned and not in sam file
+ 'P' - padding (silent deletion from padded reference)
+ '=' - sequence match
+ 'X' - sequence mismatch
+ 'B' - BAM_CBACK (I don't actually know what this is)
+
+ """
+ # I used the map values from http://pysam.readthedocs.io/en/latest/api.html#pysam.AlignedSegment
+ psam_to_char = {0: 'M', 1: 'I', 2: 'D', 3: 'N', 4: 'S',
+ 5: 'H', 6: 'P', 7: '=', 8: 'X', 9: 'B'}
+ return [(value, psam_to_char[feature]) for feature, value in tuples]
+
+ def aln_len(self, TUPS):
+ """
+ arguments:
+ <TUPS> a list of tuples output from the cigar_parse() function.
+
+ purpose:
+ This returns the alignment length of the read to the reference.
+ Specifically, it sums the length of all of the matches and deletions.
+ In effect, this number is length of the region of the reference sequence to
+ which the read maps. This number is probably the most useful for selecting
+ reads to visualize in the mapped read plot.
+ """
+ return sum([pair[0] for pair in TUPS if pair[1] not in ['S', 'H', 'I']])
+
+ def map_len(self, TUPS):
+ """
+ arguments:
+ <TUPS> a list of tuples output from the cigar_parse() function.
+
+ purpose:
+ This function returns the map length (all matches and deletions relative to
+ the reference), plus the unmapped 5' and 3' hard/soft clipped sequences.
+ This number is useful if you want to visualize how much 5' and 3' sequence
+ of a read did not map to the reference. For example, poor quality 5' and 3'
+ tails are common in Nanopore reads.
+ """
+ return sum([pair[0] for pair in TUPS if pair[1] not in ['I']])
+
+ def tru_len(self, TUPS):
+ """
+ arguments:
+ <TUPS> a list of tuples output from the cigar_parse() function.
+
+ purpose:
+ This function returns the total length of the read, including insertions,
+ deletions, matches, soft clips, and hard clips. This is useful for
+ comparing to the map length or alignment length to see what percentage of
+ the read aligned to the reference.
+ """
+ return sum([pair[0] for pair in TUPS])
+
+ def fix_pos(self, start_index):
+ """
+ arguments:
+ an int
+
+ purpose:
+ When using a doubled SAMfile, any reads that start after the first copy
+ of the reference risk running over the plotting window, causing the program
+ to crash. This function corrects for this issue by changing the start site
+ of the read.
+
+ Note: this will probably break the program if not using a double alignment
+ since no reads would map past half the length of the single reference
+ """
+ if self.doubled:
+ if start_index > int(self.seqlength/2):
+ return start_index - int(self.seqlength/2) - 1
+ else:
+ return start_index
+ else:
+ return start_index
diff --git a/pauvre/browser.py b/pauvre/browser.py
new file mode 100644
index 0000000..d6b96ec
--- /dev/null
+++ b/pauvre/browser.py
@@ -0,0 +1,426 @@
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-
+
+# pauvre - a pore plotting package
+# Copyright (c) 2016-2018 Darrin T. Schultz. All rights reserved.
+#
+# This file is part of pauvre.
+#
+# pauvre is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+#
+# pauvre is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with pauvre. If not, see <http://www.gnu.org/licenses/>.
+
+# following this tutorial to install helvetica
+# https://github.com/olgabot/sciencemeetproductivity.tumblr.com/blob/master/posts/2012/11/how-to-set-helvetica-as-the-default-sans-serif-font-in.md
+global hfont
+hfont = {'fontname':'Helvetica'}
+
+import matplotlib
+matplotlib.use('Agg')
+import matplotlib.pyplot as plt
+from matplotlib.colors import LinearSegmentedColormap, Normalize
+import matplotlib.patches as patches
+
+
+import gffutils
+import pandas as pd
+pd.set_option('display.max_columns', 500)
+pd.set_option('display.width', 1000)
+import numpy as np
+import os
+import pauvre.rcparams as rc
+from pauvre.functions import GFFParse, print_images, timestamp
+from pauvre import gfftools
+from pauvre.lsi.lsi import intersection
+from pauvre.bamparse import BAMParse
+import progressbar
+import platform
+import sys
+import time
+
+# Biopython stuff
+from Bio import SeqIO
+import Bio.SubsMat.MatrixInfo as MI
+
+
+class PlotCommand:
+ def __init__(self, plotcmd, REF):
+ self.ref = REF
+ self.style_choices = []
+ self.cmdtype = ""
+ self.path = ""
+ self.style = ""
+ self.options = ""
+ self._parse_cmd(plotcmd)
+
+ def _parse_cmd(self, plotcmd):
+ chunks = plotcmd.split(":")
+ if chunks[0] == "ref":
+ self.cmdtype = "ref"
+ if len(chunks) < 2:
+ self._len_error()
+ self.path = self.ref
+ self.style = chunks[1]
+ self.style_choices = ["normal", "colorful"]
+ self._check_style_choices()
+ if len(chunks) > 2:
+ self.options = chunks[2].split(",")
+ elif chunks[0] in ["bam", "peptides"]:
+ if len(chunks) < 3:
+ self._len_error()
+ self.cmdtype = chunks[0]
+ self.path = os.path.abspath(os.path.expanduser(chunks[1]))
+ self.style = chunks[2]
+ if self.cmdtype == "bam":
+ self.style_choices = ["depth", "reads"]
+ else:
+ self.style_choices = ["depth"]
+ self._check_style_choices()
+ if len(chunks) > 3:
+ self.options = chunks[3].split(",")
+ elif chunks[0] in ["gff3"]:
+ if len(chunks) < 2:
+ self._len_error()
+ self.cmdtype = chunks[0]
+ self.path = os.path.abspath(os.path.expanduser(chunks[1]))
+ if len(chunks) > 2:
+ self.options = chunks[2].split(",")
+
+
+ def _len_error(self):
+ raise IOError("""You selected {} to plot,
+ but need to specify the style at least.""".format(self.cmdtype))
+ def _check_style_choices(self):
+ if self.style not in self.style_choices:
+ raise IOError("""You selected {} style for
+ ref. You must select from {}. """.format(
+ self.style, self.style_choices))
+
+global dna_color
+dna_color = {"A": (81/255, 87/255, 251/255, 1),
+ "T": (230/255, 228/255, 49/255, 1),
+ "G": (28/255, 190/255, 32/255, 1),
+ "C": (220/255, 10/255, 23/255, 1)}
+
+#these are the line width for the different cigar string flags.
+# usually, only M, I, D, S, and H appear in bwa mem output
+global widthDict
+widthDict = {'M':0.45, # match
+ 'I':0.9, # insertion relative to reference
+ 'D':0.05, # deletion relative to reference
+ 'N':0.1, # skipped region from the reference
+ 'S':0.1, # soft clip, not aligned but still in sam file
+ 'H':0.1, # hard clip, not aligned and not in sam file
+ 'P':0.1, # padding (silent deletion from padded reference)
+ '=':0.1, # sequence match
+ 'X':0.1} # sequence mismatch
+
+
+global richgrey
+richgrey = (60/255, 54/255, 69/255, 1)
+
+def plot_ref(panel, chrid, start, stop, thiscmd):
+ panel.set_xlim([start, stop])
+ panel.set_ylim([-2.5, 2.5])
+ panel.set_xticks([int(val) for val in np.linspace(start, stop, 6)])
+ if thiscmd.style == "colorful":
+ thisseq = ""
+ for record in SeqIO.parse(thiscmd.ref, "fasta"):
+ if record.id == chrid:
+ thisseq = record.seq[start-1: stop]
+ for i in range(len(thisseq)):
+ left = start + i
+ bottom = -0.5
+ width = 1
+ height = 1
+ rect = patches.Rectangle((left, bottom),
+ width, height,
+ linewidth = 0,
+ facecolor = dna_color[thisseq[i]] )
+ panel.add_patch(rect)
+ return panel
+
+def safe_log10(value):
+ try:
+ logval = np.log10(value)
+ except:
+ logval = 0
+ return logval
+
+def plot_bam(panel, chrid, start, stop, thiscmd):
+ bam = BAMParse(thiscmd.path)
+ panel.set_xlim([start, stop])
+ if thiscmd.style == "depth":
+ maxdepth = max(bam.features_depthmap)
+ maxdepthlog = safe_log10(maxdepth)
+ if "log" in thiscmd.options:
+ panel.set_ylim([-maxdepthlog, maxdepthlog])
+ panel.set_yticks([int(val) for val in np.linspace(0, maxdepthlog, 2)])
+
+ else:
+ panel.set_yticks([int(val) for val in np.linspace(0, maxdepth, 2)])
+ if "c" in thiscmd.options:
+ panel.set_ylim([-maxdepth, maxdepth])
+ else:
+ panel.set_ylim([0, maxdepth])
+
+
+ for i in range(len(bam.features_depthmap)):
+ left = start + i
+ width = 1
+ if "c" in thiscmd.options and "log" in thiscmd.options:
+ bottom = -1 * safe_log10(bam.features_depthmap[i])
+ height = safe_log10(bam.features_depthmap[i]) * 2
+ elif "c" in thiscmd.options and "log" not in thiscmd.options:
+ bottom = -bam.features_depthmap[i]
+ height = bam.features_depthmap[i] * 2
+ else:
+ bottom = 0
+ height = bam.features_depthmap[i]
+ if height > 0:
+ rect = patches.Rectangle((left, bottom),
+ width, height,
+ linewidth = 0,
+ facecolor = richgrey )
+ panel.add_patch(rect)
+
+ if thiscmd.style == "reads":
+ #If we're plotting reads, we don't need y-axis
+ panel.tick_params(bottom="off", labelbottom="off",
+ left ="off", labelleft = "off")
+ reads = bam.features.copy()
+ panel.set_xlim([start, stop])
+ direction = "for"
+ if direction == 'for':
+ bav = {"by":['POS','MAPLEN'], "asc": [True, False]}
+ direction= 'rev'
+ elif direction == 'rev':
+ bav = {"by":['POS','MAPLEN'], "asc": [True, False]}
+ direction = 'for'
+ reads.sort_values(by=bav["by"], ascending=bav['asc'],inplace=True)
+ reads.reset_index(drop=True, inplace=True)
+
+ depth_count = -1
+ plotind = start
+ while len(reads) > 0:
+ #depth_count -= 1
+ #print("len of reads is {}".format(len(reads)))
+ potential = reads.query("POS >= {}".format(plotind))
+ if len(potential) == 0:
+ readsindex = 0
+ #print("resetting plot ind from {} to {}".format(
+ # plotind, reads.loc[readsindex, "POS"]))
+ depth_count -= 1
+
+ else:
+ readsindex = int(potential.index.values[0])
+ #print("pos of potential is {}".format(reads.loc[readsindex, "POS"]))
+ plotind = reads.loc[readsindex, "POS"]
+
+ for TUP in reads.loc[readsindex, "TUPS"]:
+ b_type = TUP[1]
+ b_len = TUP[0]
+ #plotting params
+ # left same for all.
+ left = plotind
+ bottom = depth_count
+ height = widthDict[b_type]
+ width = b_len
+ plot = True
+ color = richgrey
+ if b_type in ["H", "S"]:
+ """We don't plot hard or sort clips - like IGV"""
+ plot = False
+ pass
+ elif b_type == "M":
+ """just plot matches normally"""
+ plotind += b_len
+ elif b_type in ["D", "P", "=", "X"]:
+ """deletions get an especially thin line"""
+ plotind += b_len
+ elif b_type == "I":
+ """insertions get a special purple bar"""
+ left = plotind - (b_len/2)
+ color = (200/255, 41/255, 226/255, 0.5)
+ elif b_type == "N":
+ """skips for splice junctions, line in middle"""
+ bottom += (widthDict["M"]/2) - (widthDict["N"]/2)
+ plotind += b_len
+ if plot:
+ rect = patches.Rectangle((left, bottom),
+ width, height,
+ linewidth = 0,
+ facecolor = color )
+ panel.add_patch(rect)
+ reads.drop([readsindex], inplace=True)
+ reads.reset_index(drop = True, inplace=True)
+ panel.set_ylim([depth_count, 0])
+
+ return panel
+
+def plot_gff3(panel, chrid, start, stop, thiscmd):
+
+ db = gffutils.create_db(thiscmd.path, ":memory:")
+ bottom = 0
+ genes_to_plot = [thing.id
+ for thing in db.region(
+ region=(chrid, start, stop),
+ completely_within=False)
+ if thing.featuretype == "gene" ]
+ #print("genes to plot are: " genes_to_plot)
+ panel.set_xlim([start, stop])
+ # we don't need labels on one of the axes
+ #panel.tick_params(bottom="off", labelbottom="off",
+ # left ="off", labelleft = "off")
+
+
+ ticklabels = []
+ for geneid in genes_to_plot:
+ plotnow = False
+ if "id" in thiscmd.options and geneid in thiscmd.options:
+ plotnow = True
+ elif "id" not in thiscmd.options:
+ plotnow = True
+ if plotnow:
+ ticklabels.append(geneid)
+ if db[geneid].strand == "+":
+ panel = gfftools._plot_left_to_right_introns_top(panel, geneid, db,
+ bottom, text = None)
+ bottom += 1
+ else:
+ raise IOError("""Plotting things on the reverse strand is
+ not yet implemented""")
+ #print("tick labels are", ticklabels)
+ panel.set_ylim([0, len(ticklabels)])
+ yticks_vals = [val for val in np.linspace(0.5, len(ticklabels) - 0.5, len(ticklabels))]
+ panel.set_yticks(yticks_vals)
+ print("bottom is: ", bottom)
+ print("len tick labels is: ", len(ticklabels))
+ print("intervals are: ", yticks_vals)
+ panel.set_yticklabels(ticklabels)
+
+ return panel
+
+def browser(args):
+ rc.update_rcParams()
+ print(args)
+
+ # if the user forgot to add a reference, they must add one
+ if args.REF is None:
+ raise IOError("You must specify the reference fasta file")
+
+ # if the user forgot to add the start and stop,
+ # Print the id and the start/stop
+ if args.CHR is None or args.START is None or args.STOP is None:
+ print("""\n You have forgotten to specify the chromosome,
+ the start coordinate, or the stop coordinate to plot.
+ Try something like '-c chr1 --start 20 --stop 2000'.
+ Here is a list of chromosome ids and their lengths
+ from the provided reference. The minimum start coordinate
+ is one and the maximum stop coordinate is the length of
+ the chromosome.\n\nID\tLength""")
+ for record in SeqIO.parse(args.REF, "fasta"):
+ print("{}\t{}".format(record.id, len(record.seq)))
+ sys.exit(0)
+
+ if args.CMD is None:
+ raise IOError("You must specify a plotting command.")
+
+ # now we parse each set of commands
+ commands = [PlotCommand(thiscmd, args.REF)
+ for thiscmd in reversed(args.CMD)]
+
+ # set the figure dimensions
+ if args.ratio:
+ figWidth = args.ratio[0] + 1
+ figHeight = args.ratio[1] + 1
+ #set the panel dimensions
+ panelWidth = args.ratio[0]
+ panelHeight = args.ratio[1]
+
+ else:
+ figWidth = 7
+ figHeight = len(commands) + 2
+ #set the panel dimensions
+ panelWidth = 5
+ # panel margin x 2 + panel height = total vertical height
+ panelHeight = 0.8
+ panelMargin = 0.1
+
+ figure = plt.figure(figsize=(figWidth,figHeight))
+
+ #find the margins to center the panel in figure
+ leftMargin = (figWidth - panelWidth)/2
+ bottomMargin = ((figHeight - panelHeight)/2) + panelMargin
+
+ plot_dict = {"ref": plot_ref,
+ "bam": plot_bam,
+ "gff3": plot_gff3
+ #"peptides": plot_peptides
+ }
+
+ panels = []
+ for i in range(len(commands)):
+ thiscmd = commands[i]
+ if thiscmd.cmdtype in ["gff3", "ref", "peptides"] \
+ or thiscmd.style == "depth" \
+ or "narrow" in thiscmd.options:
+ temp_panelHeight = 0.5
+ else:
+ temp_panelHeight = panelHeight
+ panels.append( plt.axes([leftMargin/figWidth, #left
+ bottomMargin/figHeight, #bottom
+ panelWidth/figWidth, #width
+ temp_panelHeight/figHeight]) #height
+ )
+ panels[i].tick_params(axis='both',which='both',\
+ bottom='off', labelbottom='off',\
+ left='on', labelleft='on', \
+ right='off', labelright='off',\
+ top='off', labeltop='off')
+ if thiscmd.cmdtype == "ref":
+ panels[i].tick_params(bottom='on', labelbottom='on')
+
+
+
+ #turn off some of the axes
+ panels[i].spines["top"].set_visible(False)
+ panels[i].spines["bottom"].set_visible(False)
+ panels[i].spines["right"].set_visible(False)
+ panels[i].spines["left"].set_visible(False)
+
+ panels[i] = plot_dict[thiscmd.cmdtype](panels[i], args.CHR,
+ args.START, args.STOP,
+ thiscmd)
+
+ bottomMargin = bottomMargin + temp_panelHeight + (2 * panelMargin)
+
+ # Print image(s)
+ if args.BASENAME is None:
+ file_base = 'browser_{}.png'.format(timestamp())
+ else:
+ file_base = args.BASENAME
+ path = None
+ if args.path:
+ path = args.path
+ transparent = args.transparent
+ print_images(
+ base_output_name=file_base,
+ image_formats=args.fileform,
+ dpi=args.dpi,
+ no_timestamp = kwargs["no_timestamp"],
+ path = path,
+ transparent=transparent)
+
+
+def run(args):
+ browser(args)
diff --git a/pauvre/custommargin.py b/pauvre/custommargin.py
new file mode 100644
index 0000000..d522e42
--- /dev/null
+++ b/pauvre/custommargin.py
@@ -0,0 +1,441 @@
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-
+
+# pauvre - just a pore PhD student's plotting package
+# Copyright (c) 2016-2017 Darrin T. Schultz. All rights reserved.
+#
+# This file is part of pauvre.
+#
+# pauvre is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+#
+# pauvre is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with pauvre. If not, see <http://www.gnu.org/licenses/>.
+
+import ast
+import matplotlib
+matplotlib.use('Agg')
+import matplotlib.pyplot as plt
+import matplotlib.patches as mplpatches
+from matplotlib.colors import LinearSegmentedColormap
+import numpy as np
+import pandas as pd
+import os.path as opath
+from sys import stderr
+from pauvre.functions import print_images
+from pauvre.marginplot import generate_panel
+from pauvre.stats import stats
+import pauvre.rcparams as rc
+import sys
+import logging
+
+# logging
+logger = logging.getLogger('pauvre')
+
+def _generate_histogram_bin_patches(panel, bins, bin_values, horizontal=True):
+ """This helper method generates the histogram that is added to the panel.
+
+ In this case, horizontal = True applies to the mean quality histogram.
+ So, horizontal = False only applies to the length histogram.
+ """
+ l_width = 0.0
+ f_color = (0.5, 0.5, 0.5)
+ e_color = (0, 0, 0)
+ if horizontal:
+ for step in np.arange(0, len(bin_values), 1):
+ left = bins[step]
+ bottom = 0
+ width = bins[step + 1] - bins[step]
+ height = bin_values[step]
+ hist_rectangle = mplpatches.Rectangle((left, bottom), width, height,
+ linewidth=l_width,
+ facecolor=f_color,
+ edgecolor=e_color)
+ panel.add_patch(hist_rectangle)
+ else:
+ for step in np.arange(0, len(bin_values), 1):
+ left = 0
+ bottom = bins[step]
+ width = bin_values[step]
+ height = bins[step + 1] - bins[step]
+
+ hist_rectangle = mplpatches.Rectangle((left, bottom), width, height,
+ linewidth=l_width,
+ facecolor=f_color,
+ edgecolor=e_color)
+ panel.add_patch(hist_rectangle)
+
+
+def generate_histogram(panel, data_list, min_plot_val, max_plot_val,
+ bin_interval, hist_horizontal=True,
+ left_spine=True, bottom_spine=True,
+ top_spine=False, right_spine=False, x_label=None,
+ y_label=None):
+
+ bins = np.arange(0, max_plot_val, bin_interval)
+
+ bin_values, bins2 = np.histogram(data_list, bins)
+
+ # hist_horizontal is used for quality
+ if hist_horizontal:
+ panel.set_xlim([min_plot_val, max_plot_val])
+ panel.set_ylim([0, max(bin_values * 1.1)])
+ # and hist_horizontal == Fale is for read length
+ else:
+ panel.set_xlim([0, max(bin_values * 1.1)])
+ panel.set_ylim([min_plot_val, max_plot_val])
+
+ # Generate histogram bin patches, depending on whether we're plotting
+ # vertically or horizontally
+ _generate_histogram_bin_patches(panel, bins, bin_values, hist_horizontal)
+
+ panel.spines['left'].set_visible(left_spine)
+ panel.spines['bottom'].set_visible(bottom_spine)
+ panel.spines['top'].set_visible(top_spine)
+ panel.spines['right'].set_visible(right_spine)
+
+ if y_label is not None:
+ panel.set_ylabel(y_label)
+ if x_label is not None:
+ panel.set_xlabel(x_label)
+
+def generate_square_map(panel, data_frame, plot_min_y, plot_min_x,
+ plot_max_y, plot_max_x, color,
+ xcol, ycol, **kwargs):
+ """This generates the heatmap panels using squares. Everything is
+ quantized by ints.
+ """
+ panel.set_xlim([plot_min_x, plot_max_x])
+ panel.set_ylim([plot_min_y, plot_max_y])
+ tempdf = data_frame[[xcol, ycol]]
+ data_frame = tempdf.astype(int)
+
+ querystring = "{}<={} and {}<={}".format(plot_min_y, ycol, plot_min_x, xcol)
+ print(" - Filtering squares with {}".format(querystring))
+ square_this = data_frame.query(querystring)
+
+ querystring = "{}<{} and {}<{}".format(ycol, plot_max_y, xcol, plot_max_x)
+ print(" - Filtering squares with {}".format(querystring))
+ square_this = square_this.query(querystring)
+
+ counts = square_this.groupby([xcol, ycol]).size().reset_index(name='counts')
+ for index, row in counts.iterrows():
+ x_pos = row[xcol]
+ y_pos = row[ycol]
+ thiscolor = color(row["counts"]/(counts["counts"].max()))
+ rectangle1=mplpatches.Rectangle((x_pos,y_pos),1,1,
+ linewidth=0,\
+ facecolor=thiscolor)
+ panel.add_patch(rectangle1)
+
+ all_counts = counts["counts"]
+ return all_counts
+
+def generate_heat_map(panel, data_frame, plot_min_y, plot_min_x,
+ plot_max_y, plot_max_x, color,
+ xcol, ycol, **kwargs):
+ panel.set_xlim([plot_min_x, plot_max_x])
+ panel.set_ylim([plot_min_y, plot_max_y])
+
+ querystring = "{}<={} and {}<={}".format(plot_min_y, ycol, plot_min_x, xcol)
+ print(" - Filtering hexmap with {}".format(querystring))
+ hex_this = data_frame.query(querystring)
+
+ querystring = "{}<{} and {}<{}".format(ycol, plot_max_y, xcol, plot_max_x)
+ print(" - Filtering hexmap with {}".format(querystring))
+ hex_this = hex_this.query(querystring)
+
+ # This single line controls plotting the hex bins in the panel
+ hex_vals = panel.hexbin(hex_this[xcol], hex_this[ycol], gridsize=49,
+ linewidths=0.0, cmap=color)
+ for each in panel.spines:
+ panel.spines[each].set_visible(False)
+
+ counts = hex_vals.get_array()
+ return counts
+
+def generate_legend(panel, counts, color):
+ # completely custom for more control
+ panel.set_xlim([0, 1])
+ panel.set_ylim([0, 1000])
+ panel.set_yticks([int(x) for x in np.linspace(0, 1000, 6)])
+ panel.set_yticklabels([int(x) for x in np.linspace(0, max(counts), 6)])
+ for i in np.arange(0, 1001, 1):
+ rgba = color(i / 1001)
+ alpha = rgba[-1]
+ facec = rgba[0:3]
+ hist_rectangle = mplpatches.Rectangle((0, i), 1, 1,
+ linewidth=0.0,
+ facecolor=facec,
+ edgecolor=(0, 0, 0),
+ alpha=alpha)
+ panel.add_patch(hist_rectangle)
+ panel.spines['top'].set_visible(False)
+ panel.spines['left'].set_visible(False)
+ panel.spines['bottom'].set_visible(False)
+ panel.yaxis.set_label_position("right")
+ panel.set_ylabel('count')
+
+def custommargin(df, **kwargs):
+ rc.update_rcParams()
+
+ # 250, 231, 34 light yellow
+ # 67, 1, 85
+ # R=np.linspace(65/255,1,101)
+ # G=np.linspace(0/255, 231/255, 101)
+ # B=np.linspace(85/255, 34/255, 101)
+ # R=65/255, G=0/255, B=85/255
+ Rf = 65 / 255
+ Bf = 85 / 255
+ pdict = {'red': ((0.0, Rf, Rf),
+ (1.0, Rf, Rf)),
+ 'green': ((0.0, 0.0, 0.0),
+ (1.0, 0.0, 0.0)),
+ 'blue': ((0.0, Bf, Bf),
+ (1.0, Bf, Bf)),
+ 'alpha': ((0.0, 0.0, 0.0),
+ (1.0, 1.0, 1.0))
+ }
+ # Now we will use this example to illustrate 3 ways of
+ # handling custom colormaps.
+ # First, the most direct and explicit:
+ purple1 = LinearSegmentedColormap('Purple1', pdict)
+
+ # set the figure dimensions
+ fig_width = 1.61 * 3
+ fig_height = 1 * 3
+ fig = plt.figure(figsize=(fig_width, fig_height))
+
+ # set the panel dimensions
+ heat_map_panel_width = fig_width * 0.5
+ heat_map_panel_height = heat_map_panel_width * 0.62
+
+ # find the margins to center the panel in figure
+ fig_left_margin = fig_bottom_margin = (1 / 6)
+
+ # lengthPanel
+ y_panel_width = (1 / 8)
+
+ # the color Bar parameters
+ legend_panel_width = (1 / 24)
+
+ # define padding
+ h_padding = 0.02
+ v_padding = 0.05
+
+ # Set whether to include y-axes in histograms
+ print(" - Setting panel options.", file = sys.stderr)
+ if kwargs["Y_AXES"]:
+ y_bottom_spine = True
+ y_bottom_tick = True
+ y_bottom_label = True
+ x_left_spine = True
+ x_left_tick = True
+ x_left_label = True
+ x_y_label = 'Count'
+ else:
+ y_bottom_spine = False
+ y_bottom_tick = False
+ y_bottom_label = False
+ x_left_spine = False
+ x_left_tick = False
+ x_left_label = False
+ x_y_label = None
+
+ panels = []
+
+ # Quality histogram panel
+ print(" - Generating the x-axis panel.", file = sys.stderr)
+ x_panel_left = fig_left_margin + y_panel_width + h_padding
+ x_panel_width = heat_map_panel_width / fig_width
+ x_panel_height = y_panel_width * fig_width / fig_height
+ x_panel = generate_panel(x_panel_left,
+ fig_bottom_margin,
+ x_panel_width,
+ x_panel_height,
+ left_tick_param=x_left_tick,
+ label_left_tick_param=x_left_label)
+ panels.append(x_panel)
+
+ # y histogram panel
+ print(" - Generating the y-axis panel.", file = sys.stderr)
+ y_panel_bottom = fig_bottom_margin + x_panel_height + v_padding
+ y_panel_height = heat_map_panel_height / fig_height
+ y_panel = generate_panel(fig_left_margin,
+ y_panel_bottom,
+ y_panel_width,
+ y_panel_height,
+ bottom_tick_param=y_bottom_tick,
+ label_bottom_tick_param=y_bottom_label)
+ panels.append(y_panel)
+
+ # Heat map panel
+ heat_map_panel_left = fig_left_margin + y_panel_width + h_padding
+ heat_map_panel_bottom = fig_bottom_margin + x_panel_height + v_padding
+ print(" - Generating the heat map panel.", file = sys.stderr)
+ heat_map_panel = generate_panel(heat_map_panel_left,
+ heat_map_panel_bottom,
+ heat_map_panel_width / fig_width,
+ heat_map_panel_height / fig_height,
+ bottom_tick_param='off',
+ label_bottom_tick_param='off',
+ left_tick_param='off',
+ label_left_tick_param='off')
+ panels.append(heat_map_panel)
+ heat_map_panel.set_title(kwargs["title"])
+
+ # Legend panel
+ print(" - Generating the legend panel.", file = sys.stderr)
+ legend_panel_left = fig_left_margin + y_panel_width + \
+ heat_map_panel_width / fig_width + h_padding
+ legend_panel_bottom = fig_bottom_margin + x_panel_height + v_padding
+ legend_panel_height = heat_map_panel_height / fig_height
+ legend_panel = generate_panel(legend_panel_left, legend_panel_bottom,
+ legend_panel_width, legend_panel_height,
+ bottom_tick_param=False,
+ label_bottom_tick_param=False,
+ left_tick_param=False,
+ label_left_tick_param=False,
+ right_tick_param=True,
+ label_right_tick_param=True)
+ panels.append(legend_panel)
+
+ #
+ # Everything above this is just to set up the panels
+ #
+ ##################################################################
+
+ # Set max and min viewing window for the xaxis
+ if kwargs["plot_max_x"]:
+ plot_max_x = kwargs["plot_max_x"]
+ else:
+ if kwargs["square"]:
+ plot_max_x = df[kwargs["xcol"]].max()
+ plot_max_x = max(np.ceil(df[kwargs["xcol"]]))
+ plot_min_x = kwargs["plot_min_x"]
+
+ # Set x bin sizes
+ if kwargs["xbin"]:
+ x_bin_interval = kwargs["xbin"]
+ else:
+ # again, this is just based on what looks good from experience
+ x_bin_interval = 1
+
+ # Generate x histogram
+ print(" - Generating the x-axis histogram.", file = sys.stderr)
+ generate_histogram(panel = x_panel,
+ data_list = df[kwargs['xcol']],
+ min_plot_val = plot_min_x,
+ max_plot_val = plot_max_x,
+ bin_interval = x_bin_interval,
+ hist_horizontal = True,
+ x_label=kwargs['xcol'],
+ y_label=x_y_label,
+ left_spine=x_left_spine)
+
+ # Set max and min viewing window for the y axis
+ if kwargs["plot_max_y"]:
+ plot_max_y = kwargs["plot_max_y"]
+ else:
+ if kwargs["square"]:
+ plot_max_y = df[kwargs["ycol"]].max()
+ else:
+ plot_max_y = max(np.ceil(df[kwargs["ycol"]]))
+
+ plot_min_y = kwargs["plot_min_y"]
+ # Set y bin sizes
+ if kwargs["ybin"]:
+ y_bin_interval = kwargs["ybin"]
+ else:
+ y_bin_interval = 1
+
+ # Generate y histogram
+ print(" - Generating the y-axis histogram.", file = sys.stderr)
+ generate_histogram(panel = y_panel,
+ data_list = df[kwargs['ycol']],
+ min_plot_val = plot_min_y,
+ max_plot_val = plot_max_y,
+ bin_interval = y_bin_interval,
+ hist_horizontal = False,
+ y_label = kwargs['ycol'],
+ bottom_spine = y_bottom_spine)
+
+ # Generate heat map
+ if kwargs["square"]:
+ print(" - Generating the square heatmap.", file = sys.stderr)
+ counts = generate_square_map(panel = heat_map_panel,
+ data_frame = df,
+ plot_min_y = plot_min_y,
+ plot_min_x = plot_min_x,
+ plot_max_y = plot_max_y,
+ plot_max_x = plot_max_x,
+ color = purple1,
+ xcol = kwargs["xcol"],
+ ycol = kwargs["ycol"])
+ else:
+ print(" - Generating the heatmap.", file = sys.stderr)
+ counts = generate_heat_map(panel = heat_map_panel,
+ data_frame = df,
+ plot_min_y = plot_min_y,
+ plot_min_x = plot_min_x,
+ plot_max_y = plot_max_y,
+ plot_max_x = plot_max_x,
+ color = purple1,
+ xcol = kwargs["xcol"],
+ ycol = kwargs["ycol"])
+
+ # Generate legend
+ print(" - Generating the legend.", file = sys.stderr)
+ generate_legend(legend_panel, counts, purple1)
+
+ # inform the user of the plotting window if not quiet mode
+ #if not kwargs["QUIET"]:
+ # print("""plotting in the following window:
+ # {0} <= Q-score (x-axis) <= {1}
+ # {2} <= length (y-axis) <= {3}""".format(
+ # plot_min_x, plot_max_x, min_plot_val, max_plot_val),
+ # file=stderr)
+
+ # Print image(s)
+ if kwargs["output_base_name"] is None:
+ file_base = "custommargin"
+ else:
+ file_base = kwargs["output_base_name"]
+
+ print(" - Saving your images", file = sys.stderr)
+ print_images(
+ base =file_base,
+ image_formats=kwargs["fileform"],
+ dpi=kwargs["dpi"],
+ no_timestamp = kwargs["no_timestamp"],
+ transparent= kwargs["no_transparent"])
+
+def run(args):
+ print(args)
+ if not opath.exists(args.input_file):
+ raise IOError("The input file does not exist: {}".format(
+ args.input_file))
+ df = pd.read_csv(args.input_file, header='infer', sep='\t')
+ # make sure that the column names that were specified are actually
+ # in the dataframe
+ if args.xcol not in df.columns:
+ raise IOError("""The x-column name that you specified, {}, is not in the
+ dataframe column names: {}""".format(args.xcol, df.columns))
+ if args.ycol not in df.columns:
+ raise IOError("""The y-column name that you specified, {}, is not in the
+ dataframe column names: {}""".format(args.ycol, df.columns))
+ print(" - Successfully read csv file. Here are a few lines:",
+ file = sys.stderr)
+ print(df.head(), file = sys.stderr)
+ print(" - Plotting {} on the x-axis".format(args.xcol),file=sys.stderr)
+ print(df[args.xcol].head(), file = sys.stderr)
+ print(" - Plotting {} on the y-axis".format(args.ycol),file=sys.stderr)
+ print(df[args.ycol].head(), file = sys.stderr)
+ custommargin(df=df.dropna(), **vars(args))
diff --git a/pauvre/functions.py b/pauvre/functions.py
new file mode 100644
index 0000000..c6ca2b8
--- /dev/null
+++ b/pauvre/functions.py
@@ -0,0 +1,347 @@
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-
+
+# pauvre
+# Copyright (c) 2016-2020 Darrin T. Schultz.
+#
+# This file is part of pauvre.
+#
+# pauvre is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+#
+# pauvre is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with pauvre. If not, see <http://www.gnu.org/licenses/>.
+
+from Bio import SeqIO
+import copy
+import gzip
+import matplotlib.pyplot as plt
+import numpy as np
+import os
+import pandas as pd
+from sys import stderr
+import time
+
+
+# this makes opening files more robust for different platforms
+# currently only used in GFFParse
+import codecs
+
+import warnings
+
+def print_images(base, image_formats, dpi,
+ transparent=False, no_timestamp = False):
+ """
+ Save the plot in multiple formats, with or without transparency
+ and with or without timestamps.
+ """
+ for fmt in image_formats:
+ if no_timestamp:
+ out_name = "{0}.{1}".format(base, fmt)
+ else:
+ out_name = "{0}_{1}.{2}".format(base, timestamp(), fmt)
+ try:
+ if fmt == 'png':
+ plt.savefig(out_name, dpi=dpi, transparent=transparent)
+ else:
+ plt.savefig(out_name, format=fmt, transparent=transparent)
+ except PermissionError:
+ # thanks to https://github.com/wdecoster for the suggestion
+ print("""You don't have permission to save pauvre plots to this
+ directory. Try changing the directory and running the script again!""")
+
+class GFFParse():
+ def __init__(self, filename, stop_codons=None, species=None):
+ self.filename = filename
+ self.samplename = os.path.splitext(os.path.basename(filename))[0]
+ self.species = species
+ self.featureDict = {"name": [],
+ "featType": [],
+ "start": [],
+ "stop": [],
+ "strand": []}
+ gffnames = ["sequence", "source", "featType", "start", "stop", "dunno1",
+ "strand", "dunno2", "tags"]
+ self.features = pd.read_csv(self.filename, comment='#',
+ sep='\t', names=gffnames)
+ self.features['name'] = self.features['tags'].apply(self._get_name)
+ self.features.drop('dunno1', 1, inplace=True)
+ self.features.drop('dunno2', 1, inplace=True)
+ self.features.reset_index(inplace=True, drop=True)
+ # warn the user if there are CDS or gene entries not divisible by three
+ self._check_triplets()
+ # sort the database by start
+ self.features.sort_values(by='start', ascending=True, inplace=True)
+ if stop_codons:
+ strip_codons = ['gene', 'CDS']
+ # if the direction is forward, subtract three from the stop to bring it closer to the start
+ self.features.loc[(self.features['featType'].isin(strip_codons)) & (self.features['strand'] == '+'), 'stop'] =\
+ self.features.loc[(self.features['featType'].isin(strip_codons))
+ & (self.features['strand'] == '+'), 'stop'] - 3
+ # if the direction is reverse, add three to the start (since the coords are flip-flopped)
+ self.features.loc[(self.features['featType'].isin(strip_codons)) & (self.features['strand'] == '-'), 'start'] =\
+ self.features.loc[(self.features['featType'].isin(strip_codons))
+ & (self.features['strand'] == '-'), 'start'] + 3
+ self.features['center'] = self.features['start'] + \
+ ((self.features['stop'] - self.features['start']) / 2)
+ # we need to add one since it doesn't account for the last base otherwise
+ self.features['width'] = abs(self.features['stop'] - self.features['start']) + 1
+ self.features['lmost'] = self.features.apply(self._determine_lmost, axis=1)
+ self.features['rmost'] = self.features.apply(self._determine_rmost, axis=1)
+ self.features['track'] = 0
+ if len(self.features.loc[self.features['tags'] == "Is_circular=true", 'stop']) < 1:
+ raise IOError("""The GFF file needs to have a tag ending in "Is_circular=true"
+ with a region from 1 to the number of bases in the mitogenome
+
+ example:
+ Bf201311 Geneious region 1 13337 . + 0 Is_circular=true
+ """)
+ self.seqlen = int(self.features.loc[self.features['tags'] == "Is_circular=true", 'stop'])
+ self.features.reset_index(inplace=True, drop=True)
+ #print("float", self.features.loc[self.features['name'] == 'COX1', 'center'])
+ #print("float cat", len(self.features.loc[self.features['name'] == 'CAT', 'center']))
+ # print(self.features)
+ # print(self.seqlen)
+
+ def set_features(self, new_features):
+ """all this does is reset the features pandas dataframe"""
+ self.features = new_features
+
+ def get_unique_genes(self):
+ """This returns a series of gene names"""
+ plottable = self.features.query(
+ "featType != 'tRNA' and featType != 'region' and featType != 'source'")
+ return set(plottable['name'].unique())
+
+ def shuffle(self):
+ """
+ this returns a list of all possible shuffles of features.
+ A shuffle is when the frontmost bit of coding + noncoding DNA up
+ until the next bit of coding DNA is removed and tagged on the
+ end of the sequence. In this case this process is represented by
+ shifting gff coordinates.
+ """
+ shuffles = []
+ # get the index of the first element
+ # get the index of the next thing
+ # subtract the indices of everything, then reset the ones that are below
+ # zero
+ done = False
+ shuffle_features = self.features[self.features['featType'].isin(
+ ['gene', 'rRNA', 'CDS', 'tRNA'])].copy(deep=True)
+ # we first add the shuffle features without reorganizing
+ # print("shuffle\n",shuffle_features)
+ add_first = copy.deepcopy(self)
+ add_first.set_features(shuffle_features)
+ shuffles.append(add_first)
+ # first gene is changed with every iteration
+ first_gene = list(shuffle_features['name'])[0]
+ # absolute first is the first gene in the original gff file, used to determine if we are done in this while loop
+ absolute_first = list(shuffle_features['name'])[0]
+ while not done:
+ # We need to prevent the case of shuffling in the middle of
+ # overlapped genes. Do this by ensuring that the the start of
+ # end of first gene is less than the start of the next gene.
+ first_stop = int(shuffle_features.loc[shuffle_features['name'] == first_gene, 'stop'])
+ next_gene = ""
+ for next_index in range(1, len(shuffle_features)):
+ # get the df of the next list, if len == 0, then it is a tRNA and we need to go to the next index
+ next_gene_df = list(
+ shuffle_features[shuffle_features['featType'].isin(['gene', 'rRNA', 'CDS'])]['name'])
+ if len(next_gene_df) != 0:
+ next_gene = next_gene_df[next_index]
+ next_start = int(shuffle_features.loc[shuffle_features['name'] == next_gene, 'start'])
+ #print("looking at {}, prev_stop is {}, start is {}".format(
+ # next_gene, first_stop, next_start))
+ #print(shuffle_features[shuffle_features['featType'].isin(['gene', 'rRNA', 'CDS'])])
+ # if the gene we're looking at and the next one don't overlap, move on
+ if first_stop < next_start:
+ break
+ #print("next_gene before checking for first is {}".format(next_gene))
+ if next_gene == absolute_first:
+ done = True
+ break
+ # now we can reset the first gene for the next iteration
+ first_gene = next_gene
+ shuffle_features = shuffle_features.copy(deep=True)
+ # figure out where the next start point is going to be
+ next_start = int(shuffle_features.loc[shuffle_features['name'] == next_gene, 'start'])
+ #print('next gene: {}'.format(next_gene))
+ shuffle_features['start'] = shuffle_features['start'] - next_start + 1
+ shuffle_features['stop'] = shuffle_features['stop'] - next_start + 1
+ shuffle_features['center'] = shuffle_features['center'] - next_start + 1
+ # now correct the values that are less than 0
+ shuffle_features.loc[shuffle_features['start'] < 1,
+ 'start'] = shuffle_features.loc[shuffle_features['start'] < 1, 'start'] + self.seqlen
+ shuffle_features.loc[shuffle_features['stop'] < 1, 'stop'] = shuffle_features.loc[shuffle_features['stop']
+ < 1, 'start'] + shuffle_features.loc[shuffle_features['stop'] < 1, 'width']
+ shuffle_features['center'] = shuffle_features['start'] + \
+ ((shuffle_features['stop'] - shuffle_features['start']) / 2)
+ shuffle_features['lmost'] = shuffle_features.apply(self._determine_lmost, axis=1)
+ shuffle_features['rmost'] = shuffle_features.apply(self._determine_rmost, axis=1)
+ shuffle_features.sort_values(by='start', ascending=True, inplace=True)
+ shuffle_features.reset_index(inplace=True, drop=True)
+ new_copy = copy.deepcopy(self)
+ new_copy.set_features(shuffle_features)
+ shuffles.append(new_copy)
+ #print("len shuffles: {}".format(len(shuffles)))
+ return shuffles
+
+ def couple(self, other_GFF, this_y=0, other_y=1):
+ """
+ Compares this set of features to another set and generates tuples of
+ (x,y) coordinate pairs to input into lsi
+ """
+ other_features = other_GFF.features
+ coordinates = []
+ for thisname in self.features['name']:
+ othermatch = other_features.loc[other_features['name'] == thisname, 'center']
+ if len(othermatch) == 1:
+ this_x = float(self.features.loc[self.features['name']
+ == thisname, 'center']) # /self.seqlen
+ other_x = float(othermatch) # /other_GFF.seqlen
+ # lsi can't handle vertical or horizontal lines, and we don't
+ # need them either for our comparison. Don't add if equivalent.
+ if this_x != other_x:
+ these_coords = ((this_x, this_y), (other_x, other_y))
+ coordinates.append(these_coords)
+ return coordinates
+
+ def _check_triplets(self):
+ """This method verifies that all entries of featType gene and CDS are
+ divisible by three"""
+ genesCDSs = self.features.query("featType == 'CDS' or featType == 'gene'")
+ not_trips = genesCDSs.loc[((abs(genesCDSs['stop'] - genesCDSs['start']) + 1) % 3) > 0, ]
+ if len(not_trips) > 0:
+ print_string = ""
+ print_string += "There are CDS and gene entries that are not divisible by three\n"
+ print_string += str(not_trips)
+ warnings.warn(print_string, SyntaxWarning)
+
+ def _get_name(self, tag_value):
+ """This extracts a name from a single row in 'tags' of the pandas
+ dataframe
+ """
+ try:
+ if ";" in tag_value:
+ name = tag_value[5:].split(';')[0]
+ else:
+ name = tag_value[5:].split()[0]
+ except:
+ name = tag_value
+ print("Couldn't correctly parse {}".format(
+ tag_value))
+ return name
+
+ def _determine_lmost(self, row):
+ """Booleans don't work well for pandas dataframes, so I need to use apply
+ """
+ if row['start'] < row['stop']:
+ return row['start']
+ else:
+ return row['stop']
+
+ def _determine_rmost(self, row):
+ """Booleans don't work well for pandas dataframes, so I need to use apply
+ """
+ if row['start'] < row['stop']:
+ return row['stop']
+ else:
+ return row['start']
+
+
+def parse_fastq_length_meanqual(fastq):
+ """
+ arguments:
+ <fastq> the fastq file path. Hopefully it has been verified to exist already
+
+ purpose:
+ This function parses a fastq and returns a pandas dataframe of read lengths
+ and read meanQuals.
+ """
+ # First try to open the file with the gzip package. It will crash
+ # if the file is not gzipped, so this is an easy way to test if
+ # the fastq file is gzipped or not.
+ try:
+ handle = gzip.open(fastq, "rt")
+ length, meanQual = _fastq_parse_helper(handle)
+ except:
+ handle = open(fastq, "r")
+ length, meanQual = _fastq_parse_helper(handle)
+
+ handle.close()
+ df = pd.DataFrame(list(zip(length, meanQual)), columns=['length', 'meanQual'])
+ return df
+
+
+def filter_fastq_length_meanqual(df, min_len, max_len,
+ min_mqual, max_mqual):
+ querystring = "length >= {0} and meanQual >= {1}".format(min_len, min_mqual)
+ if max_len != None:
+ querystring += " and length <= {}".format(max_len)
+ if max_mqual != None:
+ querystring += " and meanQual <= {}".format(max_mqual)
+ print("Keeping reads that satisfy: {}".format(querystring), file=stderr)
+ filtdf = df.query(querystring)
+ #filtdf["length"] = pd.to_numeric(filtdf["length"], errors='coerce')
+ #filtdf["meanQual"] = pd.to_numeric(filtdf["meanQual"], errors='coerce')
+ return filtdf
+
+
+def _fastq_parse_helper(handle):
+ length = []
+ meanQual = []
+ for record in SeqIO.parse(handle, "fastq"):
+ if len(record) > 0:
+ meanQual.append(_arithmetic_mean(record.letter_annotations["phred_quality"]))
+ length.append(len(record))
+ return length, meanQual
+
+
+def _geometric_mean(phred_values):
+ """in case I want geometric mean in the future, can calculate it like this"""
+ # np.mean(record.letter_annotations["phred_quality"]))
+ pass
+
+
+def _arithmetic_mean(phred_values):
+ """
+ Convert Phred to 1-accuracy (error probabilities), calculate the arithmetic mean,
+ log transform back to Phred.
+ """
+ if not isinstance(phred_values, np.ndarray):
+ phred_values = np.array(phred_values)
+ return _erate_to_phred(np.mean(_phred_to_erate(phred_values)))
+
+
+def _phred_to_erate(phred_values):
+ """
+ converts a list or numpy array of phred values to a numpy array
+ of error rates
+ """
+ if not isinstance(phred_values, np.ndarray):
+ phred_values = np.array(phred_values)
+ return np.power(10, (-1 * (phred_values / 10)))
+
+
+def _erate_to_phred(erate_values):
+ """
+ converts a list or numpy array of error rates to a numpy array
+ of phred values
+ """
+ if not isinstance(erate_values, np.ndarray):
+ phred_values = np.array(erate_values)
+ return -10 * np.log10(erate_values)
+
+def timestamp():
+ """
+ Returns the current time in :samp:`YYYYMMDD_HHMMSS` format.
+ """
+ return time.strftime("%Y%m%d_%H%M%S")
diff --git a/pauvre/gfftools.py b/pauvre/gfftools.py
new file mode 100644
index 0000000..09b59bc
--- /dev/null
+++ b/pauvre/gfftools.py
@@ -0,0 +1,581 @@
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-
+
+# pauvre - a pore plotting package
+# Copyright (c) 2016-2018 Darrin T. Schultz. All rights reserved.
+#
+# This file is part of pauvre.
+#
+# pauvre is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+#
+# pauvre is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with pauvre. If not, see <http://www.gnu.org/licenses/>.
+
+"""This file contains things related to parsing and plotting GFF files"""
+
+import copy
+from matplotlib.path import Path
+import matplotlib.patches as patches
+
+global chevron_width
+global arrow_width
+global min_text
+global text_cutoff
+
+arrow_width = 80
+chevron_width = 40
+min_text = 550
+text_cutoff = 150
+import sys
+
+global colorMap
+colorMap = {'gene': 'green', 'CDS': 'green', 'tRNA':'pink', 'rRNA':'red',
+ 'misc_feature':'purple', 'rep_origin':'orange', 'spacebar':'white',
+ 'ORF':'orange'}
+
+def _plot_left_to_right_introns(panel, geneid, db, y_pos, text = None):
+ """ plots a left to right patch with introns when there are no intervening
+ sequences to consider. Uses a gene id and gffutils database as input.
+ b
+ a .-=^=-. c
+ 1__________2---/ e `---1__________2
+ | #lff \f d| #lff \
+ | left to \3 | left to \3
+ | right / | right /
+ 5___________/4 5___________/4
+ """
+ #first we need to determine the number of exons
+ bar_thickness = 0.75
+ #now we can start plotting the exons
+ exonlist = list(db.children(geneid, featuretype='CDS', order_by="start"))
+ for i in range(len(exonlist)):
+ cds_start = exonlist[i].start
+ cds_stop = exonlist[i].stop
+ verts = [(cds_start, y_pos + bar_thickness), #1
+ (cds_stop - chevron_width, y_pos + bar_thickness), #2
+ (cds_stop, y_pos + (bar_thickness/2)), #3
+ (cds_stop - chevron_width, y_pos), #4
+ (cds_start, y_pos), #5
+ (cds_start, y_pos + bar_thickness), #1
+ ]
+ codes = [Path.MOVETO,
+ Path.LINETO,
+ Path.LINETO,
+ Path.LINETO,
+ Path.LINETO,
+ Path.CLOSEPOLY,
+ ]
+ path = Path(verts, codes)
+ patch = patches.PathPatch(path, lw = 0,
+ fc=colorMap['CDS'] )
+ panel.add_patch(patch)
+
+ # we must draw the splice junction
+ if i < len(exonlist) - 1:
+ next_start = exonlist[i+1].start
+ next_stop = exonlist[i+1].stop
+ middle = cds_stop + ((next_start - cds_stop)/2)
+
+ verts = [(cds_stop - chevron_width, y_pos + bar_thickness), #2/a
+ (middle, y_pos + 0.95), #b
+ (next_start, y_pos + bar_thickness), #c
+ (next_start, y_pos + bar_thickness - 0.05), #d
+ (middle, y_pos + 0.95 - 0.05), #e
+ (cds_stop - chevron_width, y_pos + bar_thickness -0.05), #f
+ (cds_stop - chevron_width, y_pos + bar_thickness), #2/a
+ ]
+ codes = [Path.MOVETO,
+ Path.LINETO,
+ Path.LINETO,
+ Path.LINETO,
+ Path.LINETO,
+ Path.LINETO,
+ Path.CLOSEPOLY,
+ ]
+ path = Path(verts, codes)
+ patch = patches.PathPatch(path, lw = 0,
+ fc=colorMap['CDS'] )
+ panel.add_patch(patch)
+
+ return panel
+
+def _plot_left_to_right_introns_top(panel, geneid, db, y_pos, text = None):
+ """ slightly different from the above version such thatsplice junctions
+ are more visually explicit.
+
+ plots a left to right patch with introns when there are no intervening
+ sequences to consider. Uses a gene id and gffutils database as input.
+ b
+ a .-=^=-. c
+ 1_____________2---/ e `---1_____________2
+ | #lff /f d| #lff /
+ | left to / | left to /
+ | right / | right /
+ 4_________/3 4_________/3
+ """
+ #first we need to determine the number of exons
+ bar_thickness = 0.75
+ #now we can start plotting the exons
+ exonlist = list(db.children(geneid, featuretype='CDS', order_by="start"))
+ for i in range(len(exonlist)):
+ cds_start = exonlist[i].start
+ cds_stop = exonlist[i].stop
+ verts = [(cds_start, y_pos + bar_thickness), #1
+ (cds_stop, y_pos + bar_thickness), #2
+ (cds_stop - chevron_width, y_pos), #4
+ (cds_start, y_pos), #5
+ (cds_start, y_pos + bar_thickness), #1
+ ]
+ codes = [Path.MOVETO,
+ Path.LINETO,
+ Path.LINETO,
+ Path.LINETO,
+ Path.CLOSEPOLY,
+ ]
+ path = Path(verts, codes)
+ patch = patches.PathPatch(path, lw = 0,
+ fc=colorMap['CDS'] )
+ panel.add_patch(patch)
+
+ # we must draw the splice junction
+ if i < len(exonlist) - 1:
+ next_start = exonlist[i+1].start
+ next_stop = exonlist[i+1].stop
+ middle = cds_stop + ((next_start - cds_stop)/2)
+
+ verts = [(cds_stop-5, y_pos + bar_thickness), #2/a
+ (middle, y_pos + 0.95), #b
+ (next_start, y_pos + bar_thickness), #c
+ (next_start, y_pos + bar_thickness - 0.05), #d
+ (middle, y_pos + 0.95 - 0.05), #e
+ (cds_stop-5, y_pos + bar_thickness -0.05), #f
+ (cds_stop-5, y_pos + bar_thickness), #2/a
+ ]
+ codes = [Path.MOVETO,
+ Path.LINETO,
+ Path.LINETO,
+ Path.LINETO,
+ Path.LINETO,
+ Path.LINETO,
+ Path.CLOSEPOLY,
+ ]
+ path = Path(verts, codes)
+ patch = patches.PathPatch(path, lw = 0,
+ fc=colorMap['CDS'] )
+ panel.add_patch(patch)
+
+ return panel
+
+def _plot_lff(panel, left_df, right_df, colorMap, y_pos, bar_thickness, text):
+ """ plots a lff patch
+ 1__________2 ____________
+ | #lff \ \ #rff \
+ | left for \3 \ right for \
+ | forward / / forward /
+ 5___________/4 /___________/
+ """
+ #if there is only one feature to plot, then just plot it
+
+ print("plotting lff")
+ verts = [(left_df['start'], y_pos + bar_thickness), #1
+ (right_df['start'] - chevron_width, y_pos + bar_thickness), #2
+ (left_df['stop'], y_pos + (bar_thickness/2)), #3
+ (right_df['start'] - chevron_width, y_pos), #4
+ (left_df['start'], y_pos), #5
+ (left_df['start'], y_pos + bar_thickness), #1
+ ]
+ codes = [Path.MOVETO,
+ Path.LINETO,
+ Path.LINETO,
+ Path.LINETO,
+ Path.LINETO,
+ Path.CLOSEPOLY,
+ ]
+ path = Path(verts, codes)
+ patch = patches.PathPatch(path, lw = 0,
+ fc=colorMap[left_df['featType']] )
+ text_width = left_df['width']
+ if text and text_width >= min_text:
+ panel = _plot_label(panel, left_df, y_pos, bar_thickness)
+ elif text and text_width < min_text and text_width >= text_cutoff:
+ panel = _plot_label(panel, left_df,
+ y_pos, bar_thickness,
+ rotate = True, arrow = True)
+
+ return panel, patch
+
+def _plot_label(panel, df, y_pos, bar_thickness, rotate = False, arrow = False):
+ # handles the case where a dataframe was passed
+ fontsize = 8
+ rotation = 0
+ if rotate:
+ fontsize = 5
+ rotation = 90
+ if len(df) == 1:
+ x =((df.loc[0, 'stop'] - df.loc[0, 'start'])/2) + df.loc[0, 'start']
+ y = y_pos + (bar_thickness/2)
+ # if we need to center somewhere other than the arrow, need to adjust
+ # for the direction of the arrow
+ # it doesn't look good if it shifts by the whole arrow width, so only
+ # shift by half the arrow width
+ if arrow:
+ if df.loc[0, 'strand'] == "+":
+ shift_start = df.loc[0, 'start']
+ else:
+ shift_start = df.loc[0, 'start'] + (arrow_width/2)
+ x =((df.loc[0, 'stop'] - (arrow_width/2) - df.loc[0, 'start'])/2) + shift_start
+ panel.text(x, y,
+ df.loc[0, 'name'], fontsize = fontsize,
+ ha='center', va='center',
+ color = 'white', family = 'monospace',
+ zorder = 100, rotation = rotation)
+ # and the case where a series was passed
+ else:
+ x = ((df['stop'] - df['start'])/2) + df['start']
+ y = y_pos + (bar_thickness/2)
+ if arrow:
+ if df['strand'] == "+":
+ shift_start = df['start']
+ else:
+ shift_start = df['start'] + (arrow_width/2)
+ x =((df['stop'] - (arrow_width/2) - df['start'])/2) + shift_start
+ panel.text(x, y,
+ df['name'], fontsize = fontsize,
+ ha='center', va='center',
+ color = 'white', family = 'monospace',
+ zorder = 100, rotation = rotation)
+
+ return panel
+
+def _plot_rff(panel, left_df, right_df, colorMap, y_pos, bar_thickness, text):
+ """ plots a rff patch
+ ____________ 1__________2
+ | #lff \ \ #rff \
+ | left for \ 6\ right for \3
+ | forward / / forward /
+ |___________/ /5__________/4
+ """
+ #if there is only one feature to plot, then just plot it
+
+ print("plotting rff")
+ verts = [(right_df['start'], y_pos + bar_thickness), #1
+ (right_df['stop'] - arrow_width, y_pos + bar_thickness), #2
+ (right_df['stop'], y_pos + (bar_thickness/2)), #3
+ (right_df['stop'] - arrow_width, y_pos), #4
+ (right_df['start'], y_pos), #5
+ (left_df['stop'] + chevron_width, y_pos + (bar_thickness/2)), #6
+ (right_df['start'], y_pos + bar_thickness), #1
+ ]
+ codes = [Path.MOVETO,
+ Path.LINETO,
+ Path.LINETO,
+ Path.LINETO,
+ Path.LINETO,
+ Path.LINETO,
+ Path.CLOSEPOLY,
+ ]
+ path = Path(verts, codes)
+ patch = patches.PathPatch(path, lw = 0,
+ fc=colorMap[right_df['featType']] )
+ text_width = right_df['width']
+ if text and text_width >= min_text:
+ panel = _plot_label(panel, right_df, y_pos, bar_thickness)
+ elif text and text_width < min_text and text_width >= text_cutoff:
+ panel = _plot_label(panel, right_df,
+ y_pos, bar_thickness, rotate = True)
+ return panel, patch
+
+def x_offset_gff(GFFParseobj, x_offset):
+ """Takes in a gff object (a gff file parsed as a pandas dataframe),
+ and an x_offset value and shifts the start, stop, center, lmost, and rmost.
+
+ Returns a GFFParse object with the shifted values in GFFParse.features.
+ """
+ for columnname in ['start', 'stop', 'center', 'lmost', 'rmost']:
+ GFFParseobj.features[columnname] = GFFParseobj.features[columnname] + x_offset
+ return GFFParseobj
+
+def gffplot_horizontal(figure, panel, args, gff_object,
+ track_width=0.2, start_y=0.1, **kwargs):
+ """
+ this plots horizontal things from gff files. it was probably written for synplot,
+ as the browser does not use this at all.
+ """
+ # Because this size should be relative to the circle that it is plotted next
+ # to, define the start_radius as the place to work from, and the width of
+ # each track.
+ colorMap = {'gene': 'green', 'CDS': 'green', 'tRNA':'pink', 'rRNA':'red',
+ 'misc_feature':'purple', 'rep_origin':'orange', 'spacebar':'white'}
+ augment = 0
+ bar_thickness = 0.9 * track_width
+ # return these at the end
+ myPatches=[]
+ plot_order = []
+
+ idone = False
+ # we need to filter out the tRNAs since those are plotted last
+ plottable_features = gff_object.features.query("featType != 'tRNA' and featType != 'region' and featType != 'source'")
+ plottable_features.reset_index(inplace=True, drop=True)
+ print(plottable_features)
+
+ len_plottable = len(plottable_features)
+ print('len plottable', len_plottable)
+ # - this for loop relies on the gff features to already be sorted
+ # - The algorithm for this loop works by starting at the 0th index of the
+ # plottable features (i).
+ # - It then looks to see if the next object (the jth) overlaps with the
+ # ith element.
+ i = 0
+ j = 1
+ while i < len(plottable_features):
+ if i + j == len(plottable_features):
+ #we have run off of the df and need to include everything from i to the end
+ these_features = plottable_features.loc[i::,].copy(deep=True)
+ these_features = these_features.reset_index()
+ print(these_features)
+ plot_order.append(these_features)
+ i = len(plottable_features)
+ break
+ print(" - i,j are currently: {},{}".format(i, j))
+ stop = plottable_features.loc[i]["stop"]
+ start = plottable_features.loc[i+j]["start"]
+ print("stop: {}. start: {}.".format(stop, start))
+ if plottable_features.loc[i]["stop"] <= plottable_features.loc[i+j]["start"]:
+ print(" - putting elements {} through (including) {} together".format(i, i+j))
+ these_features = plottable_features.loc[i:i+j-1,].copy(deep=True)
+ these_features = these_features.reset_index()
+ print(these_features)
+ plot_order.append(these_features)
+ i += 1
+ j = 1
+ else:
+ j += 1
+
+ #while idone == False:
+ # print("im in the overlap-pairing while loop i={}".format(i))
+ # # look ahead at all of the elements that overlap with the ith element
+ # jdone = False
+ # j = 1
+ # this_set_minimum_index = i
+ # this_set_maximum_index = i
+ # while jdone == False:
+ # print("new i= {} j={} len={}".format(i, j, len_plottable))
+ # print("len plottable in jdone: {}".format(len_plottable))
+ # print("plottable features in jdone:\n {}".format(plottable_features))
+ # # first make sure that we haven't gone off the end of the dataframe
+ # # This is an edge case where i has a jth element that overlaps with it,
+ # # and j is the last element in the plottable features.
+ # if i+j == len_plottable:
+ # print("i+j == len_plottable")
+ # # this checks for the case that i is the last element of the
+ # # plottable features.
+ # # In both of the above cases, we are done with both the ith and
+ # # the jth features.
+ # if i == len_plottable-1:
+ # print("i == len_plottable-1")
+
+ # # this is the last analysis, so set idone to true
+ # # to finish after this
+ # idone = True
+ # # the last one can't be in its own group, so just add it solo
+ # these_features = plottable_features.loc[this_set_minimum_index:this_set_maximum_index,].copy(deep=True)
+ # plot_order.append(these_features.reset_index(drop=True))
+ # break
+ # jdone = True
+ # else:
+ # print("i+j != len_plottable")
+ # # if the lmost of the next gene overlaps with the rmost of
+ # # the current one, it overlaps and couple together
+ # if plottable_features.loc[i+j, 'lmost'] < plottable_features.loc[i, 'rmost']:
+ # print("lmost < rmost")
+ # # note that this feature overlaps with the current
+ # this_set_maximum_index = i+j
+ # # ... and we need to look at the next in line
+ # j += 1
+ # else:
+ # print("lmost !< rmost")
+ # i += 1 + (this_set_maximum_index - this_set_minimum_index)
+ # #add all of the things that grouped together once we don't find any more groups
+ # these_features = plottable_features.loc[this_set_minimum_index:this_set_maximum_index,].copy(deep=True)
+ # plot_order.append(these_features.reset_index(drop=True))
+ # jdone = True
+ # print("plot order is now: {}".format(plot_order))
+ # print("jdone: {}".format(str(jdone)))
+
+ for feature_set in plot_order:
+ # plot_feature_hori handles overlapping cases as well as normal cases
+ panel, patches = gffplot_feature_hori(figure, panel, feature_set, colorMap,
+ start_y, bar_thickness, text = True)
+ for each in patches:
+ print("there are {} patches after gffplot_feature_hori".format(len(patches)))
+ print(each)
+ myPatches.append(each)
+ print("length of myPatches is: {}".format(len(myPatches)))
+
+ # Now we add all of the tRNAs to this to plot, do it last to overlay
+ # everything else
+ tRNAs = gff_object.features.query("featType == 'tRNA'")
+ tRNAs.reset_index(inplace=True, drop = True)
+ tRNA_bar_thickness = bar_thickness * (0.8)
+ tRNA_start_y = start_y + ((bar_thickness - tRNA_bar_thickness)/2)
+ for i in range(0,len(tRNAs)):
+ this_feature = tRNAs[i:i+1].copy(deep=True)
+ this_feature.reset_index(inplace=True, drop = True)
+ panel, patches = gffplot_feature_hori(figure, panel, this_feature, colorMap,
+ tRNA_start_y, tRNA_bar_thickness, text = True)
+ for patch in patches:
+ myPatches.append(patch)
+ print("There are {} patches at the end of gffplot_horizontal()".format(len(myPatches)))
+ return panel, myPatches
+
+def gffplot_feature_hori(figure, panel, feature_df,
+ colorMap, y_pos, bar_thickness, text=True):
+ """This plots the track for a feature, and if there is something for
+ 'this_feature_overlaps_feature', then there is special processing to
+ add the white bar and the extra slope for the chevron
+ """
+ myPatches = []
+ #if there is only one feature to plot, then just plot it
+ if len(feature_df) == 1:
+ #print("plotting a single thing: {} {}".format(str(feature_df['sequence']).split()[1],
+ # str(feature_df['featType']).split()[1] ))
+ #print(this_feature['name'], "is not overlapping")
+ # This plots this shape: 1_________2 2_________1
+ # | forward \3 3/ reverse |
+ # |5__________/4 \4________5|
+ if feature_df.loc[0,'strand'] == '+':
+ verts = [(feature_df.loc[0, 'start'], y_pos + bar_thickness), #1
+ (feature_df.loc[0, 'stop'] - arrow_width, y_pos + bar_thickness), #2
+ (feature_df.loc[0, 'stop'], y_pos + (bar_thickness/2)), #3
+ (feature_df.loc[0, 'stop'] - arrow_width, y_pos), #4
+ (feature_df.loc[0, 'start'], y_pos), #5
+ (feature_df.loc[0, 'start'], y_pos + bar_thickness)] #1
+ elif feature_df.loc[0,'strand'] == '-':
+ verts = [(feature_df.loc[0, 'stop'], y_pos + bar_thickness), #1
+ (feature_df.loc[0, 'start'] + arrow_width, y_pos + bar_thickness), #2
+ (feature_df.loc[0, 'start'], y_pos + (bar_thickness/2)), #3
+ (feature_df.loc[0, 'start'] + arrow_width, y_pos), #4
+ (feature_df.loc[0, 'stop'], y_pos), #5
+ (feature_df.loc[0, 'stop'], y_pos + bar_thickness)] #1
+ feat_width = feature_df.loc[0,'width']
+ if text and feat_width >= min_text:
+ panel = _plot_label(panel, feature_df.loc[0,],
+ y_pos, bar_thickness)
+ elif text and feat_width < min_text and feat_width >= text_cutoff:
+ panel = _plot_label(panel, feature_df.loc[0,],
+ y_pos, bar_thickness,
+ rotate = True, arrow = True)
+
+ codes = [Path.MOVETO,
+ Path.LINETO,
+ Path.LINETO,
+ Path.LINETO,
+ Path.LINETO,
+ Path.CLOSEPOLY]
+ path = Path(verts, codes)
+ print("normal path is: {}".format(path))
+ # If the feature itself is smaller than the arrow, we need to take special measures to
+ if feature_df.loc[0,'width'] <= arrow_width:
+ path = Path([verts[i] for i in [0,2,4,5]],
+ [codes[i] for i in [0,2,4,5]])
+ patch = patches.PathPatch(path, lw = 0,
+ fc=colorMap[feature_df.loc[0, 'featType']] )
+ myPatches.append(patch)
+ # there are four possible scenarios if there are two overlapping sequences:
+ # ___________ ____________ ____________ ___________
+ # | #1 \ \ #1 \ / #2 / / #2 |
+ # | both seqs \ \ both seqs \ / both seqs / / both seqs |
+ # | forward / / forward / \ reverse \ \ reverse |
+ # |__________/ /___________/ \___________\ \___________|
+ # ___________ _____________ ____________ _ _________
+ # | #3 \ \ #3 | / #2 _| #2 \
+ # | one seq \ \ one seq | / one seq |_ one seq \
+ # | forward \ \ reverse | \ reverse _| forward /
+ # |_____________\ \_________| \__________|_ ___________/
+ #
+ # These different scenarios can be thought of as different left/right
+ # flanking segment types.
+ # In the annotation #rff:
+ # - 'r' refers to the annotation type as being on the right
+ # - the first 'f' refers to the what element is to the left of this one.
+ # Since it is forward the 5' end of this annotation must be a chevron
+ # - the second 'f' refers to the right side of this element. Since it is
+ # forward it must be a normal arrow.
+ # being on the right
+ #
+ # *LEFT TYPES* *RIGHT TYPES*
+ # ____________ ____________
+ # | #lff \ \ #rff \
+ # | left for \ \ right for \
+ # | forward / / forward /
+ # |___________/ /___________/
+ # ___________ _____________
+ # | #lfr \ \ #rfr |
+ # | left for \ \ right for |
+ # | reverse \ \ reverse |
+ # |_____________\ \_________|
+ # ____________ ___________
+ # / #lrr / / #rrr |
+ # / left rev / / right rev |
+ # \ reverse \ \ reverse |
+ # \___________\ \___________|
+ # ____________ __________
+ # / #lrf _| _| #rrf \
+ # / left rev |_ | _ right rev \
+ # \ forward _| _| forward /
+ # \__________| |____________/
+ #
+ # To properly plot these elements, we must go through each element of the
+ # feature_df to determine which patch type it is.
+ elif len(feature_df) == 2:
+ print("im in here feat len=2")
+ for i in range(len(feature_df)):
+ # this tests for which left type we're dealing with
+ if i == 0:
+ # type could be lff or lfr
+ if feature_df.loc[i, 'strand'] == '+':
+ if feature_df.loc[i + 1, 'strand'] == '+':
+ # plot a lff type
+ panel, patch = _plot_lff(panel, feature_df.iloc[i,], feature_df.iloc[i+1,],
+ colorMap, y_pos, bar_thickness, text)
+ myPatches.append(patch)
+ elif feature_df.loc[i + 1, 'strand'] == '-':
+ #plot a lfr type
+ raise IOError("can't plot {} patches yet".format("lfr"))
+ # or type could be lrr or lrf
+ elif feature_df.loc[i, 'strand'] == '-':
+ if feature_df.loc[i + 1, 'strand'] == '+':
+ # plot a lrf type
+ raise IOError("can't plot {} patches yet".format("lrf"))
+ elif feature_df.loc[i + 1, 'strand'] == '-':
+ #plot a lrr type
+ raise IOError("can't plot {} patches yet".format("lrr"))
+ # in this case we're only dealing with 'right type' patches
+ elif i == len(feature_df) - 1:
+ # type could be rff or rfr
+ if feature_df.loc[i-1, 'strand'] == '+':
+ if feature_df.loc[i, 'strand'] == '+':
+ # plot a rff type
+ panel, patch = _plot_rff(panel, feature_df.iloc[i-1,], feature_df.iloc[i,],
+ colorMap, y_pos, bar_thickness, text)
+ myPatches.append(patch)
+ elif feature_df.loc[i, 'strand'] == '-':
+ #plot a rfr type
+ raise IOError("can't plot {} patches yet".format("rfr"))
+ # or type could be rrr or rrf
+ elif feature_df.loc[i-1, 'strand'] == '-':
+ if feature_df.loc[i, 'strand'] == '+':
+ # plot a rrf type
+ raise IOError("can't plot {} patches yet".format("rrf"))
+ elif feature_df.loc[i, 'strand'] == '-':
+ #plot a rrr type
+ raise IOError("can't plot {} patches yet".format("rrr"))
+ return panel, myPatches
diff --git a/pauvre/lsi/Q.py b/pauvre/lsi/Q.py
new file mode 100755
index 0000000..96563e4
--- /dev/null
+++ b/pauvre/lsi/Q.py
@@ -0,0 +1,77 @@
+# Binary search tree that holds status of sweep line. Only leaves hold values.
+# Operations for finding left and right neighbors of a query point p and finding which segments contain p.
+# Author: Sam Lichtenberg
+# Email: splichte@princeton.edu
+# Date: 09/02/2013
+
+from pauvre.lsi.helper import *
+
+ev = 0.00000001
+
+class Q:
+ def __init__(self, key, value):
+ self.key = key
+ self.value = value
+ self.left = None
+ self.right = None
+
+ def find(self, key):
+ if self.key is None:
+ return False
+ c = compare_by_y(key, self.key)
+ if c==0:
+ return True
+ elif c==-1:
+ if self.left:
+ self.left.find(key)
+ else:
+ return False
+ else:
+ if self.right:
+ self.right.find(key)
+ else:
+ return False
+ def insert(self, key, value):
+ if self.key is None:
+ self.key = key
+ self.value = value
+ c = compare_by_y(key, self.key)
+ if c==0:
+ self.value += value
+ elif c==-1:
+ if self.left is None:
+ self.left = Q(key, value)
+ else:
+ self.left.insert(key, value)
+ else:
+ if self.right is None:
+ self.right = Q(key, value)
+ else:
+ self.right.insert(key, value)
+ # must return key AND value
+ def get_and_del_min(self, parent=None):
+ if self.left is not None:
+ return self.left.get_and_del_min(self)
+ else:
+ k = self.key
+ v = self.value
+ if parent:
+ parent.left = self.right
+ # i.e. is root node
+ else:
+ if self.right:
+ self.key = self.right.key
+ self.value = self.right.value
+ self.left = self.right.left
+ self.right = self.right.right
+ else:
+ self.key = None
+ return k,v
+
+ def print_tree(self):
+ if self.left:
+ self.left.print_tree()
+ print(self.key)
+ print(self.value)
+ if self.right:
+ self.right.print_tree()
diff --git a/pauvre/lsi/T.py b/pauvre/lsi/T.py
new file mode 100755
index 0000000..06554d7
--- /dev/null
+++ b/pauvre/lsi/T.py
@@ -0,0 +1,322 @@
+# Binary search tree that holds status of sweep line. Only leaves hold values.
+# Operations for finding left and right neighbors of a query point p and finding which segments contain p.
+# Author: Sam Lichtenberg
+# Email: splichte@princeton.edu
+# Date: 09/02/2013
+
+from pauvre.lsi.helper import *
+
+ev = 0.00000001
+
+class T:
+ def __init__(self):
+ self.root = Node(None, None, None, None)
+ def contain_p(self, p):
+ if self.root.value is None:
+ return [[], []]
+ lists = [[], []]
+ self.root.contain_p(p, lists)
+ return (lists[0], lists[1])
+ def get_left_neighbor(self, p):
+ if self.root.value is None:
+ return None
+ return self.root.get_left_neighbor(p)
+ def get_right_neighbor(self, p):
+ if self.root.value is None:
+ return None
+ return self.root.get_right_neighbor(p)
+ def insert(self, key, s):
+ if self.root.value is None:
+ self.root.left = Node(s, None, None, self.root)
+ self.root.value = s
+ self.root.m = get_slope(s)
+ else:
+ (node, path) = self.root.find_insert_pt(key, s)
+ if path == 'r':
+ node.right = Node(s, None, None, node)
+ node.right.adjust()
+ elif path == 'l':
+ node.left = Node(s, None, None, node)
+ else:
+ # this means matching Node was a leaf
+ # need to make a new internal Node
+ if node.compare_to_key(key) < 0 or (node.compare_to_key(key)==0 and node.compare_lower(key, s) < 1):
+ new_internal = Node(s, None, node, node.parent)
+ new_leaf = Node(s, None, None, new_internal)
+ new_internal.left = new_leaf
+ if node is node.parent.left:
+ node.parent.left = new_internal
+ node.adjust()
+ else:
+ node.parent.right = new_internal
+ else:
+ new_internal = Node(node.value, node, None, node.parent)
+ new_leaf = Node(s, None, None, new_internal)
+ new_internal.right = new_leaf
+ if node is node.parent.left:
+ node.parent.left = new_internal
+ new_leaf.adjust()
+ else:
+ node.parent.right = new_internal
+ node.parent = new_internal
+
+ def delete(self, p, s):
+ key = p
+ node = self.root.find_delete_pt(key, s)
+ val = node.value
+ if node is node.parent.left:
+ parent = node.parent.parent
+ if parent is None:
+ if self.root.right is not None:
+ if self.root.right.left or self.root.right.right:
+ self.root = self.root.right
+ self.root.parent = None
+ else:
+ self.root.left = self.root.right
+ self.root.value = self.root.right.value
+ self.root.m = self.root.right.m
+ self.root.right = None
+ else:
+ self.root.left = None
+ self.root.value = None
+ elif node.parent is parent.left:
+ parent.left = node.parent.right
+ node.parent.right.parent = parent
+ else:
+ parent.right = node.parent.right
+ node.parent.right.parent = parent
+ else:
+ parent = node.parent.parent
+ if parent is None:
+ if self.root.left:
+ # switch properties
+ if self.root.left.right or self.root.left.left:
+ self.root = self.root.left
+ self.root.parent = None
+ else:
+ self.root.right = None
+ else:
+ self.root.right = None
+ self.root.value = None
+ elif node.parent is parent.left:
+ parent.left = node.parent.left
+ node.parent.left.parent = parent
+ farright = node.parent.left
+ while farright.right is not None:
+ farright = farright.right
+ farright.adjust()
+ else:
+ parent.right = node.parent.left
+ node.parent.left.parent = parent
+ farright = node.parent.left
+ while farright.right is not None:
+ farright = farright.right
+ farright.adjust()
+ return val
+
+ def print_tree(self):
+ self.root.print_tree()
+class Node:
+ def __init__(self, value, left, right, parent):
+ self.value = value # associated line segment
+ self.left = left
+ self.right = right
+ self.parent = parent
+ self.m = None
+ if value is not None:
+ self.m = get_slope(value)
+
+ # compares line segment at y-val of p to p
+ # TODO: remove this and replace with get_x_at
+ def compare_to_key(self, p):
+ x0 = self.value[0][0]
+ y0 = self.value[0][1]
+ y1 = p[1]
+ if self.m != 0 and self.m is not None:
+ x1 = x0 - float(y0-y1)/self.m
+ return compare_by_x(p, (x1, y1))
+ else:
+ x1 = p[0]
+ return 0
+
+ def get_left_neighbor(self, p):
+ neighbor = None
+ n = self
+ if n.left is None and n.right is None:
+ return neighbor
+ last_right = None
+ found = False
+ while not found:
+ c = n.compare_to_key(p)
+ if c < 1 and n.left:
+ n = n.left
+ elif c==1 and n.right:
+ n = n.right
+ last_right = n.parent
+ else:
+ found = True
+ c = n.compare_to_key(p)
+ if c==0:
+ if n is n.parent.right:
+ return n.parent
+ else:
+ goright = None
+ if last_right:
+ goright =last_right.left
+ return self.get_lr(None, goright)[0]
+ # n stores the highest-value in the left subtree
+ if c==-1:
+ goright = None
+ if last_right:
+ goright = last_right.left
+ return self.get_lr(None, goright)[0]
+ if c==1:
+ neighbor = n
+ return neighbor
+
+ def get_right_neighbor(self, p):
+ neighbor = None
+ n = self
+ if n.left is None and n.right is None:
+ return neighbor
+ last_left = None
+ found = False
+ while not found:
+ c = n.compare_to_key(p)
+ if c==0 and n.right:
+ n = n.right
+ elif c < 0 and n.left:
+ n = n.left
+ last_left = n.parent
+ elif c==1 and n.right:
+ n = n.right
+ else:
+ found = True
+ c = n.compare_to_key(p)
+ # can be c==0 and n.left if at root node
+ if c==0:
+ if n.parent is None:
+ return None
+ if n is n.parent.right:
+ goleft = None
+ if last_left:
+ goleft = last_left.right
+ return self.get_lr(goleft, None)[1]
+ else:
+ return self.get_lr(n.parent.right, None)[1]
+ if c==1:
+ goleft = None
+ if last_left:
+ goleft = last_left.right
+ return self.get_lr(goleft, None)[1]
+ if c==-1:
+ return n
+ return neighbor
+
+ # travels down a single direction to get neighbors
+ def get_lr(self, left, right):
+ lr = [None, None]
+ if left:
+ while left.left:
+ left = left.left
+ lr[1] = left
+ if right:
+ while right.right:
+ right = right.right
+ lr[0] = right
+ return lr
+
+ def contain_p(self, p, lists):
+ c = self.compare_to_key(p)
+ if c==0:
+ if self.left is None and self.right is None:
+ if compare_by_x(p, self.value[1])==0:
+ lists[1].append(self.value)
+ else:
+ lists[0].append(self.value)
+ if self.left:
+ self.left.contain_p(p, lists)
+ if self.right:
+ self.right.contain_p(p, lists)
+ elif c < 0:
+ if self.left:
+ self.left.contain_p(p, lists)
+ else:
+ if self.right:
+ self.right.contain_p(p, lists)
+
+ def find_insert_pt(self, key, seg):
+ if self.left and self.right:
+ if self.compare_to_key(key) == 0 and self.compare_lower(key, seg)==1:
+ return self.right.find_insert_pt(key, seg)
+ elif self.compare_to_key(key) < 1:
+ return self.left.find_insert_pt(key, seg)
+ else:
+ return self.right.find_insert_pt(key, seg)
+ # this case only happens at root
+ elif self.left:
+ if self.compare_to_key(key) == 0 and self.compare_lower(key, seg)==1:
+ return (self, 'r')
+ elif self.compare_to_key(key) < 1:
+ return self.left.find_insert_pt(key, seg)
+ else:
+ return (self, 'r')
+ else:
+ return (self, 'n')
+
+ # adjusts stored segments in inner nodes
+ def adjust(self):
+ value = self.value
+ m = self.m
+ parent = self.parent
+ node = self
+ # go up left as much as possible
+ while parent and node is parent.right:
+ node = parent
+ parent = node.parent
+ # parent to adjust will be on the immediate right
+ if parent and node is parent.left:
+ parent.value = value
+ parent.m = m
+
+ def compare_lower(self, p, s2):
+ y = p[1] - 10
+ key = get_x_at(s2, (p[0], y))
+ return self.compare_to_key(key)
+
+ # returns matching leaf node, or None if no match
+ # when deleting, you don't delete below--you delete above! so compare lower = -1.
+ def find_delete_pt(self, key, value):
+ if self.left and self.right:
+ # if equal at this pt, and this node's value is less than the seg's slightly above this pt
+ if self.compare_to_key(key) == 0 and self.compare_lower(key, value)==-1:
+ return self.right.find_delete_pt(key, value)
+ if self.compare_to_key(key) < 1:
+ return self.left.find_delete_pt(key, value)
+ else:
+ return self.right.find_delete_pt(key, value)
+ elif self.left:
+ if self.compare_to_key(key) < 1:
+ return self.left.find_delete_pt(key, value)
+ else:
+ return None
+ # is leaf
+ else:
+ if self.compare_to_key(key)==0 and segs_equal(self.value, value):
+ return self
+ else:
+ return None
+
+ # also prints depth of each node
+ def print_tree(self, l=0):
+ l += 1
+ if self.left:
+ self.left.print_tree(l)
+ if self.left or self.right:
+ print('INTERNAL: {0}'.format(l))
+ else:
+ print('LEAF: {0}'.format(l))
+ print(self)
+ print(self.value)
+ if self.right:
+ self.right.print_tree(l)
diff --git a/pauvre/lsi/__init__.py b/pauvre/lsi/__init__.py
new file mode 100644
index 0000000..e69de29
--- /dev/null
+++ b/pauvre/lsi/__init__.py
diff --git a/pauvre/lsi/helper.py b/pauvre/lsi/helper.py
new file mode 100755
index 0000000..dac30ec
--- /dev/null
+++ b/pauvre/lsi/helper.py
@@ -0,0 +1,94 @@
+# Helper functions for use in the lsi implementation.
+
+ev = 0.0000001
+# floating-point comparison
+def approx_equal(a, b, tol):
+ return abs(a - b) < tol
+
+# compares x-values of two pts
+# used for ordering in T
+def compare_by_x(k1, k2):
+ if approx_equal(k1[0], k2[0], ev):
+ return 0
+ elif k1[0] < k2[0]:
+ return -1
+ else:
+ return 1
+
+# higher y value is "less"; if y value equal, lower x value is "less"
+# used for ordering in Q
+def compare_by_y(k1, k2):
+ if approx_equal(k1[1], k2[1], ev):
+ if approx_equal(k1[0], k2[0], ev):
+ return 0
+ elif k1[0] < k2[0]:
+ return -1
+ else:
+ return 1
+ elif k1[1] > k2[1]:
+ return -1
+ else:
+ return 1
+
+# tests if s0 and s1 represent the same segment (i.e. pts can be in 2 different orders)
+def segs_equal(s0, s1):
+ x00 = s0[0][0]
+ y00 = s0[0][1]
+ x01 = s0[1][0]
+ y01 = s0[1][1]
+ x10 = s1[0][0]
+ y10 = s1[0][1]
+ x11 = s1[1][0]
+ y11 = s1[1][1]
+ if (approx_equal(x00, x10, ev) and approx_equal(y00, y10, ev)):
+ if (approx_equal(x01, x11, ev) and approx_equal(y01, y11, ev)):
+ return True
+ if (approx_equal(x00, x11, ev) and approx_equal(y00, y11, ev)):
+ if (approx_equal(x01, x10, ev) and approx_equal(y01, y10, ev)):
+ return True
+ return False
+
+# get m for a given seg in (p1, p2) form
+def get_slope(s):
+ x0 = s[0][0]
+ y0 = s[0][1]
+ x1 = s[1][0]
+ y1 = s[1][1]
+ if (x1-x0)==0:
+ return None
+ else:
+ return float(y1-y0)/(x1-x0)
+
+# given a point p, return the point on s that shares p's y-val
+def get_x_at(s, p):
+ m = get_slope(s)
+ # TODO: this should check if p's x-val is octually on seg; we're assuming
+ # for now that it would have been deleted already if not
+ if m == 0: # horizontal segment
+ return p
+ # ditto; should check if y-val on seg
+ if m is None: # vertical segment
+ return (s[0][0], p[1])
+ x1 = s[0][0]-(s[0][1]-p[1])/m
+ return (x1, p[1])
+
+# returns the point at which two line segments intersect, or None if no intersection.
+def intersect(seg1, seg2):
+ p = seg1[0]
+ r = (seg1[1][0]-seg1[0][0], seg1[1][1]-seg1[0][1])
+ q = seg2[0]
+ s = (seg2[1][0]-seg2[0][0], seg2[1][1]-seg2[0][1])
+ denom = r[0]*s[1]-r[1]*s[0]
+ if denom == 0:
+ return None
+ numer = float(q[0]-p[0])*s[1]-(q[1]-p[1])*s[0]
+ t = numer/denom
+ numer = float(q[0]-p[0])*r[1]-(q[1]-p[1])*r[0]
+ u = numer/denom
+ if (t < 0 or t > 1) or (u < 0 or u > 1):
+ return None
+ x = p[0]+t*r[0]
+ y = p[1]+t*r[1]
+ return (x, y)
+
+
diff --git a/pauvre/lsi/lsi.py b/pauvre/lsi/lsi.py
new file mode 100755
index 0000000..cb2f533
--- /dev/null
+++ b/pauvre/lsi/lsi.py
@@ -0,0 +1,107 @@
+# Implementation of the Bentley-Ottmann algorithm, described in deBerg et al, ch. 2.
+# See README for more information.
+# Author: Sam Lichtenberg
+# Email: splichte@princeton.edu
+# Date: 09/02/2013
+
+from pauvre.lsi.Q import Q
+from pauvre.lsi.T import T
+from pauvre.lsi.helper import *
+
+# "close enough" for floating point
+ev = 0.00000001
+
+# how much lower to get the x of a segment, to determine which of a set of segments is the farthest right/left
+lower_check = 100
+
+# gets the point on a segment at a lower y value.
+def getNextPoint(p, seg, y_lower):
+ p1 = seg[0]
+ p2 = seg[1]
+ if (p1[0]-p2[0])==0:
+ return (p[0]+10, p[1])
+ slope = float(p1[1]-p2[1])/(p1[0]-p2[0])
+ if slope==0:
+ return (p1[0], p[1]-y_lower)
+ y = p[1]-y_lower
+ x = p1[0]-(p1[1]-y)/slope
+ return (x, y)
+
+"""
+for each event point:
+ U_p = segments that have p as an upper endpoint
+ C_p = segments that contain p
+ L_p = segments that have p as a lower endpoint
+"""
+def handle_event_point(p, segs, q, t, intersections):
+ rightmost = (float("-inf"), 0)
+ rightmost_seg = None
+ leftmost = (float("inf"), 0)
+ leftmost_seg = None
+
+ U_p = segs
+ (C_p, L_p) = t.contain_p(p)
+ merge_all = U_p+C_p+L_p
+ if len(merge_all) > 1:
+ intersections[p] = []
+ for s in merge_all:
+ intersections[p].append(s)
+ merge_CL = C_p+L_p
+ merge_UC = U_p+C_p
+ for s in merge_CL:
+ # deletes at a point slightly above (to break ties) - where seg is located in tree
+ # above intersection point
+ t.delete(p, s)
+ # put segments into T based on where they are at y-val just below p[1]
+ for s in merge_UC:
+ n = getNextPoint(p, s, lower_check)
+ if n[0] > rightmost[0]:
+ rightmost = n
+ rightmost_seg = s
+ if n[0] < leftmost[0]:
+ leftmost = n
+ leftmost_seg = s
+ t.insert(p, s)
+
+ # means only L_p -> check newly-neighbored segments
+ if len(merge_UC) == 0:
+ neighbors = (t.get_left_neighbor(p), t.get_right_neighbor(p))
+ if neighbors[0] and neighbors[1]:
+ find_new_event(neighbors[0].value, neighbors[1].value, p, q)
+
+ # of newly inserted pts, find possible intersections to left and right
+ else:
+ left_neighbor = t.get_left_neighbor(p)
+ if left_neighbor:
+ find_new_event(left_neighbor.value, leftmost_seg, p, q)
+ right_neighbor = t.get_right_neighbor(p)
+ if right_neighbor:
+ find_new_event(right_neighbor.value, rightmost_seg, p, q)
+
+def find_new_event(s1, s2, p, q):
+ i = intersect(s1, s2)
+ if i:
+ if compare_by_y(i, p) == 1:
+ if not q.find(i):
+ q.insert(i, [])
+
+# segment is in ((x, y), (x, y)) form
+# first pt in a segment should have higher y-val - this is handled in function
+def intersection(S):
+ s0 = S[0]
+ if s0[1][1] > s0[0][1]:
+ s0 = (s0[1], s0[0])
+ q = Q(s0[0], [s0])
+ q.insert(s0[1], [])
+ intersections = {}
+ for s in S[1:]:
+ if s[1][1] > s[0][1]:
+ s = (s[1], s[0])
+ q.insert(s[0], [s])
+ q.insert(s[1], [])
+ t = T()
+ while q.key:
+ p, segs = q.get_and_del_min()
+ handle_event_point(p, segs, q, t, intersections)
+ return intersections
+
diff --git a/pauvre/lsi/test.py b/pauvre/lsi/test.py
new file mode 100755
index 0000000..8bae308
--- /dev/null
+++ b/pauvre/lsi/test.py
@@ -0,0 +1,110 @@
+# Test file for lsi.
+# Author: Sam Lichtenberg
+# Email: splichte@princeton.edu
+# Date: 09/02/2013
+
+from lsi import intersection
+import random
+import time, sys
+from helper import *
+
+ev = 0.00000001
+
+def scale(i):
+ return float(i)
+
+use_file = None
+try:
+ use_file = sys.argv[2]
+except:
+ pass
+
+if not use_file:
+ S = []
+ for i in range(int(sys.argv[1])):
+ p1 = (scale(random.randint(0, 1000)), scale(random.randint(0, 1000)))
+ p2 = (scale(random.randint(0, 1000)), scale(random.randint(0, 1000)))
+ s = (p1, p2)
+ S.append(s)
+ f = open('input', 'w')
+ f.write(str(S))
+ f.close()
+
+else:
+ f = open(sys.argv[2], 'r')
+ S = eval(f.read())
+
+intersections = []
+seen = []
+vs = False
+hs = False
+es = False
+now = time.time()
+for seg1 in S:
+ if approx_equal(seg1[0][0], seg1[1][0], ev):
+ print 'VERTICAL SEG'
+ print ''
+ print ''
+ vs = True
+ if approx_equal(seg1[0][1], seg1[1][1], ev):
+ print 'HORIZONTAL SEG'
+ print ''
+ print ''
+ hs = True
+ for seg2 in S:
+ if seg1 is not seg2 and segs_equal(seg1, seg2):
+ print 'EQUAL SEGS'
+ print ''
+ print ''
+ es = True
+ if seg1 is not seg2 and (seg2, seg1) not in seen:
+ i = intersect(seg1, seg2)
+ if i:
+ intersections.append((i, [seg1, seg2]))
+ # xpts = [seg1[0][0], seg1[1][0], seg2[0][0], seg2[1][0]]
+ # xpts = sorted(xpts)
+ # if (i[0] <= xpts[2] and i[0] >= xpts[1]:
+ # intersections.append((i, [seg1, seg2]))
+ seen.append((seg1, seg2))
+later = time.time()
+n2time = later-now
+print "Line sweep results:"
+now = time.time()
+lsinters = intersection(S)
+inters = []
+for k, v in lsinters.iteritems():
+ #print '{0}: {1}'.format(k, v)
+ inters.append(k)
+# inters.append(v)
+later = time.time()
+print 'TIME ELAPSED: {0}'.format(later-now)
+print "N^2 comparison results:"
+pts_seen = []
+highestseen = 0
+for i in intersections:
+ seen_already = False
+ seen = 0
+ for p in pts_seen:
+ if approx_equal(i[0][0], p[0], ev) and approx_equal(i[0][1], p[1], ev):
+ seen += 1
+ seen_already = True
+ if seen > highestseen:
+ highestseen = seen
+ if not seen_already:
+ pts_seen.append(i[0])
+ in_k = False
+ for k in inters:
+ if approx_equal(k[0], i[0][0], ev) and approx_equal(k[1], i[0][1], ev):
+ in_k = True
+ if in_k == False:
+ print 'Not in K: {0}: {1}'.format(i[0], i[1])
+# print i
+print highestseen
+print 'TIME ELAPSED: {0}'.format(n2time)
+#print 'Missing from line sweep but in N^2:'
+#for i in seen:
+# matched = False
+print len(lsinters)
+print len(pts_seen)
+if len(lsinters) != len(pts_seen):
+ print 'uh oh!'
diff --git a/pauvre/marginplot.py b/pauvre/marginplot.py
new file mode 100644
index 0000000..53da1ae
--- /dev/null
+++ b/pauvre/marginplot.py
@@ -0,0 +1,401 @@
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-
+
+# pauvre
+# Copyright (c) 2016-2020 Darrin T. Schultz.
+#
+# This file is part of pauvre.
+#
+# pauvre is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+#
+# pauvre is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with pauvre. If not, see <http://www.gnu.org/licenses/>.
+
+import ast
+import matplotlib
+matplotlib.use('Agg')
+import matplotlib.pyplot as plt
+import matplotlib.patches as mplpatches
+from matplotlib.colors import LinearSegmentedColormap
+import numpy as np
+import pandas as pd
+import os.path as opath
+from sys import stderr
+from pauvre.functions import parse_fastq_length_meanqual, print_images, filter_fastq_length_meanqual
+from pauvre.stats import stats
+import pauvre.rcparams as rc
+import logging
+
+# logging
+logger = logging.getLogger('pauvre')
+
+
+def generate_panel(panel_left, panel_bottom, panel_width, panel_height,
+ axis_tick_param='both', which_tick_param='both',
+ bottom_tick_param=True, label_bottom_tick_param=True,
+ left_tick_param=True, label_left_tick_param=True,
+ right_tick_param=False, label_right_tick_param=False,
+ top_tick_param=False, label_top_tick_param=False):
+ """
+ Setting default panel tick parameters. Some of these are the defaults
+ for matplotlib anyway, but specifying them for readability. Here are
+ options and defaults for the parameters used below:
+
+ axis : {'x', 'y', 'both'}; which axis to modify; default = 'both'
+ which : {'major', 'minor', 'both'}; which ticks to modify;
+ default = 'major'
+ bottom, top, left, right : bool or {True, False}; ticks on or off;
+ labelbottom, labeltop, labelleft, labelright : bool or {True, False}
+ """
+
+ # create the panel
+ panel_rectangle = [panel_left, panel_bottom, panel_width, panel_height]
+ panel = plt.axes(panel_rectangle)
+
+ # Set tick parameters
+ panel.tick_params(axis=axis_tick_param,
+ which=which_tick_param,
+ bottom=bottom_tick_param,
+ labelbottom=label_bottom_tick_param,
+ left=left_tick_param,
+ labelleft=label_left_tick_param,
+ right=right_tick_param,
+ labelright=label_right_tick_param,
+ top=top_tick_param,
+ labeltop=label_top_tick_param)
+
+ return panel
+
+
+def _generate_histogram_bin_patches(panel, bins, bin_values, horizontal=True):
+ """This helper method generates the histogram that is added to the panel.
+
+ In this case, horizontal = True applies to the mean quality histogram.
+ So, horizontal = False only applies to the length histogram.
+ """
+ l_width = 0.0
+ f_color = (0.5, 0.5, 0.5)
+ e_color = (0, 0, 0)
+ if horizontal:
+ for step in np.arange(0, len(bin_values), 1):
+ left = bins[step]
+ bottom = 0
+ width = bins[step + 1] - bins[step]
+ height = bin_values[step]
+ hist_rectangle = mplpatches.Rectangle((left, bottom), width, height,
+ linewidth=l_width,
+ facecolor=f_color,
+ edgecolor=e_color)
+ panel.add_patch(hist_rectangle)
+ else:
+ for step in np.arange(0, len(bin_values), 1):
+ left = 0
+ bottom = bins[step]
+ width = bin_values[step]
+ height = bins[step + 1] - bins[step]
+
+ hist_rectangle = mplpatches.Rectangle((left, bottom), width, height,
+ linewidth=l_width,
+ facecolor=f_color,
+ edgecolor=e_color)
+ panel.add_patch(hist_rectangle)
+
+
+def generate_histogram(panel, data_list, max_plot_length, min_plot_length,
+ bin_interval, hist_horizontal=True,
+ left_spine=True, bottom_spine=True,
+ top_spine=False, right_spine=False, x_label=None,
+ y_label=None):
+
+ bins = np.arange(0, max_plot_length, bin_interval)
+
+ bin_values, bins2 = np.histogram(data_list, bins)
+
+ # hist_horizontal is used for quality
+ if hist_horizontal:
+ panel.set_xlim([min_plot_length, max_plot_length])
+ panel.set_ylim([0, max(bin_values * 1.1)])
+ # and hist_horizontal == Fale is for read length
+ else:
+ panel.set_xlim([0, max(bin_values * 1.1)])
+ panel.set_ylim([min_plot_length, max_plot_length])
+
+ # Generate histogram bin patches, depending on whether we're plotting
+ # vertically or horizontally
+ _generate_histogram_bin_patches(panel, bins, bin_values, hist_horizontal)
+
+ panel.spines['left'].set_visible(left_spine)
+ panel.spines['bottom'].set_visible(bottom_spine)
+ panel.spines['top'].set_visible(top_spine)
+ panel.spines['right'].set_visible(right_spine)
+
+ if y_label is not None:
+ panel.set_ylabel(y_label)
+ if x_label is not None:
+ panel.set_xlabel(x_label)
+
+
+def generate_heat_map(panel, data_frame, min_plot_length, min_plot_qual,
+ max_plot_length, max_plot_qual, color, **kwargs):
+ panel.set_xlim([min_plot_qual, max_plot_qual])
+ panel.set_ylim([min_plot_length, max_plot_length])
+
+ if kwargs["kmerdf"]:
+ hex_this = data_frame.query('length<{} and numks<{}'.format(
+ max_plot_length, max_plot_qual))
+ # This single line controls plotting the hex bins in the panel
+ hex_vals = panel.hexbin(hex_this['numks'], hex_this['length'], gridsize=int(np.ceil(max_plot_qual/2)),
+ linewidths=0.0, cmap=color)
+
+ else:
+ hex_this = data_frame.query('length<{} and meanQual<{}'.format(
+ max_plot_length, max_plot_qual))
+
+ # This single line controls plotting the hex bins in the panel
+ hex_vals = panel.hexbin(hex_this['meanQual'], hex_this['length'], gridsize=49,
+ linewidths=0.0, cmap=color)
+ for each in panel.spines:
+ panel.spines[each].set_visible(False)
+
+ counts = hex_vals.get_array()
+ return counts
+
+
+def generate_legend(panel, counts, color):
+
+ # completely custom for more control
+ panel.set_xlim([0, 1])
+ panel.set_ylim([0, 1000])
+ panel.set_yticks([int(x) for x in np.linspace(0, 1000, 6)])
+ panel.set_yticklabels([int(x) for x in np.linspace(0, max(counts), 6)])
+ for i in np.arange(0, 1001, 1):
+ rgba = color(i / 1001)
+ alpha = rgba[-1]
+ facec = rgba[0:3]
+ hist_rectangle = mplpatches.Rectangle((0, i), 1, 1,
+ linewidth=0.0,
+ facecolor=facec,
+ edgecolor=(0, 0, 0),
+ alpha=alpha)
+ panel.add_patch(hist_rectangle)
+ panel.spines['top'].set_visible(False)
+ panel.spines['left'].set_visible(False)
+ panel.spines['bottom'].set_visible(False)
+ panel.yaxis.set_label_position("right")
+ panel.set_ylabel('Number of Reads')
+
+
+def margin_plot(df, **kwargs):
+ rc.update_rcParams()
+
+ # 250, 231, 34 light yellow
+ # 67, 1, 85
+ # R=np.linspace(65/255,1,101)
+ # G=np.linspace(0/255, 231/255, 101)
+ # B=np.linspace(85/255, 34/255, 101)
+ # R=65/255, G=0/255, B=85/255
+ Rf = 65 / 255
+ Bf = 85 / 255
+ pdict = {'red': ((0.0, Rf, Rf),
+ (1.0, Rf, Rf)),
+ 'green': ((0.0, 0.0, 0.0),
+ (1.0, 0.0, 0.0)),
+ 'blue': ((0.0, Bf, Bf),
+ (1.0, Bf, Bf)),
+ 'alpha': ((0.0, 0.0, 0.0),
+ (1.0, 1.0, 1.0))
+ }
+ # Now we will use this example to illustrate 3 ways of
+ # handling custom colormaps.
+ # First, the most direct and explicit:
+ purple1 = LinearSegmentedColormap('Purple1', pdict)
+
+ # set the figure dimensions
+ fig_width = 1.61 * 3
+ fig_height = 1 * 3
+ fig = plt.figure(figsize=(fig_width, fig_height))
+
+ # set the panel dimensions
+ heat_map_panel_width = fig_width * 0.5
+ heat_map_panel_height = heat_map_panel_width * 0.62
+
+ # find the margins to center the panel in figure
+ fig_left_margin = fig_bottom_margin = (1 / 6)
+
+ # lengthPanel
+ length_panel_width = (1 / 8)
+
+ # the color Bar parameters
+ legend_panel_width = (1 / 24)
+
+ # define padding
+ h_padding = 0.02
+ v_padding = 0.05
+
+ # Set whether to include y-axes in histograms
+ if kwargs["Y_AXES"]:
+ length_bottom_spine = True
+ length_bottom_tick = False
+ length_bottom_label = True
+ qual_left_spine = True
+ qual_left_tick = True
+ qual_left_label = True
+ qual_y_label = 'Count'
+ else:
+ length_bottom_spine = False
+ length_bottom_tick = False
+ length_bottom_label = False
+ qual_left_spine = False
+ qual_left_tick = False
+ qual_left_label = False
+ qual_y_label = None
+
+ panels = []
+
+ # Quality histogram panel
+ qual_panel_left = fig_left_margin + length_panel_width + h_padding
+ qual_panel_width = heat_map_panel_width / fig_width
+ qual_panel_height = length_panel_width * fig_width / fig_height
+ qual_panel = generate_panel(qual_panel_left,
+ fig_bottom_margin,
+ qual_panel_width,
+ qual_panel_height,
+ left_tick_param=qual_left_tick,
+ label_left_tick_param=qual_left_label)
+ panels.append(qual_panel)
+
+ # Length histogram panel
+ length_panel_bottom = fig_bottom_margin + qual_panel_height + v_padding
+ length_panel_height = heat_map_panel_height / fig_height
+ length_panel = generate_panel(fig_left_margin,
+ length_panel_bottom,
+ length_panel_width,
+ length_panel_height,
+ bottom_tick_param=length_bottom_tick,
+ label_bottom_tick_param=length_bottom_label)
+ panels.append(length_panel)
+
+ # Heat map panel
+ heat_map_panel_left = fig_left_margin + length_panel_width + h_padding
+ heat_map_panel_bottom = fig_bottom_margin + qual_panel_height + v_padding
+
+ heat_map_panel = generate_panel(heat_map_panel_left,
+ heat_map_panel_bottom,
+ heat_map_panel_width / fig_width,
+ heat_map_panel_height / fig_height,
+ bottom_tick_param=False,
+ label_bottom_tick_param=False,
+ left_tick_param=False,
+ label_left_tick_param=False)
+ panels.append(heat_map_panel)
+ heat_map_panel.set_title(kwargs["title"])
+
+ # Legend panel
+ legend_panel_left = fig_left_margin + length_panel_width + \
+ (heat_map_panel_width / fig_width) + (h_padding * 2)
+ legend_panel_bottom = fig_bottom_margin + qual_panel_height + v_padding
+ legend_panel_height = heat_map_panel_height / fig_height
+ legend_panel = generate_panel(legend_panel_left, legend_panel_bottom,
+ legend_panel_width, legend_panel_height,
+ bottom_tick_param = False,
+ label_bottom_tick_param = False,
+ left_tick_param = False,
+ label_left_tick_param = False,
+ right_tick_param = True,
+ label_right_tick_param = True)
+ panels.append(legend_panel)
+
+ # Set min and max viewing window for length
+ if kwargs["plot_maxlen"]:
+ max_plot_length = kwargs["plot_maxlen"]
+ else:
+ max_plot_length = int(np.percentile(df['length'], 99))
+ min_plot_length = kwargs["plot_minlen"]
+
+ # Set length bin sizes
+ if kwargs["lengthbin"]:
+ length_bin_interval = kwargs["lengthbin"]
+ else:
+ # Dividing by 80 is based on what looks good from experience
+ length_bin_interval = int(max_plot_length / 80)
+
+ # length_bins = np.arange(0, max_plot_length, length_bin_interval)
+
+ # Set max and min viewing window for quality
+ if kwargs["plot_maxqual"]:
+ max_plot_qual = kwargs["plot_maxqual"]
+ elif kwargs["kmerdf"]:
+ max_plot_qual = np.ceil(df["numks"].median() * 2)
+ else:
+ max_plot_qual = max(np.ceil(df['meanQual']))
+ min_plot_qual = kwargs["plot_minqual"]
+
+ # Set qual bin sizes
+ if kwargs["qualbin"]:
+ qual_bin_interval = kwargs["qualbin"]
+ elif kwargs["kmerdf"]:
+ qual_bin_interval = 1
+ else:
+ # again, this is just based on what looks good from experience
+ qual_bin_interval = max_plot_qual / 85
+ qual_bins = np.arange(0, max_plot_qual, qual_bin_interval)
+
+ # Generate length histogram
+ generate_histogram(length_panel, df['length'], max_plot_length, min_plot_length,
+ length_bin_interval, hist_horizontal=False,
+ y_label='Read Length', bottom_spine=length_bottom_spine)
+
+ # Generate quality histogram
+ if kwargs["kmerdf"]:
+ generate_histogram(qual_panel, df['numks'], max_plot_qual, min_plot_qual,
+ qual_bin_interval, x_label='number of kmers',
+ y_label=qual_y_label, left_spine=qual_left_spine)
+ else:
+ generate_histogram(qual_panel, df['meanQual'], max_plot_qual, min_plot_qual,
+ qual_bin_interval, x_label='Phred Quality',
+ y_label=qual_y_label, left_spine=qual_left_spine)
+
+ # Generate heat map
+ counts = generate_heat_map(heat_map_panel, df, min_plot_length, min_plot_qual,
+ max_plot_length, max_plot_qual, purple1, kmerdf = kwargs["kmerdf"])
+
+ # Generate legend
+ generate_legend(legend_panel, counts, purple1)
+
+ # inform the user of the plotting window if not quiet mode
+ if not kwargs["QUIET"]:
+ print("""plotting in the following window:
+ {0} <= Q-score (x-axis) <= {1}
+ {2} <= length (y-axis) <= {3}""".format(
+ min_plot_qual, max_plot_qual, min_plot_length, max_plot_length),
+ file=stderr)
+ # Print image(s)
+ if kwargs["BASENAME"] is None:
+ file_base = opath.splitext(opath.basename(kwargs["fastq"]))[0]
+ else:
+ file_base = kwargs["BASENAME"]
+ print_images(
+ file_base,
+ image_formats=kwargs["fileform"],
+ dpi=kwargs["dpi"],
+ no_timestamp = kwargs["no_timestamp"],
+ transparent=kwargs["TRANSPARENT"])
+
+def run(args):
+ if args.kmerdf:
+ df = pd.read_csv(args.kmerdf, header='infer', sep='\t')
+ df["kmers"] = df["kmers"].apply(ast.literal_eval)
+ else:
+ df = parse_fastq_length_meanqual(args.fastq)
+ df = filter_fastq_length_meanqual(df, args.filt_minlen, args.filt_maxlen,
+ args.filt_minqual, args.filt_maxqual)
+ stats(df, args.fastq, False)
+ margin_plot(df=df.dropna(), **vars(args))
diff --git a/pauvre/pauvre_main.py b/pauvre/pauvre_main.py
new file mode 100644
index 0000000..2a5fb91
--- /dev/null
+++ b/pauvre/pauvre_main.py
@@ -0,0 +1,636 @@
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-
+
+# pauvre - just a pore plotting package
+# Copyright (c) 2016-2017 Darrin T. Schultz. All rights reserved.
+#
+# This file is part of pauvre.
+#
+# pauvre is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+#
+# pauvre is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with pauvre. If not, see <http://www.gnu.org/licenses/>.
+
+# I modeled this code on https://github.com/arq5x/poretools/. Check it out. - DTS
+
+import sys
+import os.path
+import argparse
+
+# pauvre imports
+import pauvre.version
+# This class is used in argparse to expand the ~. This avoids errors caused on
+# some systems.
+
+
+class FullPaths(argparse.Action):
+ """Expand user- and relative-paths"""
+
+ def __call__(self, parser, namespace, values, option_string=None):
+ setattr(namespace, self.dest,
+ os.path.abspath(os.path.expanduser(values)))
+
+
+class FullPathsList(argparse.Action):
+ """Expand user- and relative-paths when a list of paths is passed to the
+ program"""
+
+ def __call__(self, parser, namespace, values, option_string=None):
+ setattr(namespace, self.dest,
+ [os.path.abspath(os.path.expanduser(value)) for value in values])
+
+def run_subtool(parser, args):
+ if args.command == 'browser':
+ import pauvre.browser as submodule
+ elif args.command == 'custommargin':
+ import pauvre.custommargin as submodule
+ elif args.command == 'marginplot':
+ import pauvre.marginplot as submodule
+ elif args.command == 'redwood':
+ import pauvre.redwood as submodule
+ elif args.command == 'stats':
+ import pauvre.stats as submodule
+ elif args.command == 'synplot':
+ import pauvre.synplot as submodule
+ # run the chosen submodule.
+ submodule.run(args)
+
+
+class ArgumentParserWithDefaults(argparse.ArgumentParser):
+ def __init__(self, *args, **kwargs):
+ super(ArgumentParserWithDefaults, self).__init__(*args, **kwargs)
+ self.add_argument("-q", "--quiet", help="Do not output warnings to stderr",
+ action="store_true",
+ dest="QUIET")
+
+def main():
+
+ #########################################
+ # create the top-level parser
+ #########################################
+ parser = argparse.ArgumentParser(
+ prog='pauvre', formatter_class=argparse.ArgumentDefaultsHelpFormatter)
+ parser.add_argument("-v", "--version", help="Installed pauvre version",
+ action="version",
+ version="%(prog)s " + str(pauvre.version.__version__))
+ subparsers = parser.add_subparsers(
+ title='[sub-commands]', dest='command', parser_class=ArgumentParserWithDefaults)
+
+ #########################################
+ # create the individual tool parsers
+ #########################################
+
+ #############
+ # browser
+ #############
+ parser_browser = subparsers.add_parser('browser',
+ help="""an adaptable genome
+ browser with various track types""")
+ parser_browser.add_argument('-c', '--chromosomeid',
+ metavar = "Chr",
+ dest = 'CHR',
+ type = str,
+ help = """The fasta sequence to observe.
+ Use the header name of the fasta file
+ without the '>' character""")
+ parser_browser.add_argument('--dpi',
+ metavar='dpi',
+ default=600,
+ type=int,
+ help="""Change the dpi from the default 600
+ if you need it higher""")
+ parser_browser.add_argument('--fileform',
+ dest='fileform',
+ metavar='STRING',
+ choices=['png', 'pdf', 'eps', 'jpeg', 'jpg',
+ 'pdf', 'pgf', 'ps', 'raw', 'rgba',
+ 'svg', 'svgz', 'tif', 'tiff'],
+ default=['png'],
+ nargs='+',
+ help='Which output format would you like? Def.=png')
+ parser_browser.add_argument("--no_timestamp",
+ action = 'store_true',
+ help="""Turn off time stamps in the filename
+ output.""")
+ parser_browser.add_argument('-o', '--output-base-name',
+ dest='BASENAME',
+ help="""Specify a base name for the output file(
+ s). The input file base name is the
+ default.""")
+ parser_browser.add_argument('--path',
+ type=str,
+ help="""Set an explicit filepath for the output.
+ Only do this if you have selected one output type.""")
+ parser_browser.add_argument('-p', '--plot_commands',
+ dest='CMD',
+ nargs = '+',
+ help="""Write strings here to select what
+ to plot. The format for each track is:
+ <type>:<path>:<style>:<options>
+
+ To plot the reference, the format is:
+ ref:<style>:<options>
+
+ Surround each track string with double
+ quotes and a space between subsequent strings.
+ "bam:mybam.bam:depth" "ref:colorful"
+ """)
+ parser_browser.add_argument('--ratio',
+ nargs = '+',
+ type = float,
+ default=None,
+ help="""Enter the dimensions (arbitrary units)
+ to plot the figure. For example a figure that is
+ seven times wider than tall is:
+ --ratio 7 1""")
+ parser_browser.add_argument('-r', '--reference',
+ metavar='REF',
+ dest = 'REF',
+ action=FullPaths,
+ help='The reference fasta file.')
+ parser_browser.add_argument('--start',
+ metavar='START',
+ dest = 'START',
+ type = int,
+ help="""The start position to observe on the
+ fasta file. Uses 1-based indexing.""")
+ parser_browser.add_argument('--stop',
+ metavar='STOP',
+ dest = 'STOP',
+ type = int,
+ help="""The stop position to observe on the
+ fasta file. Uses 1-based indexing.""")
+ parser_browser.add_argument('-T', '--transparent',
+ action='store_false',
+ help="""Specify this option if you DON'T want a
+ transparent background. Default is on.""")
+ parser_browser.set_defaults(func=run_subtool)
+
+ ################
+ # custommargin
+ ################
+ parser_custmar = subparsers.add_parser('custommargin',
+ help='plot custom marginal histograms of tab-delimited files')
+ parser_custmar.add_argument('--dpi',
+ metavar='dpi',
+ default=600,
+ type=int,
+ help="""Change the dpi from the default 600
+ if you need it higher""")
+ parser_custmar.add_argument('--fileform',
+ dest='fileform',
+ metavar='STRING',
+ choices=['png', 'pdf', 'eps', 'jpeg', 'jpg',
+ 'pdf', 'pgf', 'ps', 'raw', 'rgba',
+ 'svg', 'svgz', 'tif', 'tiff'],
+ default=['png'],
+ nargs='+',
+ help='Which output format would you like? Def.=png')
+ parser_custmar.add_argument('-i', '--input_file',
+ action=FullPaths,
+ help="""A tab-separated file with a header row
+ of column names.""")
+ parser_custmar.add_argument('--xcol',
+ type=str,
+ help="""The column name of the data to plot on
+ the x-axis""")
+ parser_custmar.add_argument('--ycol',
+ type=str,
+ help="""The column name of the data to plot on
+ the y-axis""")
+ parser_custmar.add_argument('-n', '--no_transparent',
+ action='store_false',
+ help="""Specify this option if
+ you don't want a transparent background. Default
+ is on.""")
+ parser_custmar.add_argument("--no_timestamp",
+ action = 'store_true',
+ help="""Turn off time stamps in the filename
+ output.""")
+ parser_custmar.add_argument('-o', '--output_base_name',
+ default = None,
+ help='Specify a base name for the output file('
+ 's). The input file base name is the '
+ 'default.')
+ parser_custmar.add_argument('--plot_max_y',
+ type=int,
+ help="""Sets the maximum viewing area in the
+ length dimension.""")
+ parser_custmar.add_argument('--plot_max_x',
+ type=float,
+ help="""Sets the maximum viewing area in the
+ quality dimension.""")
+ parser_custmar.add_argument('--plot_min_y',
+ type=int,
+ default=0,
+ help="""Sets the minimum viewing area in the
+ length dimension. Default value = 0""")
+ parser_custmar.add_argument('--plot_min_x',
+ type=float,
+ default=0,
+ help="""Sets the minimum viewing area in the
+ quality dimension. Default value = 0""")
+ parser_custmar.add_argument('--square',
+ action='store_true',
+ help="""changes the hexmap into a square
+ map quantized by ints.""")
+ parser_custmar.add_argument('-t', '--title',
+ type = str,
+ help="""This sets the title for the whole plot.
+ Use --title "Crustacean's DNA read quality"
+ if you need single quote or apostrophe
+ inside title.""")
+ parser_custmar.add_argument('--ybin',
+ type=int,
+ help="""This sets the bin size to use for length.""")
+ parser_custmar.add_argument('--xbin',
+ type=float,
+ help="""This sets the bin size to use for quality""")
+ parser_custmar.add_argument('--add-yaxes',
+ dest='Y_AXES',
+ action='store_true',
+ help='Add Y-axes to both marginal histograms.')
+ parser_custmar.set_defaults(func=run_subtool)
+
+
+ #############
+ # marginplot
+ #############
+ parser_mnplot = subparsers.add_parser('marginplot',
+ help='plot a marginal histogram of a fastq file')
+ parser_mnplot.add_argument('--dpi',
+ metavar='dpi',
+ default=600,
+ type=int,
+ help="""Change the dpi from the default 600
+ if you need it higher""")
+ parser_mnplot.add_argument('-f', '--fastq',
+ metavar='FASTQ',
+ action=FullPaths,
+ help='The input FASTQ file.')
+ parser_mnplot.add_argument('--fileform',
+ dest='fileform',
+ metavar='STRING',
+ choices=['png', 'pdf', 'eps', 'jpeg', 'jpg',
+ 'pdf', 'pgf', 'ps', 'raw', 'rgba',
+ 'svg', 'svgz', 'tif', 'tiff'],
+ default=['png'],
+ nargs='+',
+ help='Which output format would you like? Def.=png')
+ parser_mnplot.add_argument('--filt_maxlen',
+ type=int,
+ help="""This sets the max read length filter reads.""")
+ parser_mnplot.add_argument('--filt_maxqual',
+ type=float,
+ help="""This sets the max mean read quality
+ to filter reads.""")
+ parser_mnplot.add_argument('--filt_minlen',
+ type=int,
+ default=0,
+ help="""This sets the min read length to
+ filter reads.""")
+ parser_mnplot.add_argument('--filt_minqual',
+ type=float,
+ default=0,
+ help="""This sets the min mean read quality
+ to filter reads.""")
+ parser_mnplot.add_argument('--kmerdf',
+ type = str,
+ default = None,
+ help = """Pass the filename of a data matrix if
+ you would like to plot read length
+ versus number of kmers in that read. The data matrix
+ is a tab-separated text file with columns
+ "id length numks and kmers", where:
+ <id> = read id
+ <length> = the length of the read
+ <numks> = the number of canonical kmers in the read
+ <kmers> = a list representation of kmers ie ['GAT', 'GTA']""")
+ parser_mnplot.add_argument('-n', '--no_transparent',
+ dest='TRANSPARENT',
+ action='store_false',
+ help="""Specify this option if
+ you don't want a transparent background. Default
+ is on.""")
+ parser_mnplot.add_argument("--no_timestamp",
+ action = 'store_true',
+ help="""Turn off time stamps in the filename
+ output.""")
+ parser_mnplot.add_argument('-o', '--output_base_name',
+ dest='BASENAME',
+ help='Specify a base name for the output file('
+ 's). The input file base name is the '
+ 'default.')
+ parser_mnplot.add_argument('--plot_maxlen',
+ type=int,
+ help="""Sets the maximum viewing area in the
+ length dimension.""")
+ parser_mnplot.add_argument('--plot_maxqual',
+ type=float,
+ help="""Sets the maximum viewing area in the
+ quality dimension.""")
+ parser_mnplot.add_argument('--plot_minlen',
+ type=int,
+ default=0,
+ help="""Sets the minimum viewing area in the
+ length dimension.""")
+ parser_mnplot.add_argument('--plot_minqual',
+ type=float,
+ default=0,
+ help="""Sets the minimum viewing area in the
+ quality dimension.""")
+ parser_mnplot.add_argument('--lengthbin',
+ metavar='LENGTHBIN',
+ type=int,
+ help="""This sets the bin size to use for length.""")
+ parser_mnplot.add_argument('--qualbin',
+ metavar='QUALBIN',
+ type=float,
+ help="""This sets the bin size to use for quality""")
+ parser_mnplot.add_argument('-t', '--title',
+ metavar='TITLE',
+ default='Read length vs mean quality',
+ help="""This sets the title for the whole plot.
+ Use --title "Crustacean's DNA read quality"
+ if you need single quote or apostrophe
+ inside title.""")
+ parser_mnplot.add_argument('-y', '--add-yaxes',
+ dest='Y_AXES',
+ action='store_true',
+ help='Add Y-axes to both marginal histograms.')
+ parser_mnplot.set_defaults(func=run_subtool)
+
+ #############
+ # redwood
+ #############
+ parser_redwood = subparsers.add_parser('redwood',
+ help='make a redwood plot from a bam file')
+ parser_redwood.add_argument('-d', '--doubled',
+ dest='doubled',
+ choices=['main', 'rnaseq'],
+ default=[],
+ nargs='+',
+ help="""If your fasta file was doubled to map
+ circularly, use this flag or you may encounter
+ plotting errors. Accepts multiple arguments.
+ 'main' is for the sam file passed with --sam,
+ 'rnaseq' is for the sam file passed with --rnaseq""")
+ parser_redwood.add_argument('--dpi',
+ metavar='dpi',
+ default=600,
+ type=int,
+ help="""Change the dpi from the default 600
+ if you need it higher""")
+ parser_redwood.add_argument('--fileform',
+ dest='fileform',
+ metavar='STRING',
+ choices=['png', 'pdf', 'eps', 'jpeg', 'jpg',
+ 'pdf', 'pgf', 'ps', 'raw', 'rgba',
+ 'svg', 'svgz', 'tif', 'tiff'],
+ default=['png'],
+ nargs='+',
+ help='Which output format would you like? Def.=png')
+ parser_redwood.add_argument('--gff',
+ metavar='gff',
+ action=FullPaths,
+ help="""The input filepath for the gff annotation
+ to plot""")
+ parser_redwood.add_argument('-I', '--interlace',
+ action='store_true',
+ default=False,
+ help="""Interlace the reads so the pileup plot
+ looks better. Doesn't work currently""")
+ parser_redwood.add_argument('-i', '--invert',
+ action='store_true',
+ default=False,
+ help="""invert the image so that it looks better
+ on a dark background. DOESN'T DO ANYTHING.""")
+ parser_redwood.add_argument('-L', '--log',
+ action='store_true',
+ default=False,
+ help="""Plot the RNAseq track with a log scale""")
+ parser_redwood.add_argument('-M', '--main_bam',
+ metavar='mainbam',
+ action=FullPaths,
+ help="""The input filepath for the bam file to
+ plot. Ideally was plotted with a fasta file that
+ is two copies of the mitochondrial genome
+ concatenated. This should be long reads (ONT, PB)
+ and will be displayed in the interior of the
+ redwood plot.""")
+ parser_redwood.add_argument("--no_timestamp",
+ action = 'store_true',
+ help="""Turn off time stamps in the filename
+ output.""")
+ parser_redwood.add_argument('-o', '--output-base-name',
+ dest='BASENAME',
+ help='Specify a base name for the output file('
+ 's). The input file base name is the '
+ 'default.')
+ parser_redwood.add_argument('--query',
+ dest='query',
+ default=['ALNLEN >= 10000', 'MAPLEN < reflength'],
+ nargs='+',
+ help='Query your reads to change plotting options')
+ parser_redwood.add_argument('-R', '--rnaseq_bam',
+ metavar='rnabam',
+ action=FullPaths,
+ help='The input filepath for the rnaseq bam file to plot')
+ parser_redwood.add_argument('--small_start',
+ dest='small_start',
+ choices=['inside', 'outside'],
+ default='inside',
+ help="""This determines where the shortest of the
+ filtered reads will appear on the redwood plot:
+ on the outside or on the inside? The default
+ option puts the longest reads on the outside and
+ the shortest reads on the inside.""")
+ parser_redwood.add_argument('--sort',
+ dest='sort',
+ choices=['ALNLEN', 'TRULEN', 'MAPLEN', 'POS'],
+ default='ALNLEN',
+ help="""What value to use to sort the order in
+ which the reads are plotted?""")
+ parser_redwood.add_argument('--ticks',
+ type = int,
+ nargs = '+',
+ default = [0, 10, 100, 1000],
+ help="""Specify control for the number of ticks.""")
+ parser_redwood.add_argument('-T', '--transparent',
+ action='store_false',
+ help="""Specify this option if you DON'T want a
+ transparent background. Default is on.""")
+ parser_redwood.set_defaults(func=run_subtool)
+
+ #############
+ # stats
+ #############
+ parser_stats = subparsers.add_parser('stats',
+ help='outputs stats from a fastq file')
+ parser_stats.add_argument('-f', '--fastq',
+ metavar='FASTQ',
+ action=FullPaths,
+ help='The input FASTQ file.')
+ parser_stats.add_argument('-H', '--histogram',
+ action='store_true',
+ help="""Make a histogram of the read lengths and
+ saves it to a new file""")
+ parser_stats.add_argument('--filt_maxlen',
+ type=int,
+ help="""This sets the max read length filter reads.""")
+ parser_stats.add_argument('--filt_maxqual',
+ type=float,
+ help="""This sets the max mean read quality
+ to filter reads.""")
+ parser_stats.add_argument('--filt_minlen',
+ type=int,
+ default=0,
+ help="""This sets the min read length to
+ filter reads.""")
+ parser_stats.add_argument('--filt_minqual',
+ type=float,
+ default=0,
+ help="""This sets the min mean read quality
+ to filter reads.""")
+
+ parser_stats.set_defaults(func=run_subtool)
+
+ #############
+ # synplot
+ #############
+ parser_synplot = subparsers.add_parser('synplot',
+ help="""make a synteny plot from a gff
+ file, protein alignment, and partition
+ file""")
+ parser_synplot.add_argument('--aln_dir',
+ metavar='aln_dir',
+ action=FullPaths,
+ help="""The directory where all the fasta
+ alignments are contained.""")
+ parser_synplot.add_argument('--center_on',
+ type=str,
+ default=None,
+ help="""Centers the plot around the gene that
+ you pass as an argument. For example, if there
+ is a locus called 'COI' in the gff file and in
+ the alignments directory, center using:
+ --center_on COI""")
+ parser_synplot.add_argument('--dpi',
+ metavar='dpi',
+ default=600,
+ type=int,
+ help="""Change the dpi from the default 600
+ if you need it higher""")
+ parser_synplot.add_argument('--fileform',
+ dest='fileform',
+ choices=['png', 'pdf', 'eps', 'jpeg', 'jpg',
+ 'pdf', 'pgf', 'ps', 'raw', 'rgba',
+ 'svg', 'svgz', 'tif', 'tiff'],
+ default=['png'],
+ nargs='+',
+ help='Which output format would you like? Def.=png')
+ parser_synplot.add_argument('--gff_paths',
+ metavar='gff_paths',
+ action=FullPathsList,
+ nargs='+',
+ help="""The input filepath. for the gff annotation
+ to plot. Individual filepaths separated by spaces.
+ For example, --gff_paths sp1.gff sp2.gff sp3.gff""")
+ parser_synplot.add_argument('--gff_labels',
+ metavar='gff_labels',
+ type=str,
+ nargs='+',
+ help="""In case the gff names and sequence names
+ don't match, change the labels that will appear
+ over the text.""")
+ parser_synplot.add_argument("--no_timestamp",
+ action = 'store_true',
+ help="""Turn off time stamps in the filename
+ output.""")
+ parser_synplot.add_argument('--optimum_order',
+ action='store_true',
+ help="""If selected, this doesn't plot the
+ optimum arrangement of things as they are input
+ into gff_paths. Instead, it uses the first gff
+ file as the top-most sequence in the plot, and
+ reorganizes the remaining gff files to minimize
+ the number of intersections.""")
+ parser_synplot.add_argument('-o', '--output-basename',
+ dest='BASENAME',
+ help='Specify a base name for the output file('
+ 's). The input file base name is the '
+ 'default.')
+ parser_synplot.add_argument('--ratio',
+ nargs = '+',
+ type = float,
+ default=None,
+ help="""Enter the dimensions (arbitrary units)
+ to plot the figure. For example a figure that is
+ seven times wider than tall is:
+ --ratio 7 1""")
+ parser_synplot.add_argument('--sandwich',
+ action='store_true',
+ default=False,
+ help="""Put an additional copy of the first gff
+ file on the bottom of the plot for comparison.""")
+ parser_synplot.add_argument('--start_with_aligned_genes',
+ action='store_true',
+ default=False,
+ help="""Minimizes the number of intersections
+ but only selects combos where the first gene in
+ each sequence is aligned.""")
+ parser_synplot.add_argument('--stop_codons',
+ action='store_true',
+ default=True,
+ help="""Performs some internal corrections if
+ the gff annotation includes the stop
+ codons in the coding sequences.""")
+ parser_synplot.add_argument('-T', '--transparent',
+ action='store_false',
+ help="""Specify this option if you DON'T want a
+ transparent background. Default is on.""")
+ parser_synplot.set_defaults(func=run_subtool)
+
+ #######################################################
+ # parse the args and call the selected function
+ #######################################################
+
+ args = parser.parse_args()
+
+ # If there were no args, print the help function
+ if len(sys.argv) == 1:
+ parser.print_help()
+ sys.exit(1)
+
+ # If there were no args, but someone selected a program,
+ # print the program's help.
+ commandDict = {'browser': parser_browser.print_help,
+ 'custommargin': parser_custmar.print_help,
+ 'marginplot': parser_mnplot.print_help,
+ 'redwood': parser_redwood.print_help,
+ 'stats': parser_stats.print_help,
+ 'synplot': parser_synplot.print_help}
+
+ if len(sys.argv) == 2:
+ commandDict[args.command]()
+ sys.exit(1)
+
+ if args.QUIET:
+ logger.setLevel(logging.ERROR)
+
+ try:
+ args.func(parser, args)
+ except IOError as e:
+ if e.errno != 32: # ignore SIGPIPE
+ raise
+
+if __name__ == "__main__":
+ main()
diff --git a/pauvre/rcparams.py b/pauvre/rcparams.py
new file mode 100644
index 0000000..a8791b8
--- /dev/null
+++ b/pauvre/rcparams.py
@@ -0,0 +1,25 @@
+# -*- coding: utf-8 -*-
+
+from matplotlib import rcParams
+
+def update_rcParams():
+ # This mpl style is from the UCSC BME163 class.
+ rcParams.update({
+ 'font.size' : 8.0 ,
+ 'xtick.major.size' : 2 ,
+ 'xtick.major.width' : 0.75 ,
+ 'xtick.labelsize' : 8.0 ,
+ 'xtick.direction' : 'out' ,
+ 'ytick.major.size' : 2 ,
+ 'ytick.major.width' : 0.75 ,
+ 'ytick.labelsize' : 8.0 ,
+ 'ytick.direction' : 'out' ,
+ 'xtick.major.pad' : 2 ,
+ 'xtick.minor.pad' : 2 ,
+ 'ytick.major.pad' : 2 ,
+ 'ytick.minor.pad' : 2 ,
+ 'savefig.dpi' : 601 ,
+ 'axes.linewidth' : 0.75 ,
+ "image.composite_image": False ,
+ "pdf.fonttype" : 42,
+ "ps.fonttype" : 42 })
diff --git a/pauvre/redwood.py b/pauvre/redwood.py
new file mode 100644
index 0000000..059440d
--- /dev/null
+++ b/pauvre/redwood.py
@@ -0,0 +1,1047 @@
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-
+
+# pauvre - just a pore plotting package
+# Copyright (c) 2016-2018 Darrin T. Schultz. All rights reserved.
+# twitter @conchoecia
+#
+# This file is part of pauvre.
+#
+# pauvre is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+#
+# pauvre is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with pauvre. If not, see <http://www.gnu.org/licenses/>.
+
+# SAM/BAM todo
+# i/ for loop took 269 seconds ~ 4.48 minutes
+
+#things to do
+
+# This doesn't make much sense. it is perfectly possible that it could
+# double and logbe both (in reference to how histograms are plotted
+
+# - make each layer operate independently
+# - for each of these, make the program figure out the total length either from the GFF or from the bam file
+# - make the error "Your query was too stringent and no reads resulted..." not give a traceback.
+# - raise the proper error if the sam file has no header.
+# - figure out how to update rcParams every time we run a program
+# - Write a better docstring for how plotArc works.
+# - Write a docstring for seqOrder method. I don't remember what it does
+# - write a better function to get the alignment length of the sam/bam file
+# right now it opens the file twice and only gets the length the first time
+# - drop columns by name, not by column number (samFile.drop...)
+# - Here's another that needs to be fixed: samFile.drop(samFile.columns[3], axis=1, inplace=True)
+# - args
+# - get the filename from args
+# - set up a double-mapped mode to wrap reads around the 0th coordinate for
+# circular assemblies
+# - Make the r-dist something that the user can change.
+# Getting Everything on the Same Track
+# - Make a list of features to plot `plot_order = []` or something like that
+# - First, go through the GFF features and come up with all of the things that
+# AREN'T tRNAs that overlap. Store each set of overlaps as its own set of
+# features. For things with no overlap, add those to the `plot_order` alone
+# - Iterate through all elements of `plot_order`, if all elements are forward
+# (start < stop), then draw the element at the end first with no modification,
+# then for every subsequent element draw a white border around the arrow.
+# The element same element should have the start of its arrow drawn to 1
+# degree before the start of the next one. This will make a chevron pattern
+# That will show both the start and stop of both reads accurately.
+# - If both elements are reverse, then do the same thing, but in reverse
+# - If one element is in reverse, but another is forward, then make all the
+# elements in the set half-width since there are too many possible
+# combinations to code reliably.
+
+import matplotlib
+matplotlib.use('Agg')
+import matplotlib.pyplot as plt
+import matplotlib.patches as patches
+from matplotlib import rcParams
+import platform
+import itertools
+import pandas as pd
+import numpy as np
+import scipy as sp
+import time
+import os, sys
+import warnings
+
+# import the pauvre rcParams
+import pauvre.rcparams as rc
+from pauvre.functions import GFFParse, print_images
+from pauvre.bamparse import BAMParse
+
+# following this tutorial to install helvetica
+# https://github.com/olgabot/sciencemeetproductivity.tumblr.com/blob/master/posts/2012/11/how-to-set-helvetica-as-the-default-sans-serif-font-in.md
+global hfont
+hfont = {'fontname':'Helvetica'}
+
+def plotArc(start_angle, stop_angle, radius, width, **kwargs):
+ """ write a docstring for this function"""
+ numsegments = 100
+ theta = np.radians(np.linspace(start_angle+90, stop_angle+90, numsegments))
+ centerx = 0
+ centery = 0
+ x1 = -np.cos(theta) * (radius)
+ y1 = np.sin(theta) * (radius)
+ stack1 = np.column_stack([x1, y1])
+ x2 = -np.cos(theta) * (radius + width)
+ y2 = np.sin(theta) * (radius + width)
+ stack2 = np.column_stack([np.flip(x2, axis=0), np.flip(y2,axis=0)])
+ #add the first values from the first set to close the polygon
+ np.append(stack2, [[x1[0],y1[0]]], axis=0)
+ arcArray = np.concatenate((stack1,stack2), axis=0)
+ return patches.Polygon(arcArray, True, **kwargs), ((x1, y1), (x2, y2))
+
+def fix_query_reflength(sequence_length, queries, doubled):
+ """
+ arguments:
+ <sequence_length> This is the reference fasta length. It should be 2x the actual
+ length of the reference since this program takes a sam file from
+ a concatenated reference.
+ <queries> This is a list of SQL-type query strings. This is generated
+ from argparse.
+
+ purpose:
+ This function takes in a list of queries to use for read filtering
+ for the redwood plot. It is often not advisable to plot all mapped reads
+ since many of them are too small relative to the reference length. Also,
+ the point of a death star plot is to show continuity of a circular
+ reference, so short reads aren't very helpful there either.
+
+ Currently, this function only recognizes the keyword argument 'reflength'.
+ """
+ if not doubled:
+ sequence_length = int(sequence_length * 2)
+ for i in range(len(queries)):
+ if 'reflength' in queries[i].split():
+ queries[i] = queries[i].replace('reflength', str(int(sequence_length/2)))
+
+def cust_log(base, val):
+ try:
+ #val = np.log(val)/np.log(base)
+ with warnings.catch_warnings():
+ warnings.simplefilter("ignore")
+ val = np.log(val)
+ except:
+ val = 0
+ if np.isinf(val):
+ val = 0
+ return val
+
+def plot_histo(panel, args, angleMap, width_map, start_radius, track_radius,
+ thisLog = False, double = False, ticks = False):
+ """Plots a histogram given a width map. Width map must be the true length
+ of the circular genome"""
+ myPatches = []
+ if double:
+ #plot a line in the middle if doubled to distinguish the center
+ mid_radius = start_radius+((track_radius - (track_radius/100))/2)
+ arc, arcArray = plotArc(start_angle=0, stop_angle=360,
+ radius=mid_radius,
+ width=track_radius/100, fc='yellow')
+ myPatches.append(arc)
+ #this is only 1/100 the width of the band
+
+ for i in range(len(width_map)):
+ iStartIndex = i
+ iStopIndex = i + 1
+ iStartAngle = angleMap[iStartIndex]
+ iStopAngle = angleMap[iStopIndex]
+ # This doesn't make much sense. it is perfectly possible that it could
+ # double and logbe both
+ if double:
+ logwidth = -1 * track_radius * (np.log(width_map[i]-9)/np.log(max(width_map))) * 0.495
+
+ log_start_radius = start_radius+((track_radius - (track_radius/100))/2)
+ width = track_radius * (width_map[i]/max(width_map)) * 0.495
+ normal_start_radius = start_radius+((track_radius - (track_radius/100))/2) + (track_radius/100)
+ #first plot the log inside
+ arc, arcArray = plotArc(start_angle=iStartAngle, stop_angle=iStopAngle,
+ radius=log_start_radius,
+ width=logwidth, fc='black')
+ myPatches.append(arc)
+ # now plot the normal outside
+ arc, arcArray = plotArc(start_angle=iStartAngle, stop_angle=iStopAngle,
+ radius=normal_start_radius,
+ width=width, fc='black')
+ else:
+ if thisLog:
+ base = 10
+ width = track_radius * (cust_log(base, width_map[i])/cust_log(base, max(width_map)))
+
+ else:
+ width = track_radius * (width_map[i]/max(width_map))
+ arc, arcArray = plotArc(start_angle=iStartAngle, stop_angle=iStopAngle,
+ radius=start_radius,
+ width=width, fc='black')
+ myPatches.append(arc)
+
+ maxx = start_radius
+ maxy = 0
+ kerning = 12
+ if ticks != None and len(ticks) > 0:
+ # plot the scalebar
+ maxval=max(width_map)
+ xs = []
+ ys = []
+ xend = start_radius
+ value_list = ticks
+ for value in value_list:
+ centerAngle = 45
+ this_radius = start_radius + \
+ (track_radius * (cust_log(10, value)/cust_log(10, maxval)))
+ print("this_radius", this_radius)
+ arc, arcArray = plotArc(start_angle=centerAngle-1, stop_angle=centerAngle+1,
+ radius=this_radius,
+ width=track_radius/25, fc='red')
+ xs.append(arcArray[0][0][-1])
+ ys.append(arcArray[0][1][-1])
+ myPatches.append(arc)
+
+ middle = np.mean(ys) - (kerning/2)
+ # The tick marks should start slightly below the lowest tick
+ # mark on the track itself and extend upward evenly with text.
+ new_ys = [middle-kerning + (kerning * i)
+ for i in range(len(value_list))]
+ for i in range(len(value_list)):
+ xlist = [xs[i], xend]
+ ylist = [ys[i], new_ys[i]]
+ # plots the lines between the markers and the labels
+ panel.plot(xlist, ylist, ls='--', color='red', lw=0.75)
+ # plots the text labels
+ panel.text(xend, new_ys[i],
+ value_list[i], fontsize = 12,
+ ha='left', va='center',
+ color = 'black', **hfont)
+ #plots the title
+ panel.text(start_radius, new_ys[-1]+kerning,
+ "Read Depth ", fontsize = 10,
+ ha='center', va='center',
+ color = 'black', **hfont)
+
+
+ return myPatches, start_radius + track_radius, panel
+
+def plot_reads(args, angleMap, widthDict, samFiledf, start_radius,
+ doubled = False, collapse = False, track_width=False,
+ track_depth = False, thisLog = False):
+ """this function plots the reads in a sam file.
+ Outputs the patches, and the final_radius
+
+ Author:
+ Darrin T. Schultz (github@conchoecia)
+ """
+
+ # there should be a different width dict with log scale
+ #if args.log and thisLog:
+ # widthDict = {'M':0.8, # match
+ # 'I':0.95, # insertion relative to reference
+ # 'D':0.05, # deletion relative to reference
+ # 'N':0.1, # skipped region from the reference
+ # 'S':0.1, # soft clip, not aligned but still in sam file
+ # 'H':0.1, # hard clip, not aligned and not in sam file
+ # 'P':0.1, # padding (silent deletion from padded reference)
+ # '=':0.1, # sequence match
+ # 'X':0.1} # sequence mismatch
+
+ # clean up the df to reset the number of rows, otherwise there might be
+ # errors in the while loop below
+ samFiledf = samFiledf.reset_index()
+
+ append_radius = 0
+ myPatches = []
+ depth_map = []
+ # if we need to collapse the reads so they aren't 1 read per line, keep track
+ # of plotted depth at a position with depth_map
+ if collapse == True:
+ if doubled:
+ plotted_depth_map = [0] * int((len(angleMap)/2))
+ else:
+ plotted_depth_map = [0] * len(angleMap)
+
+ # If we define a track width, then use the max read depth
+ # of the track to figure out the read_width
+ read_width = 1
+ if track_width:
+ read_width = track_width/track_depth
+
+ # This while loops cycles through the the list until everything is plotted
+ # We start at the first index of the file and count down. If we plot it, then
+ # we remove that read from the dataframe. Once we get to the last index,
+ # we attempt to plot and afterward reset the index of the dataframe and
+ # start from the beginning again.
+ i = 0
+ plotted = False
+ #this is used to dtermine where we are during collapse
+ current_row = 1
+ if args.log and thisLog:
+ current_row = 10
+ skip = False
+ # for printing out the progress of plotting,
+ # get the original number of rows to make placeholders
+ original_rownum_charcount = len(str(len(samFiledf)))
+ direction = 'for'
+ print(samFiledf)
+ while len(samFiledf) > 0:
+ stringTuples = samFiledf.loc[i, 'TUPS']
+ mapLen = samFiledf.loc[i, 'MAPLEN']
+ #print("i: {} and maplen: {}".format(i, mapLen))
+ #Subtract one because BAM uses 1-based indexing but plotting uses 0.
+ # I think I could avoid this in the future by changing the parse
+ start_index = samFiledf.loc[i, 'POS'] - 1
+ start_angle= angleMap[start_index]
+ stop_index = 0
+ stop_angle = angleMap[stop_index]
+ if args.log and thisLog:
+ log = 100
+ read_width = track_width * ((cust_log(log, current_row + 1)/cust_log(log, track_depth + 1))\
+ - (cust_log(log, current_row)/cust_log(log, track_depth + 1)))
+ #print("log track depth: {}".format(np.log10(track_depth + 1)))
+ #print("log current row: {}".format(np.log10(current_row + 1)))
+ #now look in the plotted_depth_map to see if there is already a read
+ # that overlaps with the current read we are considering.
+ if collapse:
+ for collapse_i in range(start_index, start_index+mapLen):
+ #print("looking at {} and found {}".format(collapse_i, plotted_depth_map[collapse_i]))
+ if plotted_depth_map[collapse_i] >= current_row:
+ skip = True
+ break
+ #if we don't skip it, we're gonna plot it!
+ if not skip:
+ # Note that we are plotting something here
+ if collapse:
+ for collapse_i in range(start_index, start_index+mapLen):
+ plotted_depth_map[collapse_i] = plotted_depth_map[collapse_i] + 1
+ for tup in stringTuples:
+ if tup[1] == 'I':
+ #If there is an insertion, back up halfway and make plot the
+ # insertion to visually show a "bulge" with too much sequence.
+ # do not advance the start index to resume normal plotting
+ # after the insertion.
+ iStartIndex = start_index-int(tup[0]/2)
+ iStopIndex = iStartIndex + tup[0]
+ iStartAngle = angleMap[iStartIndex]
+ iStopAngle = angleMap[iStopIndex]
+ arc, arcArray = plotArc(start_angle=iStartAngle, stop_angle=iStopAngle,
+ radius=start_radius + append_radius,
+ width=widthDict[tup[1]] * read_width, fc='black')
+ else:
+ stop_index = start_index + tup[0]
+ stop_angle = angleMap[stop_index]
+ arc, arcArray = plotArc(start_angle=start_angle, stop_angle=stop_angle,
+ radius=start_radius + append_radius,
+ width=widthDict[tup[1]] * read_width, fc='black')
+ start_index = stop_index
+ start_angle = angleMap[start_index]
+ myPatches.append(arc)
+ # If we're not collapsing the reads, we just advance one every row
+ if not collapse:
+ append_radius += read_width
+ plotted = True
+ #if we've ploted something, remove that read and reset the indices
+ # since we will stay at the current index
+ samFiledf = samFiledf.drop(i)
+ samFiledf = samFiledf.reset_index()
+ samFiledf = samFiledf.drop('index', 1)
+ print("(countdown until done) rows: {0:0>{num}} dir: {1}\r".format(
+ len(samFiledf), direction, num= original_rownum_charcount), end="")
+ else:
+ #if we weren't able to plot the current read, we will advance the
+ # index and look for another read to plot
+ # I tried to speed up the algorithm here by making the software 'smart'
+ # about looking forward for the next read, but it was ~50% slower
+ i += 1
+ skip = False
+ # we will only reach this condition if we aren't collapsing, since all
+ # the reads will be removed before getting to this point
+ if i >= len(samFiledf):
+ i = 0
+ #Once we've gone around a complete cycle, we can jump up to start
+ # plotting the next row
+ append_radius += read_width
+ current_row += 1
+ #every time we reset, reorganize so that we're now going in the
+ # opposite direction to avoid skewing all the reads in the forward direction
+ if args.interlace:
+ if direction == 'for':
+ bav = {"by":['POS','MAPLEN'], "asc": [False, False]}
+ direction= 'rev'
+ elif direction == 'rev':
+ bav = {"by":['POS','MAPLEN'], "asc": [True, False]}
+ direction = 'for'
+ samFiledf.sort_values(by=bav["by"], ascending=bav['asc'],inplace=True)
+ samFiledf.reset_index(inplace=True)
+ samFiledf.drop('index', 1, inplace=True)
+ print("\nfinal row is {}".format(current_row))
+ return myPatches, start_radius + append_radius
+
+def redwood(args):
+ """This function controls all the plotting features for the redwood plot.
+ 1) determine the length of the circular fasta sequence length
+ """
+ # get the plotting options specific to this program.
+ rc.update_rcParams()
+ # This stops numpy from printing numbers in scientific notation.
+ np.set_printoptions(suppress=True)
+
+
+ start = time.time()
+ print(args)
+
+ #---------------------------------------------------------------
+ # 1) determine the length of the circular fasta sequence length
+ #_______________________________________________________________
+ main_doubled = True if 'main' in args.doubled else False
+ # It is fine that we are using a global since this will never be manipulated
+ global sequence_length
+ if args.main_bam:
+ samFile = BAMParse(args.main_bam, doubled = main_doubled)
+ sequence_length = samFile.seqlength
+ filename = samFile.filename
+ else:
+ sequence_length = GFFParse(args.gff).seqlen
+ if sequence_length == 0:
+ raise OSError("""You have used a SAM/BAM file with no header. Please add a header to
+ the file.""")
+
+ # this also needs to be changed depending on if it was a concatenated SAM
+ # if doubled = true, then use linspace between 0,720
+ # if doubled = false, then use linspace between 0, 360
+ # on second thought, it might not be necessary to change this value even
+ # for doubled sequences
+ global angleMap
+ if 'main' in args.doubled:
+ angleMap = np.linspace(0,720,sequence_length)
+ else:
+ angleMap = np.linspace(0,360,sequence_length)
+
+ #these are the line width for the different cigar string flags.
+ # usually, only M, I, D, S, and H appear in bwa mem output
+ widthDict = {'M':0.45, # match
+ 'I':0.9, # insertion relative to reference
+ 'D':0.05, # deletion relative to reference
+ 'N':0.1, # skipped region from the reference
+ 'S':0.1, # soft clip, not aligned but still in sam file
+ 'H':0.1, # hard clip, not aligned and not in sam file
+ 'P':0.1, # padding (silent deletion from padded reference)
+ '=':0.1, # sequence match
+ 'X':0.1} # sequence mismatch
+
+ ##################
+ # make the redwood plot
+ ##################
+
+ figWidth = 5
+ figHeight = 5
+
+ # fig1 is an arbitrary name, only one figure currently
+ fig_1 = plt.figure(figsize=(figWidth, figHeight))
+
+ # This is the width and height of the plot in absolute
+ # values relative to the figWidth and figHeight.
+ circleDiameter = 5.0
+
+ # Center to plot
+ leftMargin = (figWidth - circleDiameter)/2
+ bottomMargin = leftMargin
+
+ # There is one panel in this figure that contains
+ # the concentric circles that represent reads. In addition,
+ # the latest version of this program adds the annotation to the exterior
+ panelCircle = plt.axes([leftMargin/figWidth, #left
+ bottomMargin/figHeight, #bottom
+ circleDiameter/figWidth, #width
+ circleDiameter/figHeight #height
+ ],frameon=False)
+ # Some of these are defaults and redundant, but are included
+ # for readability and convenience.
+ panelCircle.tick_params(axis='both',which='both',\
+ bottom='off', labelbottom='off',\
+ left='off', labelleft='off', \
+ right='off', labelright='off',\
+ top='off', labeltop='off')
+
+ # Simply the list in which patches will be stored.
+ myPatches = []
+
+ # Each read occupies a width of radius = 1. The max width occupied by any
+ # type of match to the read is 0.9 (Insertion). The line width of a match
+ # is 0.45. Look at the `widthDict` dictionary above to see the defined
+ # widths. I chose radius = 15 to start with because it looks decent in most
+ # scenarios, but maybe could use some tweaking in future iterations.
+ start_radius_dict = {1: 10,
+ 20: 15,
+ 30: 20,
+ 35: 90}
+ if args.main_bam:
+ print("The number of rows is: {}".format(len(samFile.features)))
+ for key in sorted(start_radius_dict):
+ if len(samFile.features) >= key:
+ radius = start_radius_dict[key]
+ print("Chose radius: {}".format(radius))
+ else:
+ radius = start_radius_dict[30]
+
+ circle_fontsize = 14
+ panelCircle.text(0, 0, "Position\n(bp)", fontsize = circle_fontsize,
+ ha='center', va='center',
+ color = 'black', **hfont)
+
+ # now plot ticks around the interior, as well as the text
+ if 'main' in args.doubled:
+ real_length = int(sequence_length/2)
+ else:
+ real_length = sequence_length
+ this_len = int(real_length / 1000) * 1000
+ for i in range(0, this_len + 1000, 1000):
+ startAngle = angleMap[i] - 0.75
+ stopAngle = angleMap[i] + 0.75
+ this_radius = radius * 0.93
+ this_width = radius * 0.04
+ arc, arcArray = plotArc(start_angle=startAngle, stop_angle=stopAngle,
+ radius=this_radius,
+ width=this_width, fc='black')
+ myPatches.append(arc)
+ # now plot text if (val/1000) % 2 == 0
+ if (i/1000) % 2 == 0:
+ #the 0.98 gives some float
+ x_pos = -np.cos(np.radians(angleMap[i]+90)) * this_radius
+ y_pos = np.sin(np.radians(angleMap[i]+90)) * this_radius
+ rotation = 0
+ if i == 0:
+ y_pos = np.sin(np.radians(angleMap[i]+90)) * this_radius * 0.98
+ rotation = 0
+ ha = 'center'
+ va = 'top'
+ if (angleMap[i] > 0 and angleMap[i] < 45):
+ x_pos = -np.cos(np.radians(angleMap[i]+90 + 1.5)) * this_radius
+ y_pos = np.sin(np.radians(angleMap[i]+90 + 1.5)) * this_radius
+ rotation = (90 - angleMap[i]) * 0.95
+ ha = 'right'
+ va = 'top'
+ elif (angleMap[i] >= 45 and angleMap[i] < 67.5):
+ rotation = 90 - angleMap[i]
+ ha = 'right'
+ va = 'top'
+ elif (angleMap[i] >= 67.5 and angleMap[i] < 90):
+ # subtracted an addtl degree because it looked bad otherwise
+ x_pos = -np.cos(np.radians(angleMap[i]+90 - 1.75)) * this_radius
+ y_pos = np.sin(np.radians(angleMap[i]+90 - 1.75)) * this_radius
+ rotation = 90 - angleMap[i]
+ ha = 'right'
+ va = 'top'
+ elif (angleMap[i] >= 90 and angleMap[i] < 112.5):
+ # added an addtl degree because it looked bad otherwise
+ x_pos = -np.cos(np.radians(angleMap[i]+90 - 1.75)) * this_radius
+ y_pos = np.sin(np.radians(angleMap[i]+90 - 1.75)) * this_radius
+ rotation = 90 - angleMap[i]
+ ha = 'right'
+ #bottom is not good
+ va = 'center'
+ elif (angleMap[i] >= 112.5 and angleMap[i] < 135):
+ # added an addtl degree because it looked bad otherwise
+ x_pos = -np.cos(np.radians(angleMap[i]+90 + 1.25)) * this_radius
+ y_pos = np.sin(np.radians(angleMap[i]+90 + 1.25)) * this_radius
+ rotation = 90 - angleMap[i]
+ ha = 'right'
+ #bottom not good
+ #center really not good
+ va = 'bottom'
+ elif (angleMap[i] >= 135 and angleMap[i] < 157.5):
+ rotation = 90 - angleMap[i]
+ ha = 'right'
+ va = 'bottom'
+ elif (angleMap[i] >= 157.5 and angleMap[i] < 180):
+ x_pos = -np.cos(np.radians(angleMap[i]+90 - 1.0)) * this_radius
+ y_pos = np.sin(np.radians(angleMap[i]+90 - 1.0)) * this_radius
+ rotation = 90 - angleMap[i]
+ ha = 'right'
+ va = 'bottom'
+ elif (angleMap[i] >= 180 and angleMap[i] < 202.5):
+ # added an addtl degree because it looked bad otherwise
+ x_pos = -np.cos(np.radians(angleMap[i]+90 + 2.0)) * this_radius
+ y_pos = np.sin(np.radians(angleMap[i]+90 + 2.0)) * this_radius
+ rotation = 270 - angleMap[i]
+ ha = 'left'
+ va = 'bottom'
+ elif (angleMap[i] >= 202.5 and angleMap[i] < 225):
+ x_pos = -np.cos(np.radians(angleMap[i]+90 + 2.25)) * this_radius
+ y_pos = np.sin(np.radians(angleMap[i]+90 + 2.25)) * this_radius
+ rotation = 270 - angleMap[i]
+ ha = 'left'
+ va = 'bottom'
+ elif (angleMap[i] >= 225 and angleMap[i] < 247.5):
+ x_pos = -np.cos(np.radians(angleMap[i]+90 - 1.0)) * this_radius
+ y_pos = np.sin(np.radians(angleMap[i]+90 - 1.0)) * this_radius
+ rotation = 270 - angleMap[i]
+ ha = 'left'
+ va = 'bottom'
+ elif (angleMap[i] >= 247.5 and angleMap[i] < 270):
+ x_pos = -np.cos(np.radians(angleMap[i]+90 - 3.0)) * this_radius
+ y_pos = np.sin(np.radians(angleMap[i]+90 - 3.0)) * this_radius
+ rotation = 270 - angleMap[i]
+ ha = 'left'
+ va = 'bottom'
+ elif (angleMap[i] >= 270 and angleMap[i] < 292.5):
+ rotation = 270 - angleMap[i]
+ ha = 'left'
+ va = 'bottom'
+ elif (angleMap[i] >= 292.5 and angleMap[i] < 315):
+ x_pos = -np.cos(np.radians(angleMap[i]+90 + 0.25)) * this_radius
+ y_pos = np.sin(np.radians(angleMap[i]+90 + 0.25)) * this_radius
+ rotation = 270 - angleMap[i]
+ ha = 'left'
+ va = 'top'
+ elif (angleMap[i] >= 315 and angleMap[i] < 337.5):
+ x_pos = -np.cos(np.radians(angleMap[i]+90 - 1.6)) * this_radius
+ y_pos = np.sin(np.radians(angleMap[i]+90 - 1.6)) * this_radius
+ rotation = 270 - angleMap[i]
+ ha = 'left'
+ va = 'top'
+ elif (angleMap[i] >= 337.5 and angleMap[i] < 360):
+ x_pos = -np.cos(np.radians(angleMap[i]+90 - 5.0)) * this_radius
+ y_pos = np.sin(np.radians(angleMap[i]+90 - 5.0)) * this_radius
+ rotation = 270 - angleMap[i]
+ ha = 'left'
+ va = 'top'
+ print(" angleMap: {} value: {} rotation: {}".format(angleMap[i], i, rotation))
+ text = "{}".format(i)
+ if i == 0:
+ text = "1/\n{}".format(real_length)
+ panelCircle.text(x_pos, y_pos, text, fontsize = circle_fontsize,
+ ha=ha, va=va, color = 'black', rotation=rotation, **hfont)
+ # Now add a legend
+
+ # If the user wants to plot long reads, plot them
+ if args.main_bam:
+ #turn the query string into something usable,
+ # get rid of variables from argparse
+ # Make sure we haven't told the program to not query
+ if 'False' not in args.query:
+ doubled = 'main' in args.doubled
+ fix_query_reflength(sequence_length, args.query, doubled)
+ # string all the queries together
+ queryString = " and ".join(args.query)
+ print("You are using this query string to filter reads:\n'{}'".format(queryString))
+ samFile = samFile.features.query(queryString)
+ if len(samFile) == 0:
+ raise IOError("""Your query was too stringent and no reads resulted. Please try
+ again with a less stringent test. Redwood plotter
+ exiting""")
+
+ # now determine how to sort the reads in the order they will be plotted
+ if args.small_start == 'inside':
+ ascend = True
+ elif args.small_start == 'outside':
+ ascend = False
+ samFile = samFile.sort_values(by=args.sort, ascending=ascend)
+ samFile = samFile.reset_index()
+ # It is necessary to drop the column called index, since reset_index()
+ # adds this.
+ samFile = samFile.drop('index', 1)
+
+ #this plots the central rings from the sam file
+ read_patches, radius = plot_reads(args, angleMap, widthDict, samFile,
+ radius, doubled = True, collapse = False)
+ myPatches = myPatches + read_patches
+
+ # if the user would like to plot the annotation, plot it now. In the future,
+ # allow the user to select the order in which the individual elements are
+ # plotted. Since the annotation should have a fixed proportional size to
+ # the circle independent of the number of plotted reads, define a new
+ # radius context.
+ if args.gff:
+ print("the radius at the end of annotation is: {}".format(radius))
+ panelCircle, gff_patches, gff_radius = plot_gff(args, panelCircle, args.gff, radius)
+ myPatches = myPatches + gff_patches
+ radius = gff_radius
+
+ # it is helpful to be able to plot the RNAseq data along with the annotation.
+ # plot that directly outside the annotation
+ if args.rnaseq_bam:
+ print("in RNAseq")
+ rna_doubled = True if 'rnaseq' in args.doubled else False
+ bamobject = BAMParse(args.rnaseq_bam, doubled = rna_doubled)
+ samFile = bamobject.features
+ track_width = radius * 0.15
+ #read_patches, radius = plot_reads(args, angleMap,
+ # widthDict, samFile, radius,
+ # doubled = rna_doubled,
+ # collapse = True, track_width = track_width,
+ # track_depth = max(bamobject.raw_depthmap),
+ # thisLog = True)
+ radius_orig=radius
+ read_patches, radius, panelCircle = plot_histo(panelCircle, args, angleMap,
+ bamobject.get_depthmap(), radius,
+ track_width,
+ thisLog = True,
+ ticks = args.ticks)
+ myPatches = myPatches + read_patches
+ myPatches.append(arc)
+
+ # The numseqs value is used to determine the viewing dimensions
+ # of the circle we will plot. It is scaled with the number of sequences
+ # that will be plotted. This value should probably be set last to
+ # accommodate other tracks, like annotation.
+ min_radius = int(-5 - np.ceil(radius))
+ max_radius = int(5 + np.ceil(radius))
+ panelCircle.set_xlim([min_radius, max_radius])
+ panelCircle.set_ylim([min_radius, max_radius])
+
+ for patch in myPatches:
+ panelCircle.add_patch(patch)
+
+ end = time.time()
+ print(end - start)
+ # Print image(s)
+ if args.BASENAME is None:
+ file_base = "redwood"
+ else:
+ file_base = args.BASENAME
+ print_images(
+ base_output_name=file_base,
+ image_formats=args.fileform,
+ no_timestamp = args.no_timestamp,
+ dpi=args.dpi,
+ transparent=args.transparent)
+
+def feature_set_direction(feature_set_df):
+ """This function determines if the features in the dataframe passed here are
+ all forward, all reverse, or mixed"""
+ all_pos = all(feature_set_df['strand'] == '+')
+ all_neg = all(feature_set_df['strand'] == '-')
+ if all_pos:
+ return '+'
+ elif all_neg:
+ return '-'
+ else:
+ return 'mixed'
+
+def plot_feature(this_feature, colorMap, start_radius,
+ bar_thickness, direction, this_feature_overlaps_feature):
+ """This plots the track for a feature, and if there is something for
+ 'this_feature_overlaps_feature', then there is special processing to
+ add the white bar and the extra slope for the chevron
+ """
+ myPatches = []
+ if this_feature_overlaps_feature.empty:
+ iStartAngle = angleMap[this_feature['start']]
+ iStopAngle = angleMap[this_feature['stop']] - 2
+ arc, arcArray = plotArc(start_angle=iStartAngle,
+ stop_angle=iStopAngle,
+ radius = start_radius, width=bar_thickness,
+ fc=colorMap[this_feature['featType']])
+ myPatches.append(arc)
+ #this bit plots the arrow triangles for the genes.
+ # Right now it makes each arrow only 1 degree in width and uses 100 segments to plot it.
+ # This resolution hasn't given me any artifacts thus far
+ tStartAngle = angleMap[this_feature['stop']]-2
+ tStopAngle = angleMap[this_feature['stop']]
+ angles = np.linspace(tStartAngle,tStopAngle,100)
+ widths = np.linspace(bar_thickness,0,100)
+ for j in range(len(angles)-1):
+ arc, arcArray = plotArc(start_angle=tStartAngle,
+ stop_angle=angles[j+1],
+ radius=start_radius+(bar_thickness-widths[j+1])/2,
+ width=widths[j+1],
+ fc=colorMap[this_feature['featType']])
+ myPatches.append(arc)
+ else:
+ # First, make a solid white bar
+ iStartAngle = angleMap[this_feature['start']]
+ iStopAngle = angleMap[this_feature_overlaps_feature['start']]
+ arc, arcArray = plotArc(start_angle=iStartAngle,
+ stop_angle=iStopAngle,
+ radius = start_radius, width=bar_thickness,
+ fc=colorMap['spacebar'])
+ myPatches.append(arc)
+ # Now, make the actual color bar
+ iStartAngle = angleMap[this_feature['start']]
+ iStopAngle = angleMap[this_feature_overlaps_feature['start']] - 1
+ arc, arcArray = plotArc(start_angle=iStartAngle,
+ stop_angle=iStopAngle,
+ radius = start_radius, width=bar_thickness,
+ fc=colorMap[this_feature['featType']])
+ myPatches.append(arc)
+ #first plot a little pink bar for the outline
+ tStartAngle = angleMap[this_feature_overlaps_feature['start']]
+ tStopAngle = angleMap[this_feature['stop']]+1
+ angles = np.linspace(tStartAngle,tStopAngle,100)
+ widths = np.linspace(bar_thickness,0,100)
+ for j in range(len(angles)-1):
+ arc, arcArray = plotArc(start_angle=tStartAngle,
+ stop_angle=angles[j+1],
+ radius=start_radius+(bar_thickness-widths[j+1])/2,
+ width=widths[j+1],
+ fc=colorMap['spacebar'])
+ myPatches.append(arc)
+ #this bit plots the arrow triangles for the genes.
+ # Right now it makes each arrow only 1 degree in width and uses 100 segments to plot it.
+ # This resolution hasn't given me any artifacts thus far
+ tStartAngle = angleMap[this_feature_overlaps_feature['start']]-1
+ tStopAngle = angleMap[this_feature['stop']]
+ angles = np.linspace(tStartAngle,tStopAngle,100)
+ widths = np.linspace(bar_thickness,0,100)
+ for j in range(len(angles)-1):
+ arc, arcArray = plotArc(start_angle=tStartAngle,
+ stop_angle=angles[j+1],
+ radius=start_radius+(bar_thickness-widths[j+1])/2,
+ width=widths[j+1],
+ fc=colorMap[this_feature['featType']])
+ myPatches.append(arc)
+ return myPatches
+
+def get_angles(name, center_angle, kerning_angle):
+ num_chars = len(name.strip())
+ if num_chars == 2:
+ start_pos = center_angle
+ stop_pos = center_angle + kerning_angle
+ return (start_pos, stop_pos)
+ if num_chars % 2 == 0:
+ start_pos = center_angle - (kerning_angle/2) - ((num_chars - 1)/2)
+ stop_pos = center_angle + (kerning_angle/2) + ((num_chars - 1)/2)
+ else:
+ start_pos = center_angle - (num_chars/2)
+ stop_pos = center_angle + (num_chars/2)
+ angles = np.arange(start_pos, stop_pos + kerning_angle, kerning_angle)
+ return angles
+
+def plot_gff(args, panelCircle, gff_path, radius):
+ #parse the gff file
+ gffParser = GFFParse(gff_path)
+
+ # Because this size should be relative to the circle that it is plotted next
+ # to, define the start_radius as the place to work from, and the width of
+ # each track.
+ start_radius = radius
+ track_width = radius * 0.15
+
+ colorMap = {'gene': 'green', 'CDS': 'green', 'tRNA':'pink', 'rRNA':'red',
+ 'misc_feature':'purple', 'rep_origin':'orange', 'spacebar':'white',
+ 'ORF':'orange'}
+ augment = 0
+ bar_thickness = 0.9 * track_width
+ # return these at the end
+ myPatches=[]
+ plot_order = []
+ # this for loop relies on the gff features to already be sorted
+ i = 0
+ idone = False
+ # we need to filter out the tRNAs since those are plotted last
+ plottable_features = gffParser.features.query("featType != 'tRNA' and featType != 'region' and featType != 'source'")
+ plottable_features = plottable_features.reset_index(drop=True)
+ while idone == False:
+ print("im in the overlap-pairing while loop i={}".format(i))
+ # look ahead at all of the elements that overlap with the ith element
+ jdone = False
+ j = 1
+ this_set_minimum_index = i
+ this_set_maximum_index = i
+ while jdone == False:
+ print("new i= {} j={} len={}".format(i, j, len(plottable_features)))
+ # first make sure that we haven't gone off the end of the dataframe
+ if i+j == len(plottable_features):
+ if i == len(plottable_features)-1:
+ # this is the last analysis, so set idone to true
+ # to finish after this
+ idone = True
+ # the last one can't be in its own group, so just add it solo
+ these_features = plottable_features.loc[this_set_minimum_index:this_set_maximum_index,]
+ plot_order.append(these_features.reset_index(drop=True))
+ break
+ jdone == True
+ else:
+ # if the lmost of the next gene overlaps with the rmost of
+ # the current one, it overlaps and couple together
+ if plottable_features.loc[i+j, 'lmost'] < plottable_features.loc[i, 'rmost']:
+ # note that this feature overlaps with the current
+ this_set_maximum_index = i+j
+ # ... and we need to look at the next in line
+ j += 1
+ else:
+ i += 1 + (this_set_maximum_index - this_set_minimum_index)
+ #add all of the things that grouped together once we don't find any more groups
+ these_features = plottable_features.loc[this_set_minimum_index:this_set_maximum_index,]
+ plot_order.append(these_features.reset_index(drop=True))
+ jdone = True
+
+ for feature_set in plot_order:
+ print(feature_set)
+ direction = feature_set_direction(feature_set)
+ print("direction = {}".format(direction))
+ if direction == '+':
+ for i in range(len(feature_set)-1, -1,-1):
+ print("inside the plot for loop i = {}".format(i))
+ this_feature = feature_set.loc[i,]
+ print("got this single feature")
+ #For the first element, just plot it normally
+ if i == len(feature_set) - 1:
+ #plot the annotation
+ print("Im in the first plot thing")
+ patches = plot_feature(this_feature, colorMap, start_radius,
+ bar_thickness, direction,
+ pd.Series([]))
+ for each in patches:
+ myPatches.append(each)
+ else:
+ print("now I'm plotting the other one")
+ overlapped_feature = feature_set.loc[i+1,]
+ print("this is the overlapped feature")
+ print(overlapped_feature)
+ patches = plot_feature(this_feature, colorMap, start_radius,
+ bar_thickness, direction,
+ overlapped_feature)
+ for each in patches:
+ myPatches.append(each)
+
+ final_radius = start_radius + track_width
+ # Now we add all of the tRNAs to this to plot, do it last to overlay
+ # everything else
+ tRNAs = gffParser.features.query("featType == 'tRNA'")
+ tRNAs.reset_index(inplace=True, drop = True)
+ print(tRNAs)
+ tRNA_bar_thickness = bar_thickness * (0.8)
+ tRNA_start_radius = start_radius + ((tRNA_bar_thickness * (0.2))/2)
+ #tRNA_bar_thickness = bar_thickness
+ #tRNA_start_radius = start_radius
+ print("sequence_length = {}".format(sequence_length))
+ angle_ranges = []
+ for i in range(0,len(tRNAs)):
+ this_feature = tRNAs.loc[i,]
+ min_angle = angleMap[min(this_feature.loc['start'],this_feature.loc['stop'])]
+ max_angle = angleMap[max(this_feature.loc['start'],this_feature.loc['stop'])]
+ angle_ranges.append((min_angle, max_angle))
+ patches = plot_feature(this_feature, colorMap, tRNA_start_radius,
+ tRNA_bar_thickness, direction,
+ pd.Series([]))
+ for each in patches:
+ myPatches.append(each)
+ print("angle ranges of tRNAs")
+ print(angle_ranges)
+
+ # now plot the text of the all the things. do this last
+ # to cover the previous things we plotted
+ for i in range(len(gffParser.features.index)):
+ if gffParser.features.loc[i, 'featType'] not in ['region', 'source', 'tRNA']:
+ angles = []
+ # to plot the centers, first figure out the center of the annotated
+ # gene by subtracting the end from the beginning
+ name = gffParser.features.loc[i,'name']
+ middle_position = int(gffParser.features.loc[i,'center'])
+ # calculate the radius, because it changes depending on the track.
+ # Remember that the radius is the position at the bottom of the track,
+ # so we must use va=bottom.
+ center_radius = radius + (gffParser.features.loc[i,'track'] * track_width) + (bar_thickness/2)
+ # then, figure out the center angle of that position.
+ # I subtract one since I want to center this for the non-arrow
+ # part of the bar.
+ center_angle = angleMap[middle_position]-2
+ count = 0
+ while count < 1:
+ print()
+ print(name)
+ #figure out how many characters to plot
+ kerning_angle = 2.5
+ print("putting in {} into get angles with center angle = {}, kerning_angle = {}".format(name, center_angle, kerning_angle))
+ angles = get_angles(name, center_angle, kerning_angle)
+ char_min_angle = angles[0]-(kerning_angle / 2)
+ char_max_angle = angles[-1] - (kerning_angle / 2)
+ overlapping_angles = []
+ # for every possible tRNA position, see if the minimum or
+ # max angle lies in the range of the text
+ for tRNA_overlap_range in angle_ranges:
+ for tRNA_position in tRNA_overlap_range:
+ if (char_min_angle < tRNA_position and
+ tRNA_position < char_max_angle):
+ #overlapping_angles contains the tRNA overlapping angles now
+ overlapping_angles.append(tRNA_position)
+ print("character angles {}".format(angles))
+ print("overlapping angles {}".format(overlapping_angles))
+ print("in while loop, center = {}".format(center_angle))
+ if len(overlapping_angles) == 0:
+ count += 1
+ else:
+ min_overlapping_angle = min(overlapping_angles)
+ print("min_overlapping_angle = {}".format(min_overlapping_angle))
+ max_overlapping_angle = max(overlapping_angles)
+ print("max_overlapping_angle = {}".format(max_overlapping_angle))
+ gene_start_angle = angleMap[gffParser.features.loc[i,'start']]
+ gene_stop_angle = angleMap[gffParser.features.loc[i,'stop']]
+ start_dif = abs(min_overlapping_angle - gene_start_angle)
+ stop_dif = abs(gene_stop_angle - max_overlapping_angle)
+ print("start_dif = {}, stop_dif = {}".format(start_dif, stop_dif))
+ if start_dif > stop_dif:
+ center_angle = min_overlapping_angle - ((min_overlapping_angle - gene_start_angle)/2)
+ else:
+ # the extra -2 is to compensate for the taper. This only
+ # applies to the end of a gene stop. This code won't
+ # work properly for - strand things currently
+ center_angle = gene_stop_angle - 2 - ((gene_stop_angle - max_overlapping_angle)/2)
+ count += 1
+ print("center angle after assignment = {}".format(center_angle))
+ #There is some overlap in the text and we need to
+ # figure out the min overlap and max overlap. Then figure
+ # out if the angle between min and start or max and end is greater
+ # use whichever is greater as the new place to put the text, then calculate the center
+ angles = get_angles(name, center_angle, kerning_angle)
+ text_angle = angles[-1] - angles[0]
+ # this prints all the characters in their positions and rotates
+ print_count = 0
+ for j in range(len(angles)):
+ this_width = gffParser.features.loc[i,'width']
+ if not text_angle > (angleMap[this_width] - 1):
+ if center_angle > 95 and center_angle < 265:
+ rotation = (-1 * center_angle) - 180
+ this_char = name[len(angles) - 1 - j]
+ else:
+ rotation = -1 * center_angle
+ print("name: {} j: {}".format(name, j))
+ this_char = name[j]
+ this_angle = angles[j]
+ # now calculate the absolute x position
+ x_pos = -np.cos(np.radians(this_angle+90)) * center_radius
+ # now calculate the absolute y position
+ y_pos = np.sin(np.radians(this_angle+90)) * center_radius
+ #rotate the text if necessary
+ panelCircle.text(x_pos, y_pos, this_char, fontsize=10,
+ ha='center', va='center',
+ rotation = rotation,
+ family = 'monospace',
+ color = 'white')
+ # this block handles the case where the text is too small to put
+ # parallel with the mitochondrial circle, so it is perpindicular
+ else:
+ if print_count == 0:
+ if gffParser.features.loc[i,'strand'] == '+':
+ gene_start_angle = angleMap[gffParser.features.loc[i,'start']]
+ gene_stop_angle = angleMap[gffParser.features.loc[i,'stop']]
+ center_angle = gene_stop_angle - ((gene_stop_angle-gene_start_angle)/2)
+ # now calculate the absolute x position
+ x_pos = -np.cos(np.radians(center_angle+90)) * center_radius
+ # now calculate the absolute y position
+ y_pos = np.sin(np.radians(center_angle+90)) * center_radius
+ if (center_angle > 0 and center_angle < 180):
+ rotation = -1 * center_angle + 90
+ else:
+ rotation = -1 * center_angle - 90
+ panelCircle.text(x_pos, y_pos, name, fontsize=6,
+ ha='center', va='center',
+ rotation = rotation,
+ family = 'monospace',
+ color = 'white')
+
+ print_count += 1
+
+ return panelCircle, myPatches, final_radius
+
+def run(args):
+ redwood(args)
diff --git a/pauvre/stats.py b/pauvre/stats.py
new file mode 100644
index 0000000..d1cdbdb
--- /dev/null
+++ b/pauvre/stats.py
@@ -0,0 +1,250 @@
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-
+
+# pauvre - just a pore PhD student's plotting package
+# Copyright (c) 2016-2017 Darrin T. Schultz. All rights reserved.
+#
+# This file is part of pauvre.
+#
+# pauvre is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+#
+# pauvre is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with pauvre. If not, see <http://www.gnu.org/licenses/>.
+
+# TODO: make a function to nicely print out the pandas dataframes
+
+"""
+Program: pauvre stats
+
+example usage/output:
+ - For example, if I want to know the number of bases and reads with AT LEAST
+ PHRED score 5 and AT LEAST length of 500, I can run the program as you
+ see below and look at the cells highlighted with <braces>.
+ - We find that there are 968,653 basepairs contained in 859 reads that
+ fit meanReadQuality >= Q5, readLen >= 500.
+ > pauvre marginplot --fastq miniDSMN15.fastq
+
+ >numReads: 1000
+ >numBasepairs: 1029114
+ >meanLen: 1029.114
+ >medianLen: 875.5
+ >minLen: 11
+ >maxLen: 5337
+ >N50: 1278
+ >L50: 296
+
+ > Basepairs >= bin by mean PHRED and length
+ >minLen Q0 Q5 Q10 Q15 Q17.5 Q20 Q21.5 Q25 Q25.5 Q30
+ > 0 1029114 1010681 935366 429279 143948 25139 3668 2938 2000 0
+ > 500 984212 <968653> 904787 421307 142003 24417 3668 2938 2000 0
+ > 1000 659842 649319 616788 300948 103122 17251 2000 2000 2000 0
+ > et cetera...
+ > Number of reads >= bin by mean Phred+Len
+ >minLen Q0 Q5 Q10 Q15 Q17.5 Q20 Q21.5 Q25 Q25.5 Q30
+ > 0 1000 969 865 366 118 22 3 2 1 0
+ > 500 873 <859> 789 347 113 20 3 2 1 0
+ > 1000 424 418 396 187 62 11 1 1 1 0
+ > et cetera...
+"""
+
+
+from pauvre.functions import parse_fastq_length_meanqual
+import os
+import pandas as pd
+import numpy as np
+
+
+def stats(df, fastqName, histogram=False):
+ """
+ arguments:
+ <df>
+ - a pandas dataframe with cols ['length', 'meanQual'] that contains read
+ length and quality information
+ <fastqName>
+ - just the name of the fastq file. This is used for printing out headers
+ for the data tables.
+
+ purpose:
+ Calculate and output various stats about this fastq file. Currently
+ this program reports:
+ - Total number of reads
+ - Total number of basepairs
+ - mean
+ - median
+ - min
+ - max
+ - N50
+ - A table of num basepairs filtered by min mean PHRED and length
+ - A table of num reads filtered by the same as above ^
+
+ example usage/output:
+ - For example, if I want to know the number of bases and reads with AT LEAST
+ PHRED score 5 and AT LEAST length of 500, I can run the program as you
+ see below and look at the cells highlighted with <braces>.
+ - We find that there are 968,653 basepairs contained in 859 reads that
+ fit meanReadQuality >= Q5, readLen >= 500.
+ > pauvre marginplot --fastq miniDSMN15.fastq
+
+ >numReads: 1000
+ >numBasepairs: 1029114
+ >meanLen: 1029.114
+ >medianLen: 875.5
+ >minLen: 11
+ >maxLen: 5337
+ >N50: 1278
+ >L50: 296
+
+ > Basepairs >= bin by mean PHRED and length
+ >minLen Q0 Q5 Q10 Q15 Q17.5 Q20 Q21.5 Q25 Q25.5 Q30
+ > 0 1029114 1010681 935366 429279 143948 25139 3668 2938 2000 0
+ > 500 984212 <968653> 904787 421307 142003 24417 3668 2938 2000 0
+ > 1000 659842 649319 616788 300948 103122 17251 2000 2000 2000 0
+ > et cetera...
+ > Number of reads >= bin by mean Phred+Len
+ >minLen Q0 Q5 Q10 Q15 Q17.5 Q20 Q21.5 Q25 Q25.5 Q30
+ > 0 1000 969 865 366 118 22 3 2 1 0
+ > 500 873 <859> 789 347 113 20 3 2 1 0
+ > 1000 424 418 396 187 62 11 1 1 1 0
+ > et cetera...
+ """
+ fastqBase = os.path.basename(fastqName)
+
+ analysis_sizes = [0, 1000, 5000, 10000]
+ print_string = ""
+ for this_size in analysis_sizes:
+ these_lengths = df.loc[df["length"] >= this_size, "length"]
+ if len(these_lengths) > 0:
+ print_string += "# Fastq stats for {}, reads >= {}bp\n".format(fastqBase, this_size)
+ print_string += "numReads: {}\n".format(len(these_lengths))
+ print_string += "%totalNumReads: {0:.2f}\n".format((len(these_lengths) / len(df)) * 100)
+ print_string += "numBasepairs: {}\n".format(sum(these_lengths))
+ print_string += "%totalBasepairs: {0:.2f}\n".format(
+ (sum(these_lengths) / sum(df["length"])) * 100)
+ print_string += "meanLen: {}\n".format(np.mean(these_lengths))
+ print_string += "medianLen: {}\n".format(np.median(these_lengths))
+ print_string += "minLen: {}\n".format(min(these_lengths))
+ maxLen = max(these_lengths)
+ print_string += "maxLen: {}\n".format(maxLen)
+
+ # calculate the N50
+ fiftypercent = 0.5 * sum(these_lengths)
+ lenSum = 0
+ count = 0
+ for val in sorted(these_lengths, reverse=True):
+ lenSum += val
+ count += 1
+ if lenSum >= fiftypercent:
+ print_string += "N50: {}\n".format(int(val))
+ print_string += "L50: {}\n".format(count)
+ break
+ print_string += "\n"
+
+ print_string += lengthQual_table(df)
+
+ if histogram: # now make a histogram with read lengths
+ histogram_lengths(df["length"], fastqBase.split('.')[0])
+ print(print_string)
+
+
+def lengthQual_table(df):
+ """Create a table with lengths/basepairs on columns and qualities on rows"""
+ # This block calculates the number of length bins for this data set
+ lengthBinList = []
+ size_map = [(1000, 250),
+ (10000, 500),
+ (40000, 1000),
+ (100000, 5000),
+ (500000, 20000),
+ (1000000, 50000),
+ (10000000000, 100000)]
+ # first, figure out where we will start the table
+ minlen = min(df["length"])
+ current_val = 0
+ firstDone = False
+ for this_max_size, this_size_step in size_map:
+ for this_bin in range(current_val, this_max_size, this_size_step):
+ if minlen < this_bin:
+ if not firstDone:
+ lengthBinList.append(prev)
+ firstDone = True
+ lengthBinList.append(this_bin)
+ prev = this_bin
+ current_val = this_max_size
+ # now figure out the largest bin
+ maxLen = df["length"].max()
+ first_index_gt_maxLen = next(i for i, v in enumerate(lengthBinList) if v > maxLen) + 1
+ lengthBinList = lengthBinList[0:first_index_gt_maxLen]
+
+ qualBinList = []
+ increment_by = 1
+ while len(qualBinList) == 0 or len(qualBinList) > 15:
+ # now set up the bins for mean PHRED
+ minQual = int(np.floor(min(df["meanQual"])))
+ maxQual = int(np.ceil(max(df["meanQual"])))
+ qualBinList = list(np.arange(minQual, maxQual + increment_by, increment_by))
+ increment_by += 0.25
+
+ # now make a table of read lengths
+ bpTots = []
+ readnumTots = []
+ for row in range(len(lengthBinList)):
+ dataNums = []
+ readNums = []
+ for column in range(len(qualBinList)):
+ thisQuery = df.query("length >= {} and meanQual >= {}".format(
+ lengthBinList[row], qualBinList[column]))
+ dataNums.append(sum(thisQuery['length']))
+ readNums.append(len(thisQuery.index))
+ bpTots.append(dataNums)
+ readnumTots.append(readNums)
+
+ tables = {"Basepairs >= bin by mean PHRED and length": bpTots,
+ "Number of reads >= bin by mean Phred+Len": readnumTots}
+ print_table = ""
+ for key in sorted(tables):
+ # make a dataframe of our basepair distribution table
+ dataDf = pd.DataFrame(tables[key], columns=["Q{}".format(x) for x in qualBinList])
+ # add the min lengths as a column
+ dataDf.insert(0, 'minLen', lengthBinList)
+ print_table += pretty_print_table(dataDf, key)
+ return print_table
+
+
+def histogram_lengths(length, name_prefix):
+ """Create a histogram of read counts per length."""
+ counts = length.value_counts().to_frame(name="readCount")
+ counts.index.rename('readLen', inplace=True)
+ counts.sort_index(inplace=True)
+ counts.to_csv("{}.hist.csv".format(name_prefix), index=True)
+
+
+def pretty_print_table(df, title):
+ print_string = ""
+ dataframeStr = df.to_string(index=False)
+ # this is the char width of the whole table printed
+ lendataframeStr = len(dataframeStr.split('\n')[0])
+ # this is the char width of the minlen portion of the printed table
+ minLenLen = len(dataframeStr.split()[0])
+ blank = " " * minLenLen
+ # center the text on this offset as the table title
+ txtoffset = lendataframeStr - minLenLen
+ print_string += "\n{}{:^{offset}}\n".format(
+ blank, title, offset=txtoffset)
+ print_string += dataframeStr + "\n"
+ return print_string
+
+
+def run(args):
+ """This just opens the fastq file and passes the info to the stats() function.
+ This is a wrapper function that is accessed by pauvre_main.
+ Useful since we can call stats() independently from other pauvre programs."""
+ df = parse_fastq_length_meanqual(args.fastq)
+ stats(df, args.fastq, histogram=args.histogram)
diff --git a/pauvre/synplot.py b/pauvre/synplot.py
new file mode 100644
index 0000000..724a187
--- /dev/null
+++ b/pauvre/synplot.py
@@ -0,0 +1,578 @@
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-
+
+# pauvre - just a pore plotting package
+# Copyright (c) 2016-2018 Darrin T. Schultz. All rights reserved.
+#
+# This file is part of pauvre.
+#
+# pauvre is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+#
+# pauvre is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with pauvre. If not, see <http://www.gnu.org/licenses/>.
+
+# TODO
+# import the pauvre rcParams
+# Cleanup everything
+
+import pandas as pd
+pd.set_option('display.max_columns', 500)
+pd.set_option('display.width', 1000)
+import numpy as np
+import os
+import pauvre.rcparams as rc
+from pauvre.functions import GFFParse, print_images, timestamp
+from pauvre import gfftools
+from pauvre.lsi.lsi import intersection
+import progressbar
+import platform
+import sys
+import time
+import warnings
+
+# for the shuffling algorithm
+from itertools import product
+
+# Biopython stuff
+from Bio import SeqIO
+import Bio.SubsMat.MatrixInfo as MI
+
+# following this tutorial to install helvetica
+# https://github.com/olgabot/sciencemeetproductivity.tumblr.com/blob/master/posts/2012/11/how-to-set-helvetica-as-the-default-sans-serif-font-in.md
+global hfont
+hfont = {'fontname':'Helvetica'}
+
+import matplotlib
+matplotlib.use('agg')
+import matplotlib.pyplot as plt
+from matplotlib.colors import LinearSegmentedColormap, Normalize
+import matplotlib.patches as patches
+
+def shuffle_optimize_gffs(args, GFFs):
+ """This function takes in a list of GFF objects and reshuffles the
+ individual files such that the resulting sequence of GFF files has
+ the minimum number of intersections when plotting synteny
+
+ if args.optimum_order, then the program will find the global minimum
+ arrangement using the first GFF file as the anchor.
+
+ if not args.optimum_order, then the program will find the local minimum
+ shuffle between every input pair of GFF files to plot in the best way possible
+ given the input order.
+
+ returns a list of GFF files from which the user can calculate plotting coordinates
+ """
+ # we use the first-input gff as the topmost sequence,
+ # and then find the best synteny match for the remaining sequences
+ shuffled_gffs = []
+ if args.optimum_order:
+ firstgff = GFFs[0]
+ # save the first gff file unadultered
+ shuffled_gffs.append(firstgff)
+ nextgffs = GFFs[1:]
+ while len(nextgffs) > 0:
+ obs_list = []
+ for i in range(len(nextgffs)):
+ # every observation will be stored here as a tuple.
+ # zeroth element is the num intersections with the current gff
+ # first element is the index of nextgffs
+ # second element is the GFF object
+ shuffles = nextgffs[i].shuffle()
+ for k in range(len(shuffles)):
+ shuf = shuffles[k]
+ coords = firstgff.couple(shuf, this_y = 0, other_y = 1)
+ num_inters = len(intersection(coords))
+ obs_list.append((num_inters, i, shuf))
+ #print(obs_list[-1])
+ intersections, gffixs, shufs = zip(*obs_list)
+ # get the index of the shuffled gff with the least number of
+ # intersections to the current one against which we are comparing
+ #print("intersections", intersections)
+ selected_ix = intersections.index(min(intersections))
+ # save this gff to shuffled gffs to use later for plotting
+ shuffled_gffs.append(shufs[selected_ix])
+ # remove the origin of the shuffled gff from nextgffs
+ del nextgffs[gffixs[selected_ix]]
+ #print("global minimum was {} intersections".format(min(intersections)))
+ # now update the firstgff to the latest shuffled one we collected
+ firstgff = shufs[selected_ix]
+ # plot the gff files in the order in which you input them,
+ # but shuffle them to the order with least intersections
+ else:
+ # first we need to find the best arrangement by finding the combinations
+ # that share the most unique genes
+ genes_series = [GFFs[i].get_unique_genes() for i in range(len(GFFs))]
+ combinations_indices = [0]
+ remaining_indices = list(range(1, len(GFFs)))
+ done = False
+ biggest_intersection_index = -1
+ biggest_intersection_value = 0
+ current_remaining_indices_index = 0
+ while not done:
+ #get the len of the intersection
+ #print("combinations_indices: {}".format(combinations_indices))
+ #print("current_remaining_indices_index: {}".format(current_remaining_indices_index))
+ #print("remaining_indices[current_remaining_indices_index]: {}".format(remaining_indices[current_remaining_indices_index]))
+ #print("genes_series[remaining_indices[current_remaining_indices_index]]: {}".format(genes_series[remaining_indices[current_remaining_indices_index]]))
+ this_intersection_value = len(genes_series[combinations_indices[-1]] &\
+ genes_series[remaining_indices[current_remaining_indices_index]])
+ if this_intersection_value > biggest_intersection_value:
+ biggest_intersection_value = this_intersection_value
+ biggest_intersection_index = current_remaining_indices_index
+ if current_remaining_indices_index < len(remaining_indices)-1:
+ current_remaining_indices_index += 1
+ else:
+ combinations_indices.append(remaining_indices[biggest_intersection_index])
+ del remaining_indices[biggest_intersection_index]
+ biggest_intersection_value = 0
+ current_remaining_indices_index = 0
+ biggest_intersection_index = -1
+ if len(remaining_indices) == 0:
+ done = True
+ # The best order of genes with the most shared genes
+ #I don't know if this is really that useful though since many species will overlap.
+ # In a future implementation of this program it might be necessary to do sub-sorting of this list to get the lest number of line intersections
+ print("The best gene combination is {}".format(combinations_indices))
+ # now we rearrange the GFFs to the best order
+ new_GFFs = [GFFs[i] for i in combinations_indices]
+ # If we're adding another copy of the top one, add it here before shuffling
+ if args.sandwich:
+ new_GFFs.append(new_GFFs[0])
+ shuffles = [new_GFFs[i].shuffle() for i in range(len(new_GFFs))]
+ #print([len(shuffles[i]) for i in range(len(shuffles))])
+ cumulative_least_shuffled_value = 999999999999999999999999999999999999
+ bar = progressbar.ProgressBar()
+ for combination in bar(list(product(*shuffles))):
+ num_intersections = []
+ #I have no idea what this list comprehension does anymore.
+ first_genes = [str(combination[i].features[combination[i].features['featType'].isin(['gene', 'rRNA', 'CDS', 'tRNA'])]['name'].head(n=1)).split()[1] for i in range(len(combination))]
+ # skip to the next iteration if all the genes aren't the same
+ if args.start_with_aligned_genes and len(set(first_genes)) != 1:
+ continue
+ for i in range(len(new_GFFs) - 1):
+ j = i + 1
+ #figure out the best shuffle the next sequence
+ coords = combination[i].couple(combination[j], this_y = i, other_y = j)
+ num_intersections.append(len(intersection(coords)))
+ if sum(num_intersections) < cumulative_least_shuffled_value:
+ shuffled_gffs = combination
+ cumulative_least_shuffled_value = sum(num_intersections)
+ print("\nnew fewest global intersections: {}".format(sum(num_intersections)))
+ return shuffled_gffs
+
+def black_colormap():
+ zeroone = np.linspace(0, 1, 100)
+ colorrange = [(0,0,0,x) for x in zeroone]
+ minblosum = min(MI.blosum62.values())
+ maxblosum = max(MI.blosum62.values())
+ colormap = {i: colorrange[int(translate(i, minblosum, maxblosum, 0, 99))]
+ for i in range(minblosum, maxblosum + 1, 1)}
+ return colormap
+
+def translate(value, left_min, left_max, right_min, right_max):
+ """This code maps values from the left range and interpolates to the
+ corresponding range on the right. This is used to translate the amino acid
+ substition matrix scores to a scale between 0 and 1 for making alphamaps.
+
+ I don't know if this works if the directionality of the ranges are swapped.
+ IE [5, -10] mapped to [0, 1]
+
+ args:
+ <value> - the value in [<left_min>:<left_max>] to scale between
+ [<right_min>:<right_max>]
+ <left_min> - the 'min' of the left (source) range
+ <left_max> - the 'max' of the left (source) range
+ <right_min> - the 'min' of the right (target) range
+ <right_max> the 'max' of the right (target) range
+
+ output:
+ the <value>(float) scaled between <right_min> and <right_max>
+ """
+ # Figure out how 'wide' each range is
+ left_span = left_max - left_min
+ right_span = right_max - right_min
+
+ # Convert the left range into a 0-1 range (float)
+ value_scaled = float(value - left_min) / float(left_span)
+
+ # Convert the 0-1 range into a value in the right range.
+ return right_min + (value_scaled * right_span)
+
+def _samplename_warning(samplename, filename):
+ warnings.warn("""
+ There is a sample in your fasta alignments that
+ does not match the samplenames from the gff filenames. Please
+ rename this samplename to not contain any spaces or underscores.
+ IE for sample 'NC016', '>NC_016_-_ND6' will not work but
+ '>NC016_-_ND6' will work.
+
+ Erroneous name: {}
+ File: {}""".format(samplename, os.path.basename(filename)))
+
+def _samplelength_warning(samplename, genename, featType, gfflen, alnlen):
+ raise Warning("""The length of the protein alignment isn't the same as the
+ length in the GFF file for the sample. Maybe you used a sequence in the
+ alignment that is different from the annotation source? Check if the
+ stop codons are deleted/inserted from either the GFF or alignment. The
+ protein alignment length should be 3 less than the gff length if the
+ stop codons were included in the gff annotation.
+
+ Another possibility is that the RNA that is generating this error has
+ post-transcriptional modifications that complete the stop codon. In
+ this case, you can fudge the stop position in the gff file (increase
+ the value by one or two) to make the plotting script run.
+
+ Sample name: {}
+ feat type: {}
+ gene name: {}
+ gff length: {}
+ aln length: {}""".format(samplename, featType, genename, gfflen, alnlen))
+
+def _nosample_warning(samplename, alngenename, gffnames):
+ raise Warning("""One of the gff files doesn't contain a sequence that the
+ alignment file indicates should be present. Either the alignment file
+ is misnamed or the sequence name in the GFF file is not what you
+ intended.
+
+ Sample name: {}
+ aln gene name: {}
+ gff names: {}""".format(samplename, alngenename, gffnames))
+
+def get_alignments(args):
+ """
+ this reads in all the alignments from the fasta directory.
+ """
+ # This is a dict object with key as
+ filelist = {os.path.splitext(x)[0]:os.path.join(os.path.abspath(args.aln_dir), x)
+ for x in os.listdir(args.aln_dir)
+ if os.path.splitext(x)[1]}
+ print("file list is:")
+ print(filelist)
+ # one entry in seq_dict is:
+ # {seqname: {"featType": featType,
+ # "seqs": {samplename: seq},
+ # "indices": {samplename: indices}}
+ seqs_dict = {}
+ # go through every gene in the genelist
+ for genename in filelist:
+ thisFeatType = ""
+ seqs_list = {}
+ indices_list = {}
+ #print("We found the following samplenames: {}".format(args.samplenames), file = sys.stderr)
+ # this block handles reading in the fasta files to interpret the alignments
+ for record in SeqIO.parse(filelist[genename], "fasta"):
+ # get the sample name and make sure that the sample names match
+ samplename = record.id.replace("_", " ").split()[0]
+ #print("Looking at sample: {}".format(samplename), file=sys.stderr)
+ if samplename not in args.samplenames:
+ #if there's a sequence in the fasta that we did not specify
+ # in the command, ignore that sequence
+ _samplename_warning(samplename, filelist[genename])
+ else:
+ # first, get the sample features
+ samplegff = args.samplenames[samplename].features
+ featType = samplegff.loc[samplegff['name'] == genename, 'featType'].to_string().split()[1]
+ # now we determine if this is a prot alignment or a nucleotide aln
+ if featType in ['gene', 'CDS']:
+ final_seq = "".join([x*3 for x in record.seq])
+ elif featType == 'rRNA':
+ final_seq = str(record.seq)
+ # we now need to verify that the protein sequence is
+ # the length of the gene in the gff file. Do this by removing
+ # gaps in the alignment
+ gfffilt = samplegff.loc[samplegff['name'] == genename, 'width']
+ if len(gfffilt) == 0:
+ _nosample_warning(samplename, genename, list(samplegff['name']))
+ gfflen = int(gfffilt)
+ aln = final_seq.replace("-", "")
+ alnlen = len(aln)
+ if gfflen != alnlen:
+ _samplelength_warning(samplename, genename, featType, gfflen, alnlen)
+ # If we've made it this far without any errors, then incorporate the
+ # indices for each index
+ #print("start_index", start_index)
+ final_indices = [-1] * len(final_seq)
+ # up until the next for loop, here we are determining which
+ # direction to move in. Reverse sequences decrease from the start
+ strand = samplegff.loc[samplegff['name'] == genename, 'strand'].to_string().split()[1]
+ if strand == '+':
+ direction = 1
+ start_index = int(samplegff.loc[samplegff['name'] == genename, 'start'])
+ elif strand == '-':
+ direction = -1
+ start_index = int(samplegff.loc[samplegff['name'] == genename, 'stop'])
+ for i in range(len(final_indices)):
+ if final_seq[i] != '-':
+ final_indices[i] = start_index
+ start_index = start_index + (1 * direction)
+ seqs_list[samplename] = final_seq
+ if args.center_on:
+ center_coord = int(args.samplenames[samplename].features.loc[args.samplenames[samplename].features['name'] == args.center_on, 'center'])
+ indices_list[samplename] = np.array(final_indices) - center_coord
+ else:
+ indices_list[samplename] = final_indices
+ thisFeatType = featType
+ seqs_dict[genename] = {"featType": thisFeatType,
+ "seqs": seqs_list,
+ "indices": indices_list}
+ return seqs_dict
+
+
+def plot_synteny(seq1, ind1, seq2, ind2, y1, y2,
+ featType, matrix, cm, seqname):
+ """This function plots all the lines for each"""
+ print("PLOTTING SYNTENY")
+ myPatches = []
+ colormap = {"COX1": '#c0d9ef',
+ "L": '#e8f1df',
+ "I": '#f7dedc',
+ "16S": '#ff2e00',
+ "12S": '#ffc239',
+ "cal": '#ffff54',
+ "COX2": "#7fce66",
+ "ND2": "#00ae60",
+ "COX3": "#00aeec",
+ "ND1": "#006fbb",
+ "*": "#ffffff",
+ "(": "#ded9c5",
+ "Q": "#ffc294",
+ "?": "#b5a2c4",
+ "ND4": "#968b5a",
+ "ND3": "#00fc65",
+ "ND4L": "#00dcf0",
+ "ND6": "#ff994e",
+ "ND5": "#dc31e6",
+ "X": "#d8d8d8",
+ "G": "#abdce7",
+ "CYTB": "#ff0059"}
+
+ for i in range(len(seq1)):
+ feat1 = seq1[i]
+ feat2 = seq2[i]
+ if feat1 != '-' and feat2 != '-':
+ xs = []
+ ys = []
+ xs.append(ind1[i]) # top left
+ ys.append(y1)
+ xs.append(ind1[i] + 1) # top right
+ ys.append(y1)
+ xs.append(ind2[i] + 1) # bottom right
+ ys.append(y2)
+ xs.append(ind2[i]) #bottom left
+ ys.append(y2)
+ xs.append(ind1[i]) #top left
+ ys.append(y1)
+ alpha = 0.5
+ if featType in ['CDS', 'gene']:
+ try:
+ val = matrix[(feat1, feat2)]
+ except:
+ val = matrix[(feat2, feat1)]
+ color = cm[val]
+ alpha = color[-1]
+ elif featType == 'rRNA':
+ if feat1 != feat2:
+ alpha=0
+ color = colormap[seqname]
+ stack1 = np.column_stack([xs, ys])
+ myPatches.append(patches.Polygon(stack1, closed=True,
+ color = color,
+ alpha = alpha,
+ lw=0))
+ return myPatches
+
+def synplot(args):
+ rc.update_rcParams()
+ print(args)
+ GFFs = []
+ for i in range(len(args.gff_paths)):
+ gffpath = args.gff_paths[i]
+ species = ""
+ if args.gff_labels:
+ species = args.gff_labels[i]
+ GFFs.append(GFFParse(gffpath, args.stop_codons, species))
+
+ # find the optimum shuffling pattern
+ # and add a list of samplenames to the args
+ optGFFs = shuffle_optimize_gffs(args, GFFs)
+ # Make a sandwich for a circular comparison
+ setattr(args, 'samplenames', {gff.samplename:gff for gff in optGFFs})
+
+ # now get the cms and normalize
+ #cms, normalize = gen_colormaps()
+ cm = black_colormap()
+
+ ## and we get the protein alignment objects
+ # {seqname: {"featType": featType,
+ # "seqs": {samplename: seq},
+ # "indices": {samplename: indices}}
+ print("getting alignments")
+ seqs_dict = get_alignments(args)
+ print("done getting alignments")
+ print("seqs_dict is:")
+ print(seqs_dict)
+
+ # set the figure dimensions
+ if args.ratio:
+ figWidth = args.ratio[0] + 1
+ figHeight = args.ratio[1] + 1
+ #set the panel dimensions
+ panelWidth = args.ratio[0]
+ panelHeight = args.ratio[1]
+
+ else:
+ figWidth = 2.5*4
+ figHeight = 5
+ #set the panel dimensions
+ panelWidth = 2.5 * 3
+ panelHeight = 1.75
+
+ figure = plt.figure(figsize=(figWidth,figHeight))
+
+
+ #find the margins to center the panel in figure
+ leftMargin = (figWidth - panelWidth)/2
+ bottomMargin = ((figHeight - panelHeight)/2) + 0.25
+
+ panel0=plt.axes([leftMargin/figWidth, #left
+ bottomMargin/figHeight, #bottom
+ panelWidth/figWidth, #width
+ panelHeight/figHeight]) #height
+ panel0.tick_params(axis='both',which='both',\
+ bottom='on', labelbottom='off',\
+ left='off', labelleft='off', \
+ right='off', labelright='off',\
+ top='off', labeltop='off')
+ #turn off some of the axes
+ panel0.spines['top'].set_visible(False)
+ panel0.spines['right'].set_visible(False)
+ panel0.spines['left'].set_visible(False)
+
+ # {seqname: {"featType": featType,
+ # "seqs": {samplename: seq},
+ # "indices": {samplename: indices}}
+ allPatches = []
+ for seqname in seqs_dict:
+ #go through in order
+ print(seqname)
+ for i in range(0, len(optGFFs) - 1):
+ samplei = optGFFs[i].samplename
+ samplej = optGFFs[i+1].samplename
+ if samplei in seqs_dict[seqname]["seqs"].keys() and\
+ samplej in seqs_dict[seqname]["seqs"].keys():
+ featType = seqs_dict[seqname]["featType"]
+ seq1 = seqs_dict[seqname]["seqs"][samplei]
+ ind1 = seqs_dict[seqname]["indices"][samplei]
+ seq2 = seqs_dict[seqname]["seqs"][samplej]
+ ind2 = seqs_dict[seqname]["indices"][samplej]
+ # this is the top one, just leave it at the actual value since
+ # the base of the annotations start on the integer
+ y1 = len(optGFFs) - 1 - i
+ # this needs to be increased by the bar_thickness (0.9 * track_width in this case, or 0.09)
+ y2 = len(optGFFs) - 2 - i
+ myPatches = plot_synteny(seq1, ind1, seq2, ind2, y1, y2,
+ featType, MI.blosum62, cm, seqname)
+ for patch in myPatches:
+ allPatches.append(patch)
+
+ #print("len allPatches", len(allPatches))
+ # this bit plots the simplified lines in the centers
+ ## first we plot all the lines from the centers of matching genes.
+ ## This is temporary. Or maybe it should be a feature
+ #verts = []
+ #for i in range(len(optGFFs) - 1):
+ # j = i + 1
+ # coords = optGFFs[i].couple(optGFFs[j], this_y = len(optGFFs) - i, other_y = len(optGFFs) - i - 1)
+ # for coord in coords:
+ # verts.append(coord)
+
+ #for vert in verts:
+ # xxyy = list(zip(*vert))
+ # panel0.plot(xxyy[0], xxyy[1])
+ # now we plot horizontal lines showing the length of the mitochondrial sequence
+ maxseqlen = 0
+ # this is a heuristic for trackwidth of what looks good in my experience
+ track_multiplier = 0.08
+ if args.ratio:
+ track_width = track_multiplier * panelWidth
+ else:
+ #0.032 if only 3
+ #0.062 if 6
+ track_width = track_multiplier * panelWidth
+ for i in range(len(optGFFs)):
+ gff = optGFFs[i]
+ #print(" - Plotting panels of {}".format(gff), file = sys.stderr)
+ x_offset = 0
+ #print(" - Detecting if centering is on.".format(gff), file = sys.stderr)
+ if args.center_on:
+ x_offset = -1 * int(gff.features.loc[gff.features['name'] == args.center_on, 'center'])
+ gff = gfftools.x_offset_gff(gff, x_offset)
+ #print(" - Centering is on.".format(gff), file = sys.stderr)
+ #print(" - Plotting horizontal portions with gffplot_horizontal.".format(gff), file = sys.stderr)
+ panel0, patches = gfftools.gffplot_horizontal(
+ figure, panel0, args, gff,
+ track_width = track_width,
+ start_y = len(optGFFs) - i - 1 - ((0.9 * track_width)/2),
+ x_offset = x_offset)
+ #print("{} patches came out of gffplot_horizontal()".format(len(patches)))
+ seq_name = gff.features['sequence'].unique()[0]
+ if args.gff_labels:
+ seq_name = "$\it{{{0}}}$".format(gff.species)
+ panel0.text(0 + x_offset, len(optGFFs) - i - 1 + (0.18/2),
+ seq_name, fontsize = 12,
+ ha='left', va='bottom',
+ color = 'black',
+ zorder = 100)
+
+ if gff.seqlen > maxseqlen:
+ maxseqlen = gff.seqlen
+ xs = (1 + x_offset, gff.seqlen + x_offset)
+ #ys = [len(optGFFs) - i - 1 + (0.09/2)]*2
+ ys = [len(optGFFs) - i - 1]*2
+
+ #print(" - Plotting lines.".format(gff), file = sys.stderr)
+ panel0.plot(xs, ys, color='black', zorder = -9)
+ #print(" - Adding patches.".format(gff), file = sys.stderr)
+ #print("Right before adding patches there are {} patches.".format(len(patches)))
+ for i in range(len(patches)):
+ patch = patches[i]
+ allPatches.append(patch)
+ for patch in allPatches:
+ panel0.add_patch(patch)
+ panel0.set_xlabel("position (bp)")
+
+ #panel0.set_xlim([-15000, int(np.ceil(maxseqlen/1000)*1000)])
+ panel0.set_ylim([ 0 - ( (track_width/2) * 1.1 ),
+ len(optGFFs) - 1 + ( (track_width/2) * 1.1 )])
+
+ # This removes the text labels from the plot
+ labels = [item.get_text() for item in panel0.get_xticklabels()]
+ empty_string_labels = ['']*len(labels)
+ print(" - Setting tick labels.".format(gff), file = sys.stderr)
+
+ panel0.set_xticklabels(empty_string_labels)
+
+ # Print image(s)
+ print(" - Running print_images.".format(gff), file = sys.stderr)
+ if args.BASENAME is None:
+ file_base = "synteny"
+ else:
+ file_base = args.BASENAME
+ print_images(
+ base=file_base,
+ image_formats=args.fileform,
+ no_timestamp = args.no_timestamp,
+ dpi=args.dpi,
+ transparent=args.transparent)
+
+
+def run(args):
+ synplot(args)
diff --git a/pauvre/tests/__init__.py b/pauvre/tests/__init__.py
new file mode 100644
index 0000000..e69de29
--- /dev/null
+++ b/pauvre/tests/__init__.py
diff --git a/pauvre/tests/test_synplot.py b/pauvre/tests/test_synplot.py
new file mode 100644
index 0000000..186ef0c
--- /dev/null
+++ b/pauvre/tests/test_synplot.py
@@ -0,0 +1,66 @@
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-
+
+# pauvre - just a pore plotting package
+# Copyright (c) 2016-2018 Darrin T. Schultz. All rights reserved.
+#
+# This file is part of pauvre.
+#
+# pauvre is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+#
+# pauvre is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with pauvre. If not, see <http://www.gnu.org/licenses/>.
+
+import os
+import subprocess as sp
+import unittest
+
+class libSeq_test_case(unittest.TestCase):
+ """Tests that different features of synplot work correctly
+ """
+ def setUp(self):
+ # Here we must safely make the test directory if it does
+ # not exist
+ scriptdir = os.path.dirname(os.path.realpath(__file__))
+ testoutputdir = os.path.join(scriptdir, "testresults")
+ if not os.path.exists(testoutputdir):
+ os.makedirs(testoutputdir)
+ # now we see if the specific subdirectory for the program
+ # we are testing exists
+ self.thisoutdir = os.path.join(testoutputdir, "synplot")
+ if not os.path.exists(self.thisoutdir):
+ os.makedirs(self.thisoutdir)
+ # now we change our working directory to the
+ # specific subdirectory for this file
+
+ self.aln_dir = os.path.join(scriptdir, "testdata/alignments/")
+ self.gff1 = os.path.join(scriptdir, "testdata/gff_files/Bf201706.gff")
+ self.gff2 = os.path.join(scriptdir, "testdata/gff_files/JN392469.gff")
+ self.gff3 = os.path.join(scriptdir, "testdata/gff_files/NC016117.gff")
+
+
+ def test_normal_plotting_scenario(self):
+ """This verifies that the LibSeq class is constructed with all of the
+ parameters that are present in the meraculous config files"""
+ os.chdir(self.thisoutdir)
+ thiscommand = """pauvre synplot --aln_dir {0} \
+ --fileform pdf \
+ --gff_paths {1} {2} {3} \
+ --center_on COX1 \
+ --gff_labels "B. forskalii" "P. bachei" "M. leidyi" """.format(
+ self.aln_dir, self.gff1,
+ self.gff2, self.gff3)
+
+ data = sp.Popen(thiscommand, shell = True,
+ stdout=sp.PIPE)
+ print("The data is:", data.communicate()[0].decode())
+ print("The return code is: ", data.returncode)
+ self.assertEqual(0, int(data.returncode))
diff --git a/pauvre/version.py b/pauvre/version.py
new file mode 100644
index 0000000..3a6a9f4
--- /dev/null
+++ b/pauvre/version.py
@@ -0,0 +1,4 @@
+# -*- coding: utf-8 -*-
+
+__version__ = "0.2.3"
+
diff --git a/scripts/test.sh b/scripts/test.sh
new file mode 100644
index 0000000..d419aea
--- /dev/null
+++ b/scripts/test.sh
@@ -0,0 +1,14 @@
+set -ev
+
+git clone https://github.com/wdecoster/nanotest.git
+
+pauvre -h
+
+pauvre marginplot -h
+pauvre marginplot -f nanotest/reads.fastq.gz
+pauvre marginplot -f nanotest/reads.fastq.gz -t "my title" --filt_maxlen 30000
+pauvre marginplot -f nanotest/reads.fastq.gz -y --filt_minqual 7 --fileform svg
+
+pauvre stats -h
+pauvre stats -f nanotest/reads.fastq.gz
+pauvre stats -f nanotest/reads.fastq.gz -H --filt_maxqual 12
diff --git a/setup.cfg b/setup.cfg
new file mode 100644
index 0000000..8bfd5a1
--- /dev/null
+++ b/setup.cfg
@@ -0,0 +1,4 @@
+[egg_info]
+tag_build =
+tag_date = 0
+
diff --git a/setup.py b/setup.py
new file mode 100755
index 0000000..7e38014
--- /dev/null
+++ b/setup.py
@@ -0,0 +1,70 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+
+# pauvre
+# Copyright (c) 2016-2020 Darrin T. Schultz.
+#
+# This file is part of pauvre.
+#
+# pauvre is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+#
+# pauvre is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with pauvre. If not, see <http://www.gnu.org/licenses/>.
+
+# Tutorials on how to setup python package here:
+# - http://python-packaging.readthedocs.io/en/latest/testing.html
+# - https://jeffknupp.com/blog/2013/08/16/open-sourcing-a-python-project-the-right-way/
+
+import os
+from setuptools import setup, find_packages
+
+version_py = os.path.join(os.path.dirname(__file__), 'pauvre', 'version.py')
+version = open(version_py).read().strip().split('=')[-1].replace('"', '').strip()
+
+
+setup(name='pauvre',
+ version=version,
+ description='Tools for plotting Oxford Nanopore and other long-read data.',
+ long_description="""
+ 'pauvre' is a package for plotting Oxford Nanopore and other long read data.
+ The name means 'poor' in French, a play on words to the oft-used 'pore' prefix
+ for similar packages. This package was designed for python 3, but it might work in
+ python 2. You can visit the gitub page for more detailed information here:
+ https://github.com/conchoecia/pauvre
+ """,
+ url='https://github.com/conchoecia/pauvre',
+ author='Darrin Schultz',
+ author_email='dts@ucsc.edu',
+ classifiers=[
+ 'Development Status :: 2 - Pre-Alpha',
+ 'License :: OSI Approved :: GNU General Public License v3 (GPLv3)',
+ 'Programming Language :: Python :: 3',
+ 'Programming Language :: Python :: 3.6',
+ 'Programming Language :: Python :: 3.7',
+ 'Operating System :: POSIX :: Linux',
+ 'Topic :: Scientific/Engineering :: Bio-Informatics',
+ 'Intended Audience :: Science/Research',
+ ],
+ packages=find_packages(),
+ python_requires='>=3',
+ install_requires=[
+ "matplotlib >= 2.0.2",
+ "biopython >= 1.68",
+ "pandas >= 0.20.1",
+ "numpy >= 1.12.1",
+ "scipy",
+ "scikit-learn",
+ ],
+ entry_points={
+ 'console_scripts': ['pauvre=pauvre.pauvre_main:main'],
+ },
+ zip_safe=False,
+ include_package_data=True)