/* The MIT License Copyright (c) 2023 Genome Research Ltd. Author: Petr Danecek Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. */ /* GFF parsing code refactored from csq.c Things that would be nice to have - dynamic N_REF_PAD - for stop-lost events (also in frameshifts) report the number of truncated aa's - memory could be greatly reduced by indexing gff (but it is quite compact already) - deletions that go beyond transcript boundaries are not checked at sequence level - alloc tscript->ref in hap_finalize, introduce fa_off_beg:16,fa_off_end:16 - see test/csq/ENST00000573314/insertion-overlap.vcf #1476288882 Read about transcript types here http://vega.sanger.ac.uk/info/about/gene_and_transcript_types.html http://www.ensembl.org/info/genome/variation/predicted_data.html https://www.gencodegenes.org/pages/biotypes.html List of supported biotypes antisense IG_C_gene IG_D_gene IG_J_gene IG_LV_gene IG_V_gene lincRNA lncRNA .. generic term for 3prime_overlapping_ncRNA, antisense, bidirectional_promoter_lncRNA, lincRNA, macro_lncRNA, non_coding, processed_transcript, sense_intronic, sense_overlapping macro_lncRNA miRNA misc_RNA Mt_rRNA Mt_tRNA polymorphic_pseudogene processed_transcript protein_coding, mRNA ribozyme rRNA sRNA scRNA scaRNA sense_intronic sense_overlapping snRNA snoRNA TR_C_gene TR_D_gene TR_J_gene TR_V_gene The gff parsing logic We collect features such by combining gff lines A,B,C as follows: A .. gene line with a supported biotype A.ID=~/^gene:/ B .. transcript line referencing A with supported biotype B.ID=~/^transcript:/ && B.Parent=~/^gene:A.ID/ C .. corresponding CDS, exon, and UTR lines: C[3] in {"CDS","exon","three_prime_UTR","five_prime_UTR"} && C.Parent=~/^transcript:B.ID/ For coding biotypes ("protein_coding" or "polymorphic_pseudogene") the complete chain link C -> B -> A is required. For the rest, link B -> A suffices. The supported consequence types, sorted by impact: splice_acceptor_variant .. end region of an intron changed (2bp at the 3' end of an intron) splice_donor_variant .. start region of an intron changed (2bp at the 5' end of an intron) stop_gained .. DNA sequence variant resulting in a stop codon frameshift_variant .. number of inserted/deleted bases not a multiple of three, disrupted translational frame stop_lost .. elongated transcript, stop codon changed start_lost .. the first codon changed inframe_altering .. combination of indels leading to unchanged reading frame and length inframe_insertion .. inserted coding sequence, unchanged reading frame inframe_deletion .. deleted coding sequence, unchanged reading frame missense_variant .. amino acid (aa) change, unchanged length splice_region_variant .. change within 1-3 bases of the exon or 3-8 bases of the intron synonymous_variant .. DNA sequence variant resulting in no amino acid change stop_retained_variant .. different stop codon start_retained_variant .. start codon retained by indel realignment non_coding_variant .. variant in non-coding sequence, such as RNA gene 5_prime_UTR_variant 3_prime_UTR_variant intron_variant .. reported only if none of the above intergenic_variant .. reported only if none of the above The annotation algorithm. The algorithm checks if the variant falls in a region of a supported type. The search is performed in the following order, until a match is found: 1. idx_cds(gf_cds_t) - lookup CDS by position, create haplotypes, call consequences 2. idx_utr(gf_utr_t) - check UTR hits 3. idx_exon(gf_exon_t) - check for splice variants 4. idx_tscript(tscript_t) - check for intronic variants, RNAs, etc. These regidx indexes are created by parsing a gff3 file as follows: 1. create the array "ftr" of all UTR, CDS, exons. This will be processed later and pruned based on transcript types we want to keep. In the same go, create the hash "id2tr" of transcripts to keep (based on biotype) which maps from transcript_id to a transcript. At the same time also build the hash "gid2gene" which maps from gene_id to gf_gene_t pointer. 2. build "idx_cds", "idx_tscript", "idx_utr" and "idx_exon" indexes. Use only features from "ftr" which are present in "id2tr". 3. clean data that won't be needed anymore: ftr, id2tr, gid2gene. Data structures. idx_cds, idx_utr, idx_exon, idx_tscript: as described above, regidx structures for fast lookup of exons/transcripts overlapping a region, the payload is a pointer to tscript.cds */ #ifndef GFF_H__ #define GFF_H__ #include #ifndef __FUNCTION__ # define __FUNCTION__ __func__ #endif // Definition of splice_region, splice_acceptor and splice_donor #define N_SPLICE_DONOR 2 #define N_SPLICE_REGION_EXON 3 #define N_SPLICE_REGION_INTRON 8 #define STRAND_REV 0 #define STRAND_FWD 1 #define TRIM_NONE 0 #define TRIM_5PRIME 1 #define TRIM_3PRIME 2 // GFF line types #define GFF_UNKN_LINE 0 #define GFF_TSCRIPT_LINE 1 #define GFF_GENE_LINE 2 /* Genomic features, for fast lookup by position to overlapping features */ #define GF_coding_bit 6 #define GF_is_coding(x) ((x) & (1<