RNA-seq Tam on Acinetobacter baumannii strain ATCC 19606 CP059040.1 (Data_Tam_RNAseq_2024)

gene_x 0 like s 364 view s

Tags: bacterium, pipeline, RNA-seq

Urine_vs_MHB

AUM_vs_MHB

http://xgenes.com/article/article-content/209/rna-seq-skin-organoids-on-grch38-chrhsv1-final/ http://xgenes.com/article/article-content/157/prepare-virus-gtf-for-nextflow-run/

Methods

Data was processed using nf-core/rnaseq v3.12.0 (doi: https://doi.org/10.5281/zenodo.1400710) of the nf-core collection of workflows (Ewels et al., 2020).

The pipeline was executed with Nextflow v22.10.5 (Di Tommaso et al., 2017) with the following command:

nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results --fasta /home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040.fasta --gff /home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040.gff -profile docker -resume --max_cpus 55 --max_memory 512.GB --max_time 2400.h --save_align_intermeds --save_unaligned --save_reference --aligner star_salmon --gtf_group_features gene_id --gtf_extra_attributes gene_name --featurecounts_group_type gene_biotype --featurecounts_feature_type transcript

  1. Preparing raw data

    1. They are wildtype strains grown in different medium.
    2. AUM - artificial urine medium
    3. Urine - human urine
    4. MHB - Mueller-Hinton broth
    5. AUM(人工尿液培养基):pH值、营养成分、无菌性、渗透压、温度、污染物。
    6. Urine(人类尿液):pH值、比重、温度、污染物、化学成分、微生物负荷。
    7. MHBMueller-Hinton培养基):pH值、无菌性、营养成分、温度、渗透压、抗生素浓度。
    8. mkdir raw_data; cd raw_data
    9. ln -s ../X101SC24105589-Z01-J001/01.RawData/AUM-1/AUM-1_1.fq.gz AUM_r1_R1.fq.gz
    10. ln -s ../X101SC24105589-Z01-J001/01.RawData/AUM-1/AUM-1_2.fq.gz AUM_r1_R2.fq.gz
    11. ln -s ../X101SC24105589-Z01-J001/01.RawData/AUM-2/AUM-2_1.fq.gz AUM_r2_R1.fq.gz
    12. ln -s ../X101SC24105589-Z01-J001/01.RawData/AUM-2/AUM-2_2.fq.gz AUM_r2_R2.fq.gz
    13. ln -s ../X101SC24105589-Z01-J001/01.RawData/AUM-3/AUM-3_1.fq.gz AUM_r3_R1.fq.gz
    14. ln -s ../X101SC24105589-Z01-J001/01.RawData/AUM-3/AUM-3_2.fq.gz AUM_r3_R2.fq.gz
    15. ln -s ../X101SC24105589-Z01-J001/01.RawData/MHB-1/MHB-1_1.fq.gz MHB_r1_R1.fq.gz
    16. ln -s ../X101SC24105589-Z01-J001/01.RawData/MHB-1/MHB-1_2.fq.gz MHB_r1_R2.fq.gz
    17. ln -s ../X101SC24105589-Z01-J001/01.RawData/MHB-2/MHB-2_1.fq.gz MHB_r2_R1.fq.gz
    18. ln -s ../X101SC24105589-Z01-J001/01.RawData/MHB-2/MHB-2_2.fq.gz MHB_r2_R2.fq.gz
    19. ln -s ../X101SC24105589-Z01-J001/01.RawData/MHB-3/MHB-3_1.fq.gz MHB_r3_R1.fq.gz
    20. ln -s ../X101SC24105589-Z01-J001/01.RawData/MHB-3/MHB-3_2.fq.gz MHB_r3_R2.fq.gz
    21. ln -s ../X101SC24105589-Z01-J001/01.RawData/Urine-1/Urine-1_1.fq.gz Urine_r1_R1.fq.gz
    22. ln -s ../X101SC24105589-Z01-J001/01.RawData/Urine-1/Urine-1_2.fq.gz Urine_r1_R2.fq.gz
    23. ln -s ../X101SC24105589-Z01-J001/01.RawData/Urine-2/Urine-2_1.fq.gz Urine_r2_R1.fq.gz
    24. ln -s ../X101SC24105589-Z01-J001/01.RawData/Urine-2/Urine-2_2.fq.gz Urine_r2_R2.fq.gz
    25. ln -s ../X101SC24105589-Z01-J001/01.RawData/Urine-3/Urine-3_1.fq.gz Urine_r3_R1.fq.gz
    26. ln -s ../X101SC24105589-Z01-J001/01.RawData/Urine-3/Urine-3_2.fq.gz Urine_r3_R2.fq.gz
  2. (Optional) using trinity to find the most closely reference

    1. Trinity --seqType fq --max_memory 50G --left trimmed/wt_r1_R1.fastq.gz --right trimmed/wt_r1_R2.fastq.gz --CPU 12
    2. #https://www.genome.jp/kegg/tables/br08606.html#prok
    3. acb KGB Acinetobacter baumannii ATCC 17978 2007 GenBank
    4. abm KGB Acinetobacter baumannii SDF 2008 GenBank
    5. aby KGB Acinetobacter baumannii AYE 2008 GenBank
    6. abc KGB Acinetobacter baumannii ACICU 2008 GenBank
    7. abn KGB Acinetobacter baumannii AB0057 2008 GenBank
    8. abb KGB Acinetobacter baumannii AB307-0294 2008 GenBank
    9. abx KGB Acinetobacter baumannii 1656-2 2012 GenBank
    10. abz KGB Acinetobacter baumannii MDR-ZJ06 2012 GenBank
    11. abr KGB Acinetobacter baumannii MDR-TJ 2012 GenBank
    12. abd KGB Acinetobacter baumannii TCDC-AB0715 2012 GenBank
    13. abh KGB Acinetobacter baumannii TYTH-1 2012 GenBank
    14. abad KGB Acinetobacter baumannii D1279779 2013 GenBank
    15. abj KGB Acinetobacter baumannii BJAB07104 2013 GenBank
    16. abab KGB Acinetobacter baumannii BJAB0715 2013 GenBank
    17. abaj KGB Acinetobacter baumannii BJAB0868 2013 GenBank
    18. abaz KGB Acinetobacter baumannii ZW85-1 2013 GenBank
    19. abk KGB Acinetobacter baumannii AbH12O-A2 2014 GenBank
    20. abau KGB Acinetobacter baumannii AB030 2014 GenBank
    21. abaa KGB Acinetobacter baumannii AB031 2014 GenBank
    22. abw KGB Acinetobacter baumannii AC29 2014 GenBank
    23. abal KGB Acinetobacter baumannii LAC-4 2015 GenBank
    24. #Note that the Acinetobacter baumannii strain ATCC 19606 chromosome, complete genome (GenBank: CP059040.1) was choosen as reference!
  3. Downloading CP059040.fasta and CP059040.gff from GenBank

  4. (Optional) Preparing CP059040.fasta, CP059040_gene.gff3 and CP059040.bed

    1. #Reference genome: https://www.ncbi.nlm.nih.gov/nuccore/CP059040
    2. cp /media/jhuang/Elements2/Data_Tam_RNASeq3/CP059040.fasta . # Elements (Anna C.arnes)
    3. cp /media/jhuang/Elements2/Data_Tam_RNASeq3/CP059040_gene.gff3 .
    4. cp /media/jhuang/Elements2/Data_Tam_RNASeq3/CP059040_gene.gtf .
    5. cp /media/jhuang/Elements2/Data_Tam_RNASeq3/CP059040.bed .
    6. rsync -a -P CP059040.fasta jhuang@hamm:~/DATA/Data_Tam_RNAseq_2024/
    7. rsync -a -P CP059040_gene.gff3 jhuang@hamm:~/DATA/Data_Tam_RNAseq_2024/
    8. rsync -a -P CP059040.bed jhuang@hamm:~/DATA/Data_Tam_RNAseq_2024/
    9. (base) jhuang@WS-2290C:/media/jhuang/Elements2/Data_Tam_RNASeq3$ find . -name "CP059040*"
    10. ./CP059040.fasta
    11. ./CP059040.bed
    12. ./CP059040.gb
    13. ./CP059040.gff3
    14. ./CP059040.gff3_backup
    15. ./CP059040_full.gb
    16. ./CP059040_gene.gff3
    17. ./CP059040_gene.gtf
    18. ./CP059040_gene_old.gff3
    19. ./CP059040_rRNA.gff3
    20. ./CP059040_rRNA_v.gff3
    21. # ---- REF: Acinetobacter baumannii ATCC 17978 (DEBUG, gene_name failed) ----
    22. #gffread -E -F -T GCA_000015425.1_ASM1542v1_genomic.gff -o GCA_000015425.1_ASM1542v1_genomic.gtf_
    23. #grep "CDS" GCA_000015425.1_ASM1542v1_genomic.gtf_ > GCA_000015425.1_ASM1542v1_genomic.gtf
    24. #sed -i -e "s/\tCDS\t/\texon\t/g" GCA_000015425.1_ASM1542v1_genomic.gtf
    25. #gffread -E -F --bed GCA_000015425.1_ASM1542v1_genomic.gtf -o GCA_000015425.1_ASM1542v1_genomic.bed
    26. grep "locus_tag" GCA_000015425.1_ASM1542v1_genomic.gtf_ > GCA_000015425.1_ASM1542v1_genomic.gtf
    27. sed -i -e "s/\ttranscript\t/\texon\t/g" GCA_000015425.1_ASM1542v1_genomic.gtf # or using fc_count_type=transcript
    28. sed -i -e "s/\tgene_name\t/\tName\t/g" GCA_000015425.1_ASM1542v1_genomic.gtf
    29. gffread -E -F --bed GCA_000015425.1_ASM1542v1_genomic.gtf -o GCA_000015425.1_ASM1542v1_genomic.bed
    30. #grep "gene_name" GCA_000015425.1_ASM1542v1_genomic.gtf | wc -l #69=3887-3803
    31. cp CP059040.gff3 CP059040_backup.gff3
    32. sed -i -e "s/\tGenbank\tgene\t/\tGenbank_gene\t/g" CP059040.gff3
    33. grep "Genbank_gene" CP059040.gff3 > CP059040_gene.gff3
    34. sed -i -e "s/\tGenbank_gene\t/\tGenbank\tgene\t/g" CP059040_gene.gff3
    35. #3796-3754=42--> they are pseudogene since grep "pseudogene" CP059040.gff3 | wc -l = 42
    36. # --------------------------------------------------------------------------------------------------------------------------------------------------
    37. # ---------- PREPARING gff3 file including gene_biotype=protein_coding+gene_biotype=tRNA = total(3754)) and gene_biotype=pseudogene(42) ------------
    38. cp CP059040.gff3 CP059040_backup.gff3
    39. sed -i -e "s/\tGenbank\tgene\t/\tGenbank_gene\t/g" CP059040.gff3
    40. grep "Genbank_gene" CP059040.gff3 > CP059040_gene.gff3
    41. sed -i -e "s/\tGenbank_gene\t/\tGenbank\tgene\t/g" CP059040_gene.gff3
    42. grep "gene_biotype=pseudogene" CP059040.gff3_backup >> CP059040_gene.gff3 #-->3796
    43. #The whole point of the GTF format was to standardise certain aspects that are left open in GFF. Hence, there are many different valid ways to encode the same information in a valid GFF format, and any parser or converter needs to be written specifically for the choices the author of the GFF file made. For example, a GTF file requires the gene ID attribute to be called "gene_id", while in GFF files, it may be "ID", "Gene", something different, or completely missing.
    44. # from gff3 to gtf
    45. sed -i -e "s/\tID=gene-/\tgene_id \"/g" CP059040_gene.gtf
    46. sed -i -e "s/;/\"; /g" CP059040_gene.gtf
    47. sed -i -e "s/=/=\"/g" CP059040_gene.gtf
    48. #sed -i -e "s/\n/\"\n/g" CP059040_gene.gtf
    49. #using editor instead!
    50. #The following is GTF-format.
    51. CP000521.1 Genbank exon 95 1492 . + . transcript_id "gene0"; gene_id "gene0"; Name "A1S_0001"; gbkey "Gene"; gene_biotype "protein_coding"; locus_tag "A1S_0001";
    52. #NZ_MJHA01000001.1 RefSeq region 1 8663 . + . ID=id0;Dbxref=taxon:575584;Name=unnamed1;collected-by=IG Schaub;collection-date=1948;country=USA: Vancouver;culture-collection=ATCC:19606;gbkey=Src;genome=plasmid;isolation-source=urine;lat-lon=37.53 N 75.4 W;map=unlocalized;mol_type=genomic DNA;nat-host=Homo sapiens;plasmid-name=unnamed1;strain=ATCC 19606;type-material=type strain of Acinetobacter baumannii
    53. #NZ_MJHA01000001.1 RefSeq gene 228 746 . - . ID=gene0;Name=BIT33_RS00005;gbkey=Gene;gene_biotype=protein_coding;locus_tag=BIT33_RS00005;old_locus_tag=BIT33_18795
    54. #NZ_MJHA01000001.1 Protein Homology CDS 228 746 . - 0 ID=cds0;Parent=gene0;Dbxref=Genbank:WP_000839337.1;Name=WP_000839337.1;gbkey=CDS;inference=COORDINATES: similar to AA sequence:RefSeq:WP_000839337.1;product=hypothetical protein;protein_id=WP_000839337.1;transl_table=11
    55. ##gff-version 3
    56. ##sequence-region CP059040.1 1 3980852
    57. ##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=470
    58. gffread -E -F --bed CP059040.gff3 -o CP059040.bed #-->3796
    59. ##prepare the GTF-format (see above) --> ERROR! ----> using CP059040.gff3
    60. ##stringtie adeIJ.abx_r1.sorted.bam -o adeIJ.abx_r1.sorted_transcripts.gtf -v -G /media/jhuang/Elements/Data_Tam_RNASeq3/CP059040.gff3 -A adeIJ.abx_r1.sorted.gene_abund.txt -C adeIJ.abx_r1.sorted.bam.cov_refs.gtf -e -b adeIJ.abx_r1.sorted_ballgown
    61. #[01/21 10:57:46] Loading reference annotation (guides)..
    62. #GFF warning: merging adjacent/overlapping segments of gene-H0N29_00815 on CP059040.1 (179715-179786, 179788-180810)
    63. #[01/21 10:57:46] 3796 reference transcripts loaded.
    64. #Default stack size for threads: 8388608
    65. #WARNING: no reference transcripts found for genomic sequence "gi|1906906720|gb|CP059040.1|"! (mismatched reference names?)
    66. #WARNING: no reference transcripts were found for the genomic sequences where reads were mapped!
    67. #Please make sure the -G annotation file uses the same naming convention for the genome sequences.
    68. #[01/21 10:58:30] All threads finished.
    69. # ERROR: failed to find the gene identifier attribute in the 9th column of the provided GTF file.
    70. # The specified gene identifier attribute is 'Name'
    71. # An example of attributes included in your GTF annotation is 'ID=exon-H0N29_00075-1;Parent=rna-H0N29_00075;gbkey=rRNA;locus_tag=H0N29_00075;product=16S ribosomal RNA'
    72. # The program has to termin
    73. # ERROR: failed to find the gene identifier attribute in the 9th column of the provided GTF file.
    74. # The specified gene identifier attribute is 'gene_biotype'
    75. # An example of attributes included in your GTF annotation is 'ID=exon-H0N29_00075-1;Parent=rna-H0N29_00075;gbkey=rRNA;locus_tag=H0N29_00075;product=16S ribosomal RNA'
    76. # The program has to terminate.
    77. #grep "ID=cds-" CP059040.gff3 | wc -l
    78. #grep "ID=exon-" CP059040.gff3 | wc -l
    79. #grep "ID=gene-" CP059040.gff3 | wc -l #the same as H0N29_18980/5=3796
    80. grep "gbkey=" CP059040.gff3 | wc -l 7695
    81. grep "ID=id-" CP059040.gff3 | wc -l 5
    82. grep "locus_tag=" CP059040.gff3 | wc -l 7689
    83. #...
    84. cds 3701 locus_tag=xxxx, no gene_biotype
    85. exon 96 locus_tag=xxxx, no gene_biotype
    86. gene 3796 locus_tag=xxxx, gene_biotype=xxxx,
    87. id (riboswitch+direct_repeat,5) both no --> ignoring them!! # grep "ID=id-" CP059040.gff3
    88. rna 96 locus_tag=xxxx, no gene_biotype
    89. ------------------
    90. 7694
    91. cp CP059040.gff3_backup CP059040.gff3
    92. grep "^##" CP059040.gff3 > CP059040_gene.gff3
    93. grep "ID=gene" CP059040.gff3 >> CP059040_gene.gff3
    94. #!!!!VERY_IMPORTANT!!!!: change type '\tCDS\t' to '\texon\t'!
    95. sed -i -e "s/\tgene\t/\texon\t/g" CP059040_gene.gff3
  5. Preparing the directory trimmed

    1. mkdir trimmed trimmed_unpaired;
    2. for sample_id in AUM_r1 AUM_r2 AUM_r3 Urine_r1 Urine_r2 Urine_r3 MHB_r1 MHB_r2 MHB_r3; do \
    3. for sample_id in MHB_r1 MHB_r2 MHB_r3; do \
    4. java -jar /home/jhuang/Tools/Trimmomatic-0.36/trimmomatic-0.36.jar PE -threads 100 raw_data/${sample_id}_R1.fq.gz raw_data/${sample_id}_R2.fq.gz trimmed/${sample_id}_R1.fq.gz trimmed_unpaired/${sample_id}_R1.fq.gz trimmed/${sample_id}_R2.fq.gz trimmed_unpaired/${sample_id}_R2.fq.gz ILLUMINACLIP:/home/jhuang/Tools/Trimmomatic-0.36/adapters/TruSeq3-PE-2.fa:2:30:10:8:TRUE LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 AVGQUAL:20; done 2> trimmomatic_pe.log;
    5. done
  6. Preparing samplesheet.csv

    1. sample,fastq_1,fastq_2,strandedness
    2. AUM_r1,AUM_r1_R1.fq.gz,AUM_r1_R2.fq.gz,auto
    3. AUM_r2,AUM_r2_R1.fq.gz,AUM_r2_R2.fq.gz,auto
    4. AUM_r3,AUM_r3_R1.fq.gz,AUM_r3_R2.fq.gz,auto
    5. MHB_r1,MHB_r1_R1.fq.gz,MHB_r1_R2.fq.gz,auto
    6. MHB_r2,MHB_r2_R1.fq.gz,MHB_r2_R2.fq.gz,auto
    7. MHB_r3,MHB_r3_R1.fq.gz,MHB_r3_R2.fq.gz,auto
    8. Urine_r1,Urine_r1_R1.fq.gz,Urine_r1_R2.fq.gz,auto
    9. Urine_r2,Urine_r2_R1.fq.gz,Urine_r2_R2.fq.gz,auto
    10. Urine_r3,Urine_r3_R1.fq.gz,Urine_r3_R2.fq.gz,auto
  7. nextflow run

    1. #Example1: http://xgenes.com/article/article-content/157/prepare-virus-gtf-for-nextflow-run/
    2. docker pull nfcore/rnaseq
    3. ln -s /home/jhuang/Tools/nf-core-rnaseq-3.12.0/ rnaseq
    4. #Default: --gtf_group_features 'gene_id' --gtf_extra_attributes 'gene_name' --featurecounts_group_type 'gene_biotype' --featurecounts_feature_type 'exon'
    5. #(host_env) !NOT_WORKING! jhuang@WS-2290C:~/DATA/Data_Tam_RNAseq_2024$ /usr/local/bin/nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results --fasta "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040.fasta" --gff "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040.gff" -profile docker -resume --max_cpus 55 --max_memory 512.GB --max_time 2400.h --save_align_intermeds --save_unaligned --save_reference --aligner 'star_salmon' --gtf_group_features 'gene_id' --gtf_extra_attributes 'gene_name' --featurecounts_group_type 'gene_biotype' --featurecounts_feature_type 'transcript'
    6. # -- DEBUG_1 (CDS --> exon in CP059040.gff) --
    7. #Checking the record (see below) in results/genome/CP059040.gtf
    8. #In ./results/genome/CP059040.gtf e.g. "CP059040.1 Genbank transcript 1 1398 . + . transcript_id "gene-H0N29_00005"; gene_id "gene-H0N29_00005"; gene_name "dnaA"; Name "dnaA"; gbkey "Gene"; gene "dnaA"; gene_biotype "protein_coding"; locus_tag "H0N29_00005";"
    9. #--featurecounts_feature_type 'transcript' returns only the tRNA results
    10. #Since the tRNA records have "transcript and exon". In gene records, we have "transcript and CDS". replace the CDS with exon
    11. grep -P "\texon\t" CP059040.gff | sort | wc -l #96
    12. grep -P "cmsearch\texon\t" CP059040.gff | wc -l #=10 ignal recognition particle sRNA small typ, transfer-messenger RNA, 5S ribosomal RNA
    13. grep -P "Genbank\texon\t" CP059040.gff | wc -l #=12 16S and 23S ribosomal RNA
    14. grep -P "tRNAscan-SE\texon\t" CP059040.gff | wc -l #tRNA 74
    15. wc -l star_salmon/AUM_r3/quant.genes.sf #--featurecounts_feature_type 'transcript' results in 96 records!
    16. grep -P "\tCDS\t" CP059040.gff | wc -l #3701
    17. sed 's/\tCDS\t/\texon\t/g' CP059040.gff > CP059040_m.gff
    18. grep -P "\texon\t" CP059040_m.gff | sort | wc -l #3797
    19. # -- DEBUG_2: combination of 'CP059040_m.gff' and 'exon' results in ERROR, using 'transcript' instead!
    20. --gff "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040_m.gff" --featurecounts_feature_type 'transcript'
    21. # ---- SUCCESSFUL with directly downloaded gff3 and fasta from NCBI using docker after replacing 'CDS' with 'exon' ----
    22. (host_env) /usr/local/bin/nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results --fasta "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040.fasta" --gff "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040_m.gff" -profile docker -resume --max_cpus 55 --max_memory 512.GB --max_time 2400.h --save_align_intermeds --save_unaligned --save_reference --aligner 'star_salmon' --gtf_group_features 'gene_id' --gtf_extra_attributes 'gene_name' --featurecounts_group_type 'gene_biotype' --featurecounts_feature_type 'transcript'
    23. # -- DEBUG_3: make sure the header of fasta is the same to the *_m.gff file
  8. Import data and pca-plot

    1. #mamba activate r_env
    2. #install.packages("ggfun")
    3. # Import the required libraries
    4. library("AnnotationDbi")
    5. library("clusterProfiler")
    6. library("ReactomePA")
    7. library(gplots)
    8. library(tximport)
    9. library(DESeq2)
    10. #library("org.Hs.eg.db")
    11. library(dplyr)
    12. library(tidyverse)
    13. #install.packages("devtools")
    14. #devtools::install_version("gtable", version = "0.3.0")
    15. library(gplots)
    16. library("RColorBrewer")
    17. #install.packages("ggrepel")
    18. library("ggrepel")
    19. # install.packages("openxlsx")
    20. library(openxlsx)
    21. library(EnhancedVolcano)
    22. library(DESeq2)
    23. setwd("~/DATA/Data_Tam_RNAseq_2024/results/star_salmon")
    24. # Define paths to your Salmon output quantification files
    25. files <- c("AUM_r1" = "./AUM_r1/quant.sf",
    26. "AUM_r2" = "./AUM_r2/quant.sf",
    27. "AUM_r3" = "./AUM_r3/quant.sf",
    28. "Urine_r1" = "./Urine_r1/quant.sf",
    29. "Urine_r2" = "./Urine_r2/quant.sf",
    30. "Urine_r3" = "./Urine_r3/quant.sf",
    31. "MHB_r1" = "./MHB_r1/quant.sf",
    32. "MHB_r2" = "./MHB_r2/quant.sf",
    33. "MHB_r3" = "./MHB_r3/quant.sf")
    34. # Import the transcript abundance data with tximport
    35. txi <- tximport(files, type = "salmon", txIn = TRUE, txOut = TRUE)
    36. # Define the replicates and condition of the samples
    37. replicate <- factor(c("r1", "r2", "r3", "r1", "r2", "r3", "r1", "r2", "r3"))
    38. condition <- factor(c("AUM","AUM","AUM", "Urine","Urine","Urine", "MHB","MHB","MHB"))
    39. # Define the colData for DESeq2
    40. colData <- data.frame(condition=condition, replicate=replicate, row.names=names(files))
    41. # -- transcript-level count data (x2) --
    42. # Create DESeqDataSet object
    43. dds <- DESeqDataSetFromTximport(txi, colData=colData, design=~condition)
    44. write.csv(counts(dds), file="transcript_counts.csv")
    45. # -- gene-level count data (x2) --
    46. # Read in the tx2gene map from salmon_tx2gene.tsv
    47. tx2gene <- read.table("salmon_tx2gene.tsv", header=FALSE, stringsAsFactors=FALSE)
    48. # Set the column names
    49. colnames(tx2gene) <- c("transcript_id", "gene_id", "gene_name")
    50. # Remove the gene_name column if not needed
    51. tx2gene <- tx2gene[,1:2]
    52. # Import and summarize the Salmon data with tximport
    53. txi <- tximport(files, type = "salmon", tx2gene = tx2gene, txOut = FALSE)
    54. # Continue with the DESeq2 workflow as before...
    55. colData <- data.frame(condition=condition, replicate=replicate, row.names=names(files))
    56. dds <- DESeqDataSetFromTximport(txi, colData=colData, design=~condition+replicate)
    57. #dds <- dds[rowSums(counts(dds) > 3) > 2, ] #3796->3487
    58. write.csv(counts(dds, normalized=FALSE), file="gene_counts.csv")
    59. dim(counts(dds))
    60. head(counts(dds), 10)
    61. rld <- rlogTransformation(dds)
    62. #We don't need to run DESeq(dds) before estimateSizeFactors(dds). In fact, the typical workflow in DESeq2 is the opposite: we usually run estimateSizeFactors(dds) (and other preprocessing functions) before running the main DESeq(dds) function.
    63. #The estimateSizeFactors function is used to calculate size factors for normalization, which corrects for differences in library size (i.e., the number of read counts) between samples. This normalization step is crucial to ensure that differences in gene expression aren't merely due to differences in sequencing depth between samples.
    64. #The DESeq function, on the other hand, performs the main differential expression analysis, comparing gene expression between different conditions or groups.
    65. #So, the typical workflow is:
    66. # - Create the DESeqDataSet object.
    67. # - Use estimateSizeFactors to normalize for library size.
    68. # - (Optionally, estimate dispersion with estimateDispersions if not using the full DESeq function later.)
    69. # - Use DESeq for the differential expression analysis.
    70. # - However, it's worth noting that if you run the main DESeq function directly after creating the DESeqDataSet object, it will automatically perform the normalization (using estimateSizeFactors) and dispersion estimation steps for you. In that case, there's no need to run estimateSizeFactors separately before DESeq.
    71. # draw simple pca and heatmap
    72. #mat <- assay(rld)
    73. #mm <- model.matrix(~condition, colData(rld))
    74. #mat <- limma::removeBatchEffect(mat, batch=rld$batch, design=mm)
    75. #assay(rld) <- mat
    76. # -- pca --
    77. png("pca.png", 1200, 800)
    78. plotPCA(rld, intgroup=c("condition"))
    79. dev.off()
    80. # -- heatmap --
    81. png("heatmap.png", 1200, 800)
    82. distsRL <- dist(t(assay(rld)))
    83. mat <- as.matrix(distsRL)
    84. hc <- hclust(distsRL)
    85. hmcol <- colorRampPalette(brewer.pal(9,"GnBu"))(100)
    86. heatmap.2(mat, Rowv=as.dendrogram(hc),symm=TRUE, trace="none",col = rev(hmcol), margin=c(13, 13))
    87. dev.off()
  9. (Optional (ERROR-->need to be debugged!) ) estimate size factors and dispersion values.

    1. #Size Factors: These are used to normalize the read counts across different samples. The size factor for a sample accounts for differences in sequencing depth (i.e., the total number of reads) and other technical biases between samples. After normalization with size factors, the counts should be comparable across samples. Size factors are usually calculated in a way that they reflect the median or mean ratio of gene expression levels between samples, assuming that most genes are not differentially expressed.
    2. #Dispersion: This refers to the variability or spread of gene expression measurements. In RNA-seq data analysis, each gene has its own dispersion value, which reflects how much the counts for that gene vary between different samples, more than what would be expected just due to the Poisson variation inherent in counting. Dispersion is important for accurately modeling the data and for detecting differentially expressed genes.
    3. #So in summary, size factors are specific to samples (used to make counts comparable across samples), and dispersion values are specific to genes (reflecting variability in gene expression).
    4. sizeFactors(dds)
    5. #NULL
    6. # Estimate size factors
    7. dds <- estimateSizeFactors(dds)
    8. # Estimate dispersions
    9. dds <- estimateDispersions(dds)
    10. #> sizeFactors(dds)
    11. #control_r1 control_r2 HSV.d2_r1 HSV.d2_r2 HSV.d4_r1 HSV.d4_r2 HSV.d6_r1
    12. #2.3282468 2.0251928 1.8036883 1.3767551 0.9341929 1.0911693 0.5454526
    13. #HSV.d6_r2 HSV.d8_r1 HSV.d8_r2
    14. #0.4604461 0.5799834 0.6803681
    15. # (DEBUG) If avgTxLength is Necessary
    16. #To simplify the computation and ensure sizeFactors are calculated:
    17. assays(dds)$avgTxLength <- NULL
    18. dds <- estimateSizeFactors(dds)
    19. sizeFactors(dds)
    20. #If you want to retain avgTxLength but suspect it is causing issues, you can explicitly instruct DESeq2 to compute size factors without correcting for library size with average transcript lengths:
    21. dds <- estimateSizeFactors(dds, controlGenes = NULL, use = FALSE)
    22. sizeFactors(dds)
    23. # If alone with virus data, the following BUG occured:
    24. #Still NULL --> BUG --> using manual calculation method for sizeFactor calculation!
    25. HeLa_TO_r1 HeLa_TO_r2
    26. 0.9978755 1.1092227
    27. data.frame(genes = rownames(dds), dispersions = dispersions(dds))
    28. #Given the raw counts, the control_r1 and control_r2 samples seem to have a much lower sequencing depth (total read count) than the other samples. Therefore, when normalization methods are applied, the normalization factors for these control samples will be relatively high, boosting the normalized counts.
    29. 1/0.9978755=1.002129023
    30. 1/1.1092227=
    31. #bamCoverage --bam ../markDuplicates/${sample}Aligned.sortedByCoord.out.bam -o ${sample}_norm.bw --binSize 10 --scaleFactor --effectiveGenomeSize 2864785220
    32. bamCoverage --bam ../markDuplicates/HeLa_TO_r1Aligned.sortedByCoord.out.markDups.bam -o HeLa_TO_r1.bw --binSize 10 --scaleFactor 1.002129023 --effectiveGenomeSize 2864785220
    33. bamCoverage --bam ../markDuplicates/HeLa_TO_r2Aligned.sortedByCoord.out.markDups.bam -o HeLa_TO_r2.bw --binSize 10 --scaleFactor 0.901532217 --effectiveGenomeSize 2864785220
    34. raw_counts <- counts(dds)
    35. normalized_counts <- counts(dds, normalized=TRUE)
    36. #write.table(raw_counts, file="raw_counts.txt", sep="\t", quote=F, col.names=NA)
    37. #write.table(normalized_counts, file="normalized_counts.txt", sep="\t", quote=F, col.names=NA)
    38. #convert bam to bigwig using deepTools by feeding inverse of DESeq’s size Factor
    39. estimSf <- function (cds){
    40. # Get the count matrix
    41. cts <- counts(cds)
    42. # Compute the geometric mean
    43. geomMean <- function(x) prod(x)^(1/length(x))
    44. # Compute the geometric mean over the line
    45. gm.mean <- apply(cts, 1, geomMean)
    46. # Zero values are set to NA (avoid subsequentcdsdivision by 0)
    47. gm.mean[gm.mean == 0] <- NA
    48. # Divide each line by its corresponding geometric mean
    49. # sweep(x, MARGIN, STATS, FUN = "-", check.margin = TRUE, ...)
    50. # MARGIN: 1 or 2 (line or columns)
    51. # STATS: a vector of length nrow(x) or ncol(x), depending on MARGIN
    52. # FUN: the function to be applied
    53. cts <- sweep(cts, 1, gm.mean, FUN="/")
    54. # Compute the median over the columns
    55. med <- apply(cts, 2, median, na.rm=TRUE)
    56. # Return the scaling factor
    57. return(med)
    58. }
    59. #https://dputhier.github.io/ASG/practicals/rnaseq_diff_Snf2/rnaseq_diff_Snf2.html
    60. #http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#data-transformations-and-visualization
    61. #https://hbctraining.github.io/DGE_workshop/lessons/02_DGE_count_normalization.html
    62. #https://hbctraining.github.io/DGE_workshop/lessons/04_DGE_DESeq2_analysis.html
    63. #https://genviz.org/module-04-expression/0004/02/01/DifferentialExpression/
    64. #DESeq2’s median of ratios [1]
    65. #EdgeR’s trimmed mean of M values (TMM) [2]
    66. #http://www.nathalievialaneix.eu/doc/html/TP1_normalization.html #very good website!
    67. test_normcount <- sweep(raw_counts, 2, sizeFactors(dds), "/")
    68. sum(test_normcount != normalized_counts)
  10. Select the differentially expressed genes

    1. #https://galaxyproject.eu/posts/2020/08/22/three-steps-to-galaxify-your-tool/
    2. #https://www.biostars.org/p/282295/
    3. #https://www.biostars.org/p/335751/
    4. #> dds$condition
    5. #[1] AUM AUM AUM Urine Urine Urine MHB MHB MHB
    6. #Levels: AUM MHB Urine
    7. #CONSOLE: mkdir star_salmon/degenes
    8. setwd("degenes")
    9. #---- relevel to control ----
    10. dds$condition <- relevel(dds$condition, "MHB")
    11. dds = DESeq(dds, betaPrior=FALSE)
    12. resultsNames(dds)
    13. clist <- c("AUM_vs_MHB","Urine_vs_MHB")
    14. for (i in clist) {
    15. contrast = paste("condition", i, sep="_")
    16. res = results(dds, name=contrast)
    17. res <- res[!is.na(res$log2FoldChange),]
    18. res_df <- as.data.frame(res)
    19. write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(i, "all.txt", sep="-"))
    20. up <- subset(res_df, padj<=0.05 & log2FoldChange>=1.35)
    21. down <- subset(res_df, padj<=0.05 & log2FoldChange<=-1.35)
    22. write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(i, "up.txt", sep="-"))
    23. write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(i, "down.txt", sep="-"))
    24. }
    25. # -- Under host-env --
    26. grep -P "\tgene\t" CP059040.gff > CP059040_gene.gff
    27. python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040_gene.gff AUM_vs_MHB-all.txt AUM_vs_MHB-all.csv
    28. python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040_gene.gff AUM_vs_MHB-up.txt AUM_vs_MHB-up.csv
    29. python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040_gene.gff AUM_vs_MHB-down.txt AUM_vs_MHB-down.csv
    30. python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040_gene.gff Urine_vs_MHB-all.txt Urine_vs_MHB-all.csv
    31. python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040_gene.gff Urine_vs_MHB-up.txt Urine_vs_MHB-up.csv
    32. python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040_gene.gff Urine_vs_MHB-down.txt Urine_vs_MHB-down.csv
    33. #for i in AUM_vs_MHB Urine_vs_MHB; do
    34. # echo "contrast = paste(\"condition\", \"${i}\", sep=\"_\")"
    35. # echo "res = results(dds, name=contrast)"
    36. # echo "res <- res[!is.na(res$log2FoldChange),]"
    37. # echo "res_df <- as.data.frame(res)"
    38. # #selectLab = selectLab_italics,
    39. # echo "png(\"${i}.png\",width=1200, height=1000)"
    40. # #legendPosition = 'right',legendLabSize = 12, arrowheads = FALSE,
    41. # #subtitle=expression(~Delta*\"$(echo $i | cut -d'_' -f1) versus \" *~Delta*\"$(echo $i | cut -d'_' -f3)\"))"
    42. # echo "EnhancedVolcano(res, lab = rownames(res),x = 'log2FoldChange',y = 'padj', pCutoff=5e-2, FCcutoff=1.2, title='', subtitleLabSize = 18, pointSize = 3.0, labSize = 5.0, colAlpha=1, legendIconSize = 4.0, drawConnectors = TRUE, widthConnectors = 0.5, colConnectors = 'black', subtitle=expression(\"$(echo $i | cut -d'_' -f1) versus $(echo $i | cut -d'_' -f3)\"))"
    43. # echo "dev.off()"
    44. #done
    45. res <- read.csv("AUM_vs_MHB-all.csv")
    46. # Replace empty GeneName with modified GeneID
    47. res$GeneName <- ifelse(
    48. res$GeneName == "" | is.na(res$GeneName),
    49. gsub("gene-", "", res$GeneID),
    50. res$GeneName
    51. )
    52. duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]
    53. #print(duplicated_genes)
    54. # [1] "bfr" "lipA" "ahpF" "pcaF" "alr" "pcaD" "cydB" "lpdA" "pgaC" "ppk1"
    55. #[11] "pcaF" "tuf" "galE" "murI" "yccS" "rrf" "rrf" "arsB" "ptsP" "umuD"
    56. #[21] "map" "pgaB" "rrf" "rrf" "rrf" "pgaD" "uraH" "benE"
    57. #res[res$GeneName == "bfr", ]
    58. #1st_strategy First occurrence is kept and Subsequent duplicates are removed
    59. #res <- res[!duplicated(res$GeneName), ]
    60. #2nd_strategy keep the row with the smallest padj value for each GeneName
    61. res <- res %>%
    62. group_by(GeneName) %>%
    63. slice_min(padj, with_ties = FALSE) %>%
    64. ungroup()
    65. res <- as.data.frame(res)
    66. # Sort res first by padj (ascending) and then by log2FoldChange (descending)
    67. res <- res[order(res$padj, -res$log2FoldChange), ]
    68. # Assuming res is your dataframe and already processed
    69. # Filter up-regulated genes: log2FoldChange > 2 and padj < 1e-2
    70. up_regulated <- res[res$log2FoldChange > 2 & res$padj < 1e-2, ]
    71. # Filter down-regulated genes: log2FoldChange < -2 and padj < 1e-2
    72. down_regulated <- res[res$log2FoldChange < -2 & res$padj < 1e-2, ]
    73. # Create a new workbook
    74. wb <- createWorkbook()
    75. # Add the complete dataset as the first sheet
    76. addWorksheet(wb, "Complete_Data")
    77. writeData(wb, "Complete_Data", res)
    78. # Add the up-regulated genes as the second sheet
    79. addWorksheet(wb, "Up_Regulated")
    80. writeData(wb, "Up_Regulated", up_regulated)
    81. # Add the down-regulated genes as the third sheet
    82. addWorksheet(wb, "Down_Regulated")
    83. writeData(wb, "Down_Regulated", down_regulated)
    84. # Save the workbook to a file
    85. saveWorkbook(wb, "Gene_Expression_AUM_vs_MHB.xlsx", overwrite = TRUE)
    86. # Set the 'GeneName' column as row.names
    87. rownames(res) <- res$GeneName
    88. # Drop the 'GeneName' column since it's now the row names
    89. res$GeneName <- NULL
    90. head(res)
    91. ## Ensure the data frame matches the expected format
    92. ## For example, it should have columns: log2FoldChange, padj, etc.
    93. #res <- as.data.frame(res)
    94. ## Remove rows with NA in log2FoldChange (if needed)
    95. #res <- res[!is.na(res$log2FoldChange),]
    96. # Replace padj = 0 with a small value
    97. res$padj[res$padj == 0] <- 1e-150
    98. #library(EnhancedVolcano)
    99. # Assuming res is already sorted and processed
    100. png("AUM_vs_MHB.png", width=1200, height=2000)
    101. #max.overlaps = 10
    102. EnhancedVolcano(res,
    103. lab = rownames(res),
    104. x = 'log2FoldChange',
    105. y = 'padj',
    106. pCutoff = 1e-2,
    107. FCcutoff = 2,
    108. title = '',
    109. subtitleLabSize = 18,
    110. pointSize = 3.0,
    111. labSize = 5.0,
    112. colAlpha = 1,
    113. legendIconSize = 4.0,
    114. drawConnectors = TRUE,
    115. widthConnectors = 0.5,
    116. colConnectors = 'black',
    117. subtitle = expression("AUM versus MHB"))
    118. dev.off()
    119. res <- read.csv("Urine_vs_MHB-all.csv")
    120. # Replace empty GeneName with modified GeneID
    121. res$GeneName <- ifelse(
    122. res$GeneName == "" | is.na(res$GeneName),
    123. gsub("gene-", "", res$GeneID),
    124. res$GeneName
    125. )
    126. duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]
    127. res <- res %>%
    128. group_by(GeneName) %>%
    129. slice_min(padj, with_ties = FALSE) %>%
    130. ungroup()
    131. res <- as.data.frame(res)
    132. # Sort res first by padj (ascending) and then by log2FoldChange (descending)
    133. res <- res[order(res$padj, -res$log2FoldChange), ]
    134. # Assuming res is your dataframe and already processed
    135. # Filter up-regulated genes: log2FoldChange > 2 and padj < 1e-2
    136. up_regulated <- res[res$log2FoldChange > 2 & res$padj < 1e-2, ]
    137. # Filter down-regulated genes: log2FoldChange < -2 and padj < 1e-2
    138. down_regulated <- res[res$log2FoldChange < -2 & res$padj < 1e-2, ]
    139. # Create a new workbook
    140. wb <- createWorkbook()
    141. # Add the complete dataset as the first sheet
    142. addWorksheet(wb, "Complete_Data")
    143. writeData(wb, "Complete_Data", res)
    144. # Add the up-regulated genes as the second sheet
    145. addWorksheet(wb, "Up_Regulated")
    146. writeData(wb, "Up_Regulated", up_regulated)
    147. # Add the down-regulated genes as the third sheet
    148. addWorksheet(wb, "Down_Regulated")
    149. writeData(wb, "Down_Regulated", down_regulated)
    150. # Save the workbook to a file
    151. saveWorkbook(wb, "Gene_Expression_Urine_vs_MHB.xlsx", overwrite = TRUE)
    152. # Set the 'GeneName' column as row.names
    153. rownames(res) <- res$GeneName
    154. # Drop the 'GeneName' column since it's now the row names
    155. res$GeneName <- NULL
    156. head(res)
    157. ## Ensure the data frame matches the expected format
    158. ## For example, it should have columns: log2FoldChange, padj, etc.
    159. #res <- as.data.frame(res)
    160. ## Remove rows with NA in log2FoldChange (if needed)
    161. #res <- res[!is.na(res$log2FoldChange),]
    162. # Replace padj = 0 with a small value
    163. res$padj[res$padj == 0] <- 1e-305
    164. #library(EnhancedVolcano)
    165. # Assuming res is already sorted and processed
    166. png("Urine_vs_MHB.png", width=1200, height=2000)
    167. #max.overlaps = 10
    168. EnhancedVolcano(res,
    169. lab = rownames(res),
    170. x = 'log2FoldChange',
    171. y = 'padj',
    172. pCutoff = 1e-2,
    173. FCcutoff = 2,
    174. title = '',
    175. subtitleLabSize = 18,
    176. pointSize = 3.0,
    177. labSize = 5.0,
    178. colAlpha = 1,
    179. legendIconSize = 4.0,
    180. drawConnectors = TRUE,
    181. widthConnectors = 0.5,
    182. colConnectors = 'black',
    183. subtitle = expression("Urine versus MHB"))
    184. dev.off()
  11. Report

    1. Attached are the results of the analysis.
    2. In the Urine_vs_MHB comparison, we identified a total of 259 up-regulated genes (log2FoldChange > 2 and padj < 1e-2) and 138 down-regulated genes (log2FoldChange < -2 and padj < 1e-2) (please refer to the attached volcano plot and Excel files). Notably, the following genes have a p-adjusted value of 0, indicating very high confidence in their differential expression. The bas-series genes (basA, basB, basC, basD, basE, basJ) are particularly prominent:
    3. GeneName GeneID baseMean log2FoldChange lfcSE stat pvalue padj
    4. basJ gene-H0N29_05120 11166.90 11.42 0.30 38.27 0 0
    5. basE gene-H0N29_05085 12006.52 10.45 0.23 45.76 0 0
    6. basD gene-H0N29_05080 12217.80 10.15 0.24 42.42 0 0
    7. bauA gene-H0N29_05070 25280.68 9.55 0.19 51.48 0 0
    8. basA gene-H0N29_05040 9750.68 9.02 0.18 48.90 0 0
    9. basC gene-H0N29_05075 5034.14 8.58 0.21 40.07 0 0
    10. H0N29_08320 gene-H0N29_08320 4935.78 7.87 0.20 40.01 0 0
    11. barB gene-H0N29_05105 5187.29 7.81 0.18 43.39 0 0
    12. H0N29_09380 gene-H0N29_09380 3477.26 7.41 0.19 38.91 0 0
    13. H0N29_13950 gene-H0N29_13950 13959.05 6.85 0.15 45.70 0 0
    14. H0N29_10825 gene-H0N29_10825 3664.70 6.44 0.17 37.59 0 0
    15. H0N29_10790 gene-H0N29_10790 2574.12 6.41 0.17 37.86 0 0
    16. H0N29_10010 gene-H0N29_10010 9376.84 -8.14 0.19 -43.70 0 0
    17. In the AUM_vs_MHB comparison, we identified a total of 149 up-regulated genes (log2FoldChange > 2 and padj < 1e-2) and 65 down-regulated genes (log2FoldChange < -2 and padj < 1e-2) (please refer to the attached volcano plot and Excel files). The following genes also show a p-adjusted value of 0, indicating very high confidence in their differential expression:
    18. GeneName GeneID baseMean log2FoldChange lfcSE stat pvalue padj
    19. putA gene-H0N29_09870 36100.24 -7.25 0.15 -49.78 0 0
    20. H0N29_10010 gene-H0N29_10010 9376.84 -7.43 0.18 -41.96 0 0
    21. To ensure proper visualization, I replaced the padj = 0 values with small numbers: 1e-305 for Urine_vs_MHB and 1e-150 for AUM_vs_MHB.
    22. We have now identified the significantly expressed genes. If you would like any further analysis based on these genes or need additional plots, please let me know.
  12. (TODO) clustering the genes and draw heatmap

    1. for i in HSV.d2_vs_control HSV.d4_vs_control HSV.d6_vs_control HSV.d8_vs_control HSV.d4_vs_HSV.d2 HSV.d6_vs_HSV.d2 HSV.d8_vs_HSV.d2 HSV.d6_vs_HSV.d4 HSV.d8_vs_HSV.d4 HSV.d8_vs_HSV.d6; do echo "cut -d',' -f1-1 ${i}-up_annotated.txt > ${i}-up.id"; echo "cut -d',' -f1-1 ${i}-down_annotated.txt > ${i}-down.id"; done
    2. cat *.id | sort -u > ids
    3. #add Gene_Id in the first line, delete the ""
    4. GOI <- read.csv("ids")$Gene_Id #4647
    5. RNASeq.NoCellLine <- assay(rld)
    6. #install.packages("gplots")
    7. library("gplots")
    8. #clustering methods: "ward.D", "ward.D2", "single", "complete", "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC). pearson or spearman
    9. datamat = RNASeq.NoCellLine[GOI, ]
    10. #datamat = RNASeq.NoCellLine
    11. write.csv(as.data.frame(datamat), file ="gene_expressions.txt")
    12. constant_rows <- apply(datamat, 1, function(row) var(row) == 0)
    13. if(any(constant_rows)) {
    14. cat("Removing", sum(constant_rows), "constant rows.\n")
    15. datamat <- datamat[!constant_rows, ]
    16. }
    17. hr <- hclust(as.dist(1-cor(t(datamat), method="pearson")), method="complete")
    18. hc <- hclust(as.dist(1-cor(datamat, method="spearman")), method="complete")
    19. mycl = cutree(hr, h=max(hr$height)/1.05)
    20. mycol = c("YELLOW", "BLUE", "ORANGE", "MAGENTA", "CYAN", "RED", "GREEN", "MAROON", "LIGHTBLUE", "PINK", "MAGENTA", "LIGHTCYAN", "LIGHTRED", "LIGHTGREEN");
    21. mycol = mycol[as.vector(mycl)]
    22. #png("DEGs_heatmap.png", width=900, height=800)
    23. #cex.lab=10, labRow="",
    24. png("DEGs_heatmap.png", width=800, height=1000)
    25. heatmap.2(as.matrix(datamat),Rowv=as.dendrogram(hr),Colv = NA, dendrogram = 'row',labRow="",
    26. scale='row',trace='none',col=bluered(75), cexCol=1.8,
    27. RowSideColors = mycol, margins=c(10,2), cexRow=1.5, srtCol=30, lhei = c(1, 8), lwid=c(2, 8)) #rownames(datamat)
    28. #heatmap.2(datamat, Rowv=as.dendrogram(hr), col=bluered(75), scale="row", RowSideColors=mycol, trace="none", margin=c(5,5), sepwidth=c(0,0), dendrogram = 'row', Colv = 'false', density.info='none', labRow="", srtCol=30, lhei=c(0.1,2))
    29. dev.off()
    30. #### cluster members #####
    31. write.csv(names(subset(mycl, mycl == '1')),file='cluster1_YELLOW.txt')
    32. write.csv(names(subset(mycl, mycl == '2')),file='cluster2_DARKBLUE.txt')
    33. write.csv(names(subset(mycl, mycl == '3')),file='cluster3_DARKORANGE.txt')
    34. write.csv(names(subset(mycl, mycl == '4')),file='cluster4.txt')
    35. #~/Tools/csv2xls-0.4/csv_to_xls.py cluster*.txt -d',' -o DEGs_heatmap_cluster_members.xls
    36. ~/Tools/csv2xls-0.4/csv_to_xls.py \
    37. significant_gene_expressions.txt \
    38. -d',' -o DEGs_heatmap_expression_data.xls;
    39. #### cluster members (advanced) #####
    40. subset_1<-names(subset(mycl, mycl == '1'))
    41. data <- as.data.frame(datamat[rownames(datamat) %in% subset_1, ]) #2575
    42. subset_2<-names(subset(mycl, mycl == '2'))
    43. data <- as.data.frame(datamat[rownames(datamat) %in% subset_2, ]) #1855
    44. subset_3<-names(subset(mycl, mycl == '3'))
    45. data <- as.data.frame(datamat[rownames(datamat) %in% subset_3, ]) #217
    46. subset_4<-names(subset(mycl, mycl == '4'))
    47. data <- as.data.frame(datamat[rownames(datamat) %in% subset_4, ]) #
    48. subset_5<-names(subset(mycl, mycl == '5'))
    49. data <- as.data.frame(datamat[rownames(datamat) %in% subset_5, ]) #
    50. # Initialize an empty data frame for the annotated data
    51. annotated_data <- data.frame()
    52. # Determine total number of genes
    53. total_genes <- length(rownames(data))
    54. # Loop through each gene to annotate
    55. for (i in 1:total_genes) {
    56. gene <- rownames(data)[i]
    57. result <- getBM(attributes = c('ensembl_gene_id', 'external_gene_name', 'gene_biotype', 'entrezgene_id', 'chromosome_name', 'start_position', 'end_position', 'strand', 'description'),
    58. filters = 'ensembl_gene_id',
    59. values = gene,
    60. mart = ensembl)
    61. # If multiple rows are returned, take the first one
    62. if (nrow(result) > 1) {
    63. result <- result[1, ]
    64. }
    65. # Check if the result is empty
    66. if (nrow(result) == 0) {
    67. result <- data.frame(ensembl_gene_id = gene,
    68. external_gene_name = NA,
    69. gene_biotype = NA,
    70. entrezgene_id = NA,
    71. chromosome_name = NA,
    72. start_position = NA,
    73. end_position = NA,
    74. strand = NA,
    75. description = NA)
    76. }
    77. # Transpose expression values
    78. expression_values <- t(data.frame(t(data[gene, ])))
    79. colnames(expression_values) <- colnames(data)
    80. # Combine gene information and expression data
    81. combined_result <- cbind(result, expression_values)
    82. # Append to the final dataframe
    83. annotated_data <- rbind(annotated_data, combined_result)
    84. # Print progress every 100 genes
    85. if (i %% 100 == 0) {
    86. cat(sprintf("Processed gene %d out of %d\n", i, total_genes))
    87. }
    88. }
    89. # Save the annotated data to a new CSV file
    90. #write.csv(annotated_data, "cluster1_YELLOW.csv", row.names=FALSE)
    91. write.csv(annotated_data, "cluster2_DARKBLUE.csv", row.names=FALSE)
    92. write.csv(annotated_data, "cluster3_DARKORANGE.csv", row.names=FALSE)
    93. write.csv(annotated_data, "cluster4_DARKMAGENTA.csv", row.names=FALSE)
    94. write.csv(annotated_data, "cluster5_DARKCYAN.csv", row.names=FALSE)
    95. #~/Tools/csv2xls-0.4/csv_to_xls.py cluster*.csv -d',' -o DEGs_heatmap_clusters.xls

like unlike

点赞本文的读者

还没有人对此文章表态


本文有评论

没有评论

看文章,发评论,不要沉默


© 2023 XGenes.com Impressum