12. Motifs in DNase I peaks#

Note

This exercise is optional!

12.1. Description#

The exercise for this week is to compare the relation between transcription factor motif presence and DNase I signal in different cell types.

The result of your code should be a figure like the following (the values here are not real, but only shown for illustrative purposes):

Figure 1: Example heatmap for this exercise.

This heatmap shows, for every cell type and motif combination, the difference of the mean signal for regions with a motif compared to regions without a motif.

12.1.1. Data#

In the directory /content/gdrive/MyDrive/CFB_files/dnase you will find two files:

  • DNase_table.tsv

  • motifs.txt

The file DNase_table.tsv looks like this:

Figure 2: Contents of DNase_table.tsv.

This tab-separated file contains DNase I signals for six hematopoietic cell types in 10,000 genomic regions. The DNase I reads were counted, log2 transformed and normalized by scaling.

The file motifs.txt contains the name and consensus sequence for seven motifs.

Use the shell command wget to download UCSC Genome Browser’s hg38 human reference sequence. Note that this is a gzip archive (.gz) of the file hg38.fa. Use gunzip to extract hg38.fa from the archive:

!wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
!gunzip hg38.fa.gz

You now have hg38.fa in your Google Colab /content folder. Recall that this folder is temporary and will be cleared in a few hours. You can always re-download the file, or you can save it in your Google Drive. However, this file is ca. 3 Gb, so make sure you have enough space on your Google Drive if you choose to save it.

12.1.2. Retrieving sequences from a genome FASTA file#

You can use the pyfaidx module to retrieve specific sequences from a FASTA file based on genomic coordinates. See the documentation of the module here.

First, you create a FASTA object.

!pip install pyfaidx

from pyfaidx import Fasta
genome = Fasta("hg38.fa")

Then, you can retrieve sequences in a dictionary-like approach.

my_seq = genome["chr2"][5000000:5000020]
print(my_seq.seq)
TCATGACCATAGAGAACAGA

12.1.3. Approach#

Think about your approach before you start writing code! You will have to:

  • scan sequences with the motif consensus sequences;

  • determine, for every motif, which sequences have a match to the motif;

  • calculate the mean signal, depending on the motif occurrence;

  • calculate the difference in mean signal of sequences with a motif match and without a motif match.

12.1.4. Questions#

  • Make a correlation plot of the DNase signal in these hematopoietic cel types? Is this what you expect?

  • Which motifs are specific for certain cell types?

  • Does this agree with the literature?