Midterm test¶
Before you start¶
Please write one single python script.
IMPORTANT: Add your name and ID (matricola) on top of each .py file!
Problem 1¶
Given a list of positive integers (possibly repeated and unsorted) in the range [1,N], write a function that finds the missing values and returns them as a list. Note that the function should not crash if the list is empty. A warning should also be printed in case the user by mistake had negative numbers in the list.
Ex.
S = [1,9,7, 7, 4, 3, 3, 3]
S1 = list(range(10))
print(find_missing(S))
print(find_missing(S1))
print(find_missing([]))
S2 = [1, -72, 4, -3, -3, 3,10]
M = find_missing(S2)
print(M)
should return:
[2, 5, 6, 8]
[]
Warning: list is empty. Returning None
None
Warning -72 is <0. Ignoring it.
Warning -3 is <0. Ignoring it.
Warning -3 is <0. Ignoring it.
[2, 5, 6, 7, 8, 9]
Problem 2¶
The .agp file data_reduced.agp
available here is a compact representation on how a set of assembled contigs made it into the scaffolds. The first few lines are reported below:
ScaffID s_start s_end type contig c_start c_end c_strand
scaffold_1 1 120953 W scf7180000021845 1 120953 -
scaffold_1 120954 121453 N 500 scaffold yes na
scaffold_1 121454 1026498 W scf7180000018491_2 1 905045 +
scaffold_1 1026499 1026998 N 500 scaffold yes na
In particular, the first row states that scaffold_1
from position 1
to 120953
has been built using the sequence of the contig scf7180000021845
from position 1
to 120953
in reverse strand (-
) which means that the sequence has to be reverse-complemented. The second row states that in scaffold_1
positions 120954
to 121453
are a gap made of 500 N
(note the 4th column is N
rather W
that stands for whole genome sequence).
Let’s suppose to have three sequences \(s1="ATAATA"\), \(s2="AAA"\) and \(s3="CCAAA"\), the following agp-formatted entries can be used to create a sequence my_scaff
:
my_scaff 1 6 W s1 1 6 +
my_scaff 7 9 N 3 scaffold yes na
my_scaff 10 12 W s2 1 3 -
my_scaff 13 15 N 3 scaffold yes na
my_scaff 16 17 W s3 1 2 +
this would represent a fasta-formatted sequence:
>my_scaff
ATAATANNNTTTNNNCC
where basically the sequence is composed by s1 as it is, followed by three N, followed by the reverse complement of s2, three N and the first two characters of s3.
The file small_seq.fasta
stores sequence information in .fasta format. A mock entry is the following:
>Chr01
AGGCCTAGGTCTTCCAGAGTCGCTTTTTCCAGCTCCAGACCGATCTCTTCAG
AGGCCAATCGCCAGTTTACCACATACACCCAGACCGATCTCTTCAG
where the first line is the identifier of the read and starts with a “>”. The sequence follows the line with the identifier and can be on multiple lines.
Implement the following python functions:
computeStats(filename, show_output = True)
: gets thefilename
of a .agp file as explained above, stores its content in a suitable data structure of your choice (hint: pandas might help here). Ifshow_output
isFalse
the function only returns the data structure. Otherwise, it counts (and prints) the total number of entries, the total number of scaffolds (hint: you can use DataFrame[column].unique()), total number of contigs (and their total size note that you might have to convert the c_start and c_end column to int with .astype(int)) and total number of gaps (and their total size). The function should also produce a box plot of the number of contigs per scaffold.
Note: The function should return the data structure containing all the data.
Calling:
fn = "data_reduced.agp"
scaffDF = computeStats(fn)
should give:
The file contains 7898 entries
... 1958 scaffolds
... 4928 contigs (tot. size: 873,456,804 bases)
... and 2970 gaps (tot. size: 1,485,000 bases)
printSequence(scaffInfo, scafID, sequenceFile)
: gets thescaffInfo
data structure created by computeStats, a scaffold identifierscaffInfo
and the filename of a fasta formatted filesequenceFile
and ifscafID
is present inscaffInfo
it prints a fasta-formatted string reporting the sequence of the scaffold built as discussed above.
Hint: you can use biophtyon to read the fasta file.
Calling:
scaffDF = computeStats("small.agp", show_output = False)
printSequence(scaffDF,"my_scaff","small_seq.fasta")
print("")
printSequence(scaffDF,"my_scaff2","small_seq.fasta")
print("")
printSequence(scaffDF,"my_other_scaff","small_seq.fasta")
print("")
printSequence(scaffDF,"scaffold3","small_seq.fasta")
should give:
>my_scaff
ATAATANNNTTTNNNCC
>my_scaff2
TATTTTTATATGTATGTAATNNNNNNNNNNTTTATATATA
Warning: scaffold my_other_scaff not present
>scaffold3
NNNNNNNNNNNNNNNNNNNNCCCCGGAGGTACCTCCGGGGCCCCGGAGGT
Available material for exams¶
The course notes can be used as the slides of the theory and practicals. Documentation on the libraries are here (check the midterm simulation page to download them):
You can download all of them here: archive
Download the data¶
Create a “qcbsciprolab-MT1-NAME-SURNAME-ID” folder on the desktop. Download the following data to the folder that you just created.
The big .agp file: data_reduced.agp, the small .agp file small.agp and the fasta file with the sequences: small_seq.fasta.
A possible solution¶
[1]:
"""Solution to exercise1"""
def find_missing(S):
if len(S) > 0:
m = max(S)
vals = [ 0 for i in range(m)]
for x in S:
if x < 0:
print("Warning {} is <0. Ignoring it.".format(x))
else:
vals[x-1] += 1
return [x+1 for x in range(m) if vals[x] == 0]
else:
print("Warning: list is empty. Returning None")
if __name__ == "__main__":
S = [1,9,7, 7, 4, 3, 3, 3]
S1 = list(range(10))
print(find_missing(S))
print(find_missing(S1))
print(find_missing([]))
S2 = [1, -72, 4, -3, -3, 3,10]
M = find_missing(S2)
print(M)
[2, 5, 6, 8]
[]
Warning: list is empty. Returning None
None
Warning -72 is <0. Ignoring it.
Warning -3 is <0. Ignoring it.
Warning -3 is <0. Ignoring it.
[2, 5, 6, 7, 8, 9]
[4]:
"""Solution exercise2"""
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from Bio import SeqIO
def computeStats(filename, show_output = True):
sInfo = pd.read_csv(filename, sep="\t", header = 0)
if show_output:
print("The file contains {} entries".format(sInfo.shape[0]))
scaffolds = sInfo["ScaffID"].unique()
print("... {} scaffolds".format(scaffolds.shape[0]))
contigs = sInfo[sInfo["type"] == "W"]
c_sizes = contigs["c_end"].astype(int) - contigs["c_start"].astype(int)
print("... {} contigs (tot. size: {:,} bases)".format(contigs.shape[0],
np.sum(c_sizes)))
gaps = sInfo[sInfo["type"] == "N"]
print("... and {} gaps (tot. size: {:,} bases)".format(gaps.shape[0],
np.sum(gaps["contig"].astype(int))))
cont_by_scaff = contigs.groupby("ScaffID").aggregate(pd.DataFrame.count)['contig']
#print(cont_by_scaff)
cont_by_scaff.plot(kind = 'box')
plt.ylabel("# contigs per scaffold")
plt.show()
return sInfo
def printSequence(scaffInfo, scafID, sequenceFile):
scaff = scaffInfo[scaffInfo["ScaffID"] == scafID]
hdr = ">" + scafID
seq = ""
if len(scaff) > 0:
seqDict = SeqIO.to_dict(SeqIO.parse(sequenceFile, "fasta"))
for entry in range(len(scaff)):
entry_data = scaff.iloc[entry]
if entry_data["type"] == "N":
seq += "N" * int(entry_data["contig"])
elif entry_data["type"] == "W":
ctg = entry_data["contig"]
if not ctg in seqDict:
print("Warning: contig {} not present. Exiting...".format(ctg))
break
else:
c_s = int(entry_data["c_start"])
c_e = int(entry_data["c_end"])
c_strand = entry_data["c_strand"]
t_s = seqDict[ctg][c_s -1: c_e ]
if c_strand == "-":
t_s = t_s.reverse_complement()
seq += t_s.seq
if len(seq) > 0:
print(hdr)
print(seq)
else:
print("Warning: scaffold {} not present".format(scafID))
if __name__ == "__main__":
fn = "MTsim/data_reduced.agp"
scaffDF = computeStats(fn)
scaffDF = computeStats("MTsim/small.agp", show_output = False)
printSequence(scaffDF,"my_scaff","MTsim/small_seq.fasta")
print("")
printSequence(scaffDF,"my_scaff2","MTsim/small_seq.fasta")
print("")
printSequence(scaffDF,"my_other_scaff","MTsim/small_seq.fasta")
print("")
printSequence(scaffDF,"scaffold3","MTsim/small_seq.fasta")
The file contains 7898 entries
... 1958 scaffolds
... 4928 contigs (tot. size: 873,456,804 bases)
... and 2970 gaps (tot. size: 1,485,000 bases)
>my_scaff
ATAATANNNTTTNNNCC
>my_scaff2
TATTTTTATATGTATGTAATNNNNNNNNNNTTTATATATA
Warning: scaffold my_other_scaff not present
>scaffold3
NNNNNNNNNNNNNNNNNNNNCCCCGGAGGTACCTCCGGGGCCCCGGAGGT