illumina_fastq.illumina_fastq_parse
-
class
illumina_fastq.illumina_fastq_parse.FastqParse(fastq, log=<open file '<stdout>', mode 'w'>, extract_barcodes=[], sample_size=False)
Parses the records in an Illumina FASTQ file and stores either all records or only those having specific barcodes. The sequence ID, sequence, and quality string of each FASTQ record are stored in a list of lists of the form
[ ["seqidA", "ACGT", "#AAF"], ["seqidB", "GGAT", "#AAA"] … ]
This list of lists is stored as self.data. A lookup table (dict) is also stored as self.lookup. It is of the form
{ "seqidA": indexA, "seqidB": indexB, … }
where an index gives the position in self.data of the record with the given sequence ID. The sequence ID is stored as the entire title line of a FASTQ record, minus any leading or trailing whitespace.
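The self.data and self.lookup layout described above can be illustrated with a minimal standalone sketch. It does not use FastqParse and is not taken from the library's source; the function name parse_fastq and the two example records are hypothetical, and the sketch assumes plain four-line FASTQ records.

```python
import io


def parse_fastq(handle):
    """Return (data, lookup) for the FASTQ records read from an open text handle.

    Hypothetical helper, not part of illumina_fastq.
    """
    data = []    # list of [seqid, sequence, quality] lists
    lookup = {}  # seqid -> position of that record in data
    while True:
        title = handle.readline()
        if not title:
            break  # end of file
        title = title.strip()
        if not title:
            continue  # skip blank lines between records
        seq = handle.readline().strip()
        handle.readline()  # '+' separator line, not stored
        qual = handle.readline().strip()
        lookup[title] = len(data)
        data.append([title, seq, qual])
    return data, lookup


# Two made-up records, for illustration only.
example = io.StringIO(
    "@seqidA 1:N:0:ATCGGT\nACGT\n+\n#AAF\n"
    "@seqidB 1:N:0:ATCGGT\nGGAT\n+\n#AAA\n"
)
data, lookup = parse_fastq(example)
record = data[lookup["@seqidB 1:N:0:ATCGGT"]]
```

Keeping the index in a dict means a record can be found by its title line without scanning the whole list.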
Also supports indexing the returned instance object using the header line of a given sequence. For example, if @GADGET:77:HFNLTBBXX:8:1101:30462:1279 1:N:0:NNAGCA is the read ID of a record that is present in a FASTQ file named reads.fq, then the following returns True:
data = FastqParse("reads.fq")
data["@GADGET:77:HFNLTBBXX:8:1101:30462:1279 1:N:0:NNAGCA"]  # returns True
Parameters:
- fastq – str. Path to the FASTQ file to be parsed. Accepts an uncompressed file or a gzip-compressed file with a .gz extension.
- log – A file handle for logging. Defaults to STDOUT.
- extract_barcodes – list of one or more barcodes to extract from the FASTQ file. If a barcode is dual-indexed, separate the two indexes with a '+', e.g. 'ATCGGT+GCAGCT', as this is how it is written in the FASTQ file.
- sample_size – int. Indicates the number of records from the start of the FASTQ file to parse. A falsy value (the default) means that the entire FASTQ file will be parsed.
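The effect of the extract_barcodes and sample_size parameters can be sketched as follows. This is a standalone illustration under stated assumptions, not the library's implementation: the helper names (open_fastq, barcode_of, iter_records, select_records) are hypothetical, the barcode is assumed to be the last colon-separated field of the title line, and sample_size is assumed to limit the records read before the barcode filter is applied.

```python
import gzip
import io
from itertools import islice


def open_fastq(path):
    """Open a FASTQ path as text, using gzip when the name ends in .gz."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def barcode_of(title):
    """Return the last colon-separated field of a title line (assumed barcode)."""
    return title.strip().rsplit(":", 1)[-1]


def iter_records(handle):
    """Yield [seqid, sequence, quality] for each four-line record."""
    while True:
        lines = [handle.readline() for _ in range(4)]
        if not lines[0]:
            return  # end of file
        yield [lines[0].strip(), lines[1].strip(), lines[3].strip()]


def select_records(handle, extract_barcodes=(), sample_size=False):
    """Keep records among the first sample_size whose barcode is wanted."""
    records = iter_records(handle)
    if sample_size:
        records = islice(records, sample_size)  # only the first N records
    wanted = set(extract_barcodes)
    return [r for r in records if not wanted or barcode_of(r[0]) in wanted]


# Three made-up records, for illustration only.
example = (
    "@r1 1:N:0:ATCGGT+GCAGCT\nACGT\n+\n#AAF\n"
    "@r2 1:N:0:AAAAAA+CCCCCC\nGGAT\n+\n#AAA\n"
    "@r3 1:N:0:ATCGGT+GCAGCT\nTTGA\n+\nFFFF\n"
)
kept = select_records(io.StringIO(example), extract_barcodes=["ATCGGT+GCAGCT"])
sampled = select_records(
    io.StringIO(example), extract_barcodes=["ATCGGT+GCAGCT"], sample_size=2
)
```

With these assumptions, the dual-indexed barcode is matched as one string including the '+', which is why the documentation asks for it to be written that way.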
-
classmethod
getPairedendReadId(read_id)
Given either a forward read or reverse read identifier, returns the corresponding paired-end read identifier.
Parameters: read_id – str. The forward read or reverse read identifier. This should be the entire title line of a FASTQ record, minus any trailing whitespace.
Returns: The paired-end read identifier (title line).
Return type: str
Example
Setting read_id to "@COOPER:74:HFTH3BBXX:3:1101:29894:1033 1:N:0:NATGAATC+NGATCTCG" will return "@COOPER:74:HFTH3BBXX:3:1101:29894:1033 2:N:0:NATGAATC+NGATCTCG".
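The mapping in the example above amounts to swapping the read number (1 or 2) that begins the second, space-separated part of the title line. A minimal standalone sketch of that idea follows; the function paired_end_read_id is hypothetical and is not the library's getPairedendReadId, and it assumes the title line has the two-part layout shown in the example.

```python
def paired_end_read_id(read_id):
    """Return the mate's title line by swapping read number 1 and 2.

    Hypothetical helper; assumes a title line of the form
    '@<instrument fields> <read number>:<filter flag>:<control>:<barcode>'.
    """
    name, space, description = read_id.strip().partition(" ")
    read_number, colon, rest = description.partition(":")
    mate = {"1": "2", "2": "1"}
    if not space or read_number not in mate:
        raise ValueError("Unrecognized read identifier: %r" % read_id)
    return name + " " + mate[read_number] + colon + rest


forward = "@COOPER:74:HFTH3BBXX:3:1101:29894:1033 1:N:0:NATGAATC+NGATCTCG"
reverse = paired_end_read_id(forward)
```

Applying the function twice returns the original identifier, so the same helper covers both the forward-to-reverse and reverse-to-forward cases.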