In the last post, I identified a magic number, 36, that gives the maximum number of combinations with 25 amino acids.
Now I'm going to explore how the 36-unit K-mers generated from each protein sequence can be used to differentiate Gene Ontology (GO) groups.
To do so, I'm going to create a graph that captures the following relationship:
GO group => Protein => 36-unit K-mers
This will help me count the number of recurring K-mers in each GO term and then compare the overlapping K-mers between different GO groups.
And I have the following Python snippets to analyse the data.
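As a rough illustration only (this is a sketch, not the original snippets), the GO group => Protein => K-mer relationship can be held in nested dictionaries. The names `kmers`, `build_graph`, `recurring_kmers`, `go_to_proteins` and `sequences` are hypothetical, and "recurring" is assumed here to mean a K-mer that appears in more than one protein of the same GO group.

```python
from collections import defaultdict

K = 36  # K-mer length used in this post


def kmers(sequence, k=K):
    """Return every overlapping k-length substring of a protein sequence."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def build_graph(go_to_proteins, sequences, k=K):
    """Map GO group -> K-mer -> set of protein IDs that contain the K-mer.

    go_to_proteins: dict of GO ID -> iterable of protein IDs (assumed input)
    sequences:      dict of protein ID -> amino acid string (assumed input)
    """
    graph = defaultdict(lambda: defaultdict(set))
    for go_id, protein_ids in go_to_proteins.items():
        for pid in protein_ids:
            for kmer in kmers(sequences[pid], k):
                graph[go_id][kmer].add(pid)
    return graph


def recurring_kmers(graph, go_id):
    """Return {K-mer: protein count} for K-mers seen in more than one protein."""
    return {
        kmer: len(pids)
        for kmer, pids in graph[go_id].items()
        if len(pids) > 1
    }
```

Storing a set of protein IDs per K-mer means a K-mer repeated inside a single protein is counted once, which matches the assumption above; counting raw occurrences instead would need a counter rather than a set.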
I've extracted a few GO terms with a moderate number of occurrences, that is, between 500 and 600 different protein sequences found in each GO term.
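A minimal sketch of that selection step, assuming the annotations are already loaded into a dictionary of GO ID to protein IDs (the name `select_go_terms` and the inclusive bounds are assumptions, not taken from the original analysis):

```python
def select_go_terms(go_to_proteins, low=500, high=600):
    """Keep GO terms whose number of distinct proteins lies in [low, high]."""
    return {
        go_id: proteins
        for go_id, proteins in go_to_proteins.items()
        if low <= len(set(proteins)) <= high
    }
```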
And the following table summarises the nature of these GO groups.
I'll use the first three GO groups to demonstrate what these Python snippets do and produce.
This second GO group comes from cellular component, which is different from the first GO group, which comes from biological process.
Honestly, I'm surprised that there is no overlap in motifs at all, even though the two GO groups belong to different classes. On the other hand, this is not unbelievable, because the entire motif space spans over 50 million different motifs. Given that the number of motifs that occurred more than once is in the range of thousands to tens of thousands, it's not easy to find overlapping motif sequences when a GO group has only 500–600 proteins. That might be different when we look at GO terms with a higher number of proteins.
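The overlap check itself comes down to a set intersection. A minimal sketch, assuming each GO group's recurring motifs have already been collected into a set (the function name `pairwise_overlap` and the input layout are hypothetical):

```python
from itertools import combinations


def pairwise_overlap(recurring_by_go):
    """Return {(go_a, go_b): shared motifs} for every pair of GO groups.

    recurring_by_go: dict of GO ID -> set of recurring motif strings (assumed input)
    """
    return {
        (a, b): recurring_by_go[a] & recurring_by_go[b]
        for a, b in combinations(sorted(recurring_by_go), 2)
    }
```

An empty set for a pair means the two groups share no recurring motif, which is the result reported here for the groups examined.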
The following is another GO term from biological process. It's worth noting that the distribution has a 'fat' tail and the range of degrees is exceptionally wide. That might be an interesting area to explore.
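One way to tabulate such a distribution is sketched below, under the assumption that a motif's degree is the number of distinct proteins it is linked to within the GO group; `degree_distribution` and its input layout are hypothetical names for illustration.

```python
from collections import Counter


def degree_distribution(motif_to_proteins):
    """Return a Counter of {degree: number of motifs with that degree}.

    motif_to_proteins: dict of motif -> set of protein IDs (assumed input)
    """
    return Counter(len(proteins) for proteins in motif_to_proteins.values())
```

A fat tail would show up as a small number of motifs with degrees far above the typical value.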
Again, there is no overlap in motifs, even though both GO groups come from biological process.
With this preliminary analysis, it's interesting to find that different GO groups can have non-overlapping motifs, which could be used to distinguish one group from another. I'm not saying this is a sure thing, as I just not-so-randomly picked a subset of GO groups to analyse. The verdict is pending an exhaustive analysis of all GO groups. Nevertheless, this seems to be an interesting twist, and I'd raise the following hypotheses:
- A small group of recurring motifs within a GO group is likely to set the tone of those proteins. It acts like a 'backbone' that defines the generic nature of those proteins.
- Motifs with a single occurrence within a GO group would provide specificity by contributing to a protein's unique function.
- How these motifs relate to the tertiary/quaternary structure, and to how the protein interacts with substrates, is an area for further exploration.
I haven't figured out what to do next, but I have a few ideas that I'd like to try out. Any input is welcome!
So keep tuned!