Scalable mutation & analysis for DNA/RNA/Protein
Store-first sequence editing (StringStore, ChunkedStore), deterministic structural mutations, and a clean FastAPI backend—optimized for correctness, scalability, and reproducibility.
Tip: start the server locally, then hit /docs for interactive endpoints.
Pipeline
Sequence → SequenceStructure → BaseStore (StringStore / ChunkedStore) → MutationEngine (invert/dup/translocate) → Analysis (GC, k-mers, ORFs) → FASTA + events.json + manifest.json + log
O(log k) edits
Deterministic (seed)
Headless-safe
Install
Requires Python 3.11+. Install dependencies from requirements.txt.
pip install -r requirements.txt
# macOS/Linux
export PYTHONPATH=./src
python src/SeqMorph_Main.py
# then open http://127.0.0.1:8000/docs
# Windows PowerShell
$env:PYTHONPATH = "$PWD\src"
python .\src\SeqMorph_Main.py
# then open http://127.0.0.1:8000/docs
Quick run & minimal smoke test
Add a short DNA sequence, then run structural mutations and view the report.
Add sequence
curl -X POST http://127.0.0.1:8000/sequence/add \
-H "Content-Type: application/json" \
-d '{"sequence":"ATGACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT"}'
Mutate & analyze
curl -X POST http://127.0.0.1:8000/mutate-and-analyze \
-H "Content-Type: application/json" \
-d '{
"accession_id": "seq_1",
"struct_rate": 3.0,
"mean_seg_len": 200,
"start": 1,
"seed": 123,
"save_outputs": true
}' | python -m json.tool
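The same two calls can be scripted from Python with only the standard library. This is a minimal sketch, assuming the server is running at the local address shown above and accepts the JSON payloads from the curl examples; the helper names `build_request`, `post_json`, and `run_smoke` are illustrative and not part of SeqMorph.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # assumed: local server address from the Install section


def build_request(path: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request without sending it."""
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_json(path: str, payload: dict) -> dict:
    """Send the request and decode the JSON response (needs a running server)."""
    with urllib.request.urlopen(build_request(path, payload)) as resp:
        return json.load(resp)


def run_smoke() -> dict:
    """Mirror the two curl calls: add a sequence, then mutate and analyze it."""
    post_json("/sequence/add", {"sequence": "ATGACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT"})
    return post_json(
        "/mutate-and-analyze",
        {
            "accession_id": "seq_1",
            "struct_rate": 3.0,
            "mean_seg_len": 200,
            "start": 1,
            "seed": 123,
            "save_outputs": True,
        },
    )
```

Once the server is up, `print(json.dumps(run_smoke(), indent=2))` prints the report in the same way as piping curl through `python -m json.tool`.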
Endpoints
Method | Path | Purpose | Notes |
---|---|---|---|
GET | /health | Liveness probe | Returns {"status":"ok"} |
POST | /sequence/add | Add raw sequence | Auto-detects type (DNA/RNA/Protein) |
POST | /sequence/fetch | Fetch by accession | NCBI/UniProt via SequenceFetcher |
POST | /mutate-and-analyze | Run structural mutations | Returns full event list + analysis report |
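The table notes that /sequence/add auto-detects the sequence type. How SeqMorph does this is not described here; the sketch below shows one common alphabet-based heuristic, purely as an assumption. The function name `detect_sequence_type` and the tie-breaking order (DNA, then RNA, then Protein) are hypothetical.

```python
DNA_ALPHABET = set("ACGTN")
RNA_ALPHABET = set("ACGUN")
PROTEIN_ALPHABET = set("ACDEFGHIKLMNPQRSTVWY*")  # 20 standard residues plus stop


def detect_sequence_type(sequence: str) -> str:
    """Classify a sequence by the smallest alphabet that contains all its symbols.

    Ambiguous inputs (for example "ACG", valid in every alphabet) resolve to
    the first match, so nucleotide types win over Protein.
    """
    symbols = set(sequence.upper())
    if not symbols:
        raise ValueError("empty sequence")
    if symbols <= DNA_ALPHABET:
        return "DNA"
    if symbols <= RNA_ALPHABET:
        return "RNA"
    if symbols <= PROTEIN_ALPHABET:
        return "Protein"
    raise ValueError(f"unrecognized symbols: {sorted(symbols - PROTEIN_ALPHABET)}")
```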
Analysis features
Each run produces a concise, headless-safe report comparing original vs. mutated sequences. Designed to scale to large inputs.
GC content & composition
Per-sequence GC%, base counts, deltas.
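As an illustration of what this part of the report measures, here is a minimal sketch of GC fraction, base counts, and the original-vs-mutated delta. The function names and the returned dictionary layout are assumptions for the example, not the project's actual API.

```python
from collections import Counter


def composition(seq: str) -> dict:
    """Return base counts and GC fraction for a nucleotide sequence."""
    counts = Counter(seq.upper())
    gc = counts["G"] + counts["C"]
    # Only unambiguous nucleotides enter the denominator; N and others are skipped.
    total = sum(counts[base] for base in "ACGTU")
    return {"counts": dict(counts), "gc": gc / total if total else 0.0}


def gc_delta(original: str, mutated: str) -> dict:
    """Compare GC fraction before and after mutation."""
    before = composition(original)["gc"]
    after = composition(mutated)["gc"]
    return {"original": before, "mutated": after, "delta": after - before}
```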
K-mer frequencies
Counts for k=1..6 (configurable); top Δ between original and mutated.
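The "top Δ" idea can be sketched as follows: count overlapping k-mers in both sequences, subtract, and keep the largest absolute changes. The function names and the output shape (modeled loosely on the `top_deltas` field of the example report) are assumptions.

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count overlapping k-mers; sequences shorter than k yield an empty Counter."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def top_kmer_deltas(original: str, mutated: str, k: int, n: int = 10) -> list:
    """Return up to n k-mers with the largest absolute count change."""
    before = kmer_counts(original, k)
    after = kmer_counts(mutated, k)
    deltas = {kmer: after[kmer] - before[kmer] for kmer in before.keys() | after.keys()}
    # Sort by magnitude first, then alphabetically so ties are deterministic.
    ranked = sorted(deltas.items(), key=lambda item: (-abs(item[1]), item[0]))
    return [{"kmer": kmer, "delta": delta} for kmer, delta in ranked[:n] if delta != 0]
```

Keeping only the top n differences bounds the report size even when the k-mer space itself is large.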
Codon usage & translation
DNA/RNA codon usage and protein translation summaries.
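A minimal sketch of codon usage and translation, assuming the standard genetic code and a single forward reading frame. RNA input is handled by mapping U to T before lookup; the function names and the use of "X" for unknown codons are choices made for this example only.

```python
from collections import Counter
from itertools import product

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codons(seq: str, frame: int = 0) -> list:
    """Split a sequence into complete codons starting at the given frame."""
    seq = seq.upper().replace("U", "T")
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]


def codon_usage(seq: str, frame: int = 0) -> Counter:
    """Count how often each codon occurs in the chosen frame."""
    return Counter(codons(seq, frame))


def translate(seq: str, frame: int = 0) -> str:
    """Translate codon by codon; '*' marks a stop, 'X' an unrecognized codon."""
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons(seq, frame))
```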
ORF scan
Start/stop detection; count + longest ORF.
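One simple way to produce a count and a longest ORF is a linear scan of the three forward reading frames, pairing each ATG with the next in-frame stop codon. The sketch below makes that assumption explicitly: it ignores the reverse strand and uses 0-based, end-exclusive coordinates, which may differ from what SeqMorph reports.

```python
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_len: int = 0) -> list:
    """Find ATG...stop ORFs in the three forward frames (stop codon included)."""
    seq = seq.upper().replace("U", "T")
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START:
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOPS:
                end = i + 3
                if end - start >= min_len:
                    orfs.append({"start": start, "end": end, "length": end - start})
                start = None  # look for the next ORF in this frame
    return orfs


def orf_summary(seq: str) -> dict:
    """Summarize ORFs as a count plus the longest hit (None if there are none)."""
    orfs = find_orfs(seq)
    longest = max(orfs, key=lambda orf: orf["length"], default=None)
    return {"count": len(orfs), "longest": longest}
```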
Mutation summary
Event counts (invert/dup/translocate), length change.
Entropy & complexity
Optional Shannon-entropy windows for structure/complexity shifts.
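Windowed Shannon entropy can be sketched as below: each window's entropy is computed from its symbol frequencies, in bits. The window and step sizes are arbitrary defaults chosen for the example, and the function names are assumptions.

```python
import math
from collections import Counter


def shannon_entropy(seq: str) -> float:
    """Entropy in bits per symbol; 0.0 for empty or single-symbol sequences."""
    if not seq:
        return 0.0
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in Counter(seq).values())


def windowed_entropy(seq: str, window: int = 100, step: int = 50) -> list:
    """Return (offset, entropy) pairs for sliding windows over the sequence.

    A sequence shorter than the window produces a single entry covering it all.
    """
    last_start = max(len(seq) - window + 1, 1)
    return [(i, shannon_entropy(seq[i:i + window])) for i in range(0, last_start, step)]
```

For DNA the value ranges from 0 (a homopolymer run) to 2 bits (all four bases equally frequent), so a drop in a window after mutation suggests a lower-complexity region such as a duplication.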
Statistical tests (opt-in)
Chi-square on selected k-mers and simple t-tests for GC% can be enabled for small/medium inputs.
For very large k-mer spaces, the report defaults to “top differences” for memory safety.
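To make the chi-square option concrete, here is a sketch of a Pearson chi-square homogeneity statistic over two k-mer count tables (original vs. mutated). It is an assumption about the kind of test meant, not the project's implementation, and it returns only the statistic and degrees of freedom; converting these to a p-value needs a chi-square survival function such as `scipy.stats.chi2.sf`.

```python
def chi_square_homogeneity(counts_a: dict, counts_b: dict) -> tuple:
    """Pearson chi-square for a 2 x m table of k-mer counts.

    Returns (statistic, degrees_of_freedom). K-mers absent from both tables
    are ignored.
    """
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    if total_a == 0 or total_b == 0:
        raise ValueError("both count tables must be non-empty")
    grand = total_a + total_b

    statistic = 0.0
    categories = 0
    for kmer in set(counts_a) | set(counts_b):
        a = counts_a.get(kmer, 0)
        b = counts_b.get(kmer, 0)
        column = a + b
        if column == 0:
            continue
        categories += 1
        # Expected counts if both sequences shared the same k-mer distribution.
        expected_a = column * total_a / grand
        expected_b = column * total_b / grand
        statistic += (a - expected_a) ** 2 / expected_a
        statistic += (b - expected_b) ** 2 / expected_b
    return statistic, categories - 1
```

The approximation is unreliable when expected counts are small, which is one reason to restrict such a test to selected k-mers.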
Example report (truncated)
{
"length": {"original": 40000, "mutated": 41234, "delta": 1234},
"gc": {"original": 0.49, "mutated": 0.50, "delta": 0.01},
"kmer": {
"k": 4,
"top_deltas": [{"kmer":"CGCG","delta": 42}, {"kmer":"ATGC","delta": -31}]
},
"codon_usage": {"AAA": 120, "AAC": 98, "...": "..."},
"orf_scan": {"count": 12, "longest": {"start": 1234, "end": 5678, "length": 1345}},
"events": {"invert": 3, "duplicate": 2, "translocate": 1}
}
Fields vary by sequence type and options; χ² is off by default for large k-mer sets.
Design highlights
- Store-first sequence model: callers work with a registry; backends implement BaseStore (get/set/insert/delete + invert/dup/translocate).
- Deterministic mutations: one RNG per engine with an explicit seed ensures reproducibility.
- Fast analysis: GC%, k-mers, entropy, ORFs/translation (RNA supported) without requiring a window system.
- Audit-friendly outputs: optional manifest + per-event logs + FASTA & events JSON.
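The first two highlights can be illustrated with a small sketch: an abstract store whose structural operations are built on a few primitives, and an engine that owns a single seeded RNG. The class names follow the ones mentioned above, but every method signature, the event dictionary layout, and the choice of plain reversal for inversion are assumptions; the real interfaces may differ.

```python
import random
from abc import ABC, abstractmethod


class BaseStore(ABC):
    """Hypothetical minimal store interface (0-based, end-exclusive ranges)."""

    @abstractmethod
    def get(self, start: int, end: int) -> str: ...

    @abstractmethod
    def insert(self, pos: int, text: str) -> None: ...

    @abstractmethod
    def delete(self, start: int, end: int) -> None: ...

    @abstractmethod
    def __len__(self) -> int: ...

    def invert(self, start: int, end: int) -> None:
        segment = self.get(start, end)
        self.delete(start, end)
        self.insert(start, segment[::-1])  # plain reversal, not reverse complement

    def duplicate(self, start: int, end: int) -> None:
        self.insert(end, self.get(start, end))  # tandem copy right after the segment

    def translocate(self, start: int, end: int, dest: int) -> None:
        segment = self.get(start, end)
        self.delete(start, end)
        self.insert(dest, segment)  # dest indexes the sequence after deletion


class StringStore(BaseStore):
    """Simplest backend: one Python string, O(n) per edit."""

    def __init__(self, seq: str) -> None:
        self._seq = seq

    def get(self, start: int, end: int) -> str:
        return self._seq[start:end]

    def insert(self, pos: int, text: str) -> None:
        self._seq = self._seq[:pos] + text + self._seq[pos:]

    def delete(self, start: int, end: int) -> None:
        self._seq = self._seq[:start] + self._seq[end:]

    def __len__(self) -> int:
        return len(self._seq)


class MutationEngine:
    """Applies random structural events using one RNG seeded at construction."""

    def __init__(self, seed: int) -> None:
        self._rng = random.Random(seed)

    def mutate(self, store: BaseStore, n_events: int, seg_len: int) -> list:
        events = []
        for _ in range(n_events):
            if len(store) < seg_len:
                raise ValueError("sequence shorter than segment length")
            kind = self._rng.choice(["invert", "duplicate", "translocate"])
            start = self._rng.randrange(len(store) - seg_len + 1)
            end = start + seg_len
            event = {"type": kind, "start": start, "end": end}
            if kind == "invert":
                store.invert(start, end)
            elif kind == "duplicate":
                store.duplicate(start, end)
            else:
                event["dest"] = self._rng.randrange(len(store) - seg_len + 1)
                store.translocate(start, end, event["dest"])
            events.append(event)
        return events
```

Because the engine never touches the global `random` state, two engines built with the same seed and given the same input produce the same event list and the same final sequence. A chunked backend could implement the same four primitives without changing the engine.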
DIY smoke test
Quick sanity check without extra files.
# macOS/Linux
SEQ=$(python - <<'PY'
import random; random.seed(1)
print(''.join(random.choice('ACGT') for _ in range(20000)))
PY
)
curl -s -X POST http://127.0.0.1:8000/sequence/add -H "Content-Type: application/json" -d "{\"sequence\":\"$SEQ\"}"
curl -s -X POST http://127.0.0.1:8000/mutate-and-analyze -H "Content-Type: application/json" -d '{"accession_id":"seq_1","struct_rate":3.0,"mean_seg_len":200,"start":1,"seed":123,"save_outputs":true}' | python -m json.tool
# Windows PowerShell
$seq = python -c "import random; random.seed(1); print(''.join(random.choice('ACGT') for _ in range(20000)))"
$body = @{ sequence = $seq } | ConvertTo-Json
Invoke-RestMethod -Method Post -Uri http://127.0.0.1:8000/sequence/add -ContentType "application/json" -Body $body
$body = @{ accession_id = "seq_1"; struct_rate = 3.0; mean_seg_len = 200; start = 1; seed = 123; save_outputs = $true } | ConvertTo-Json
Invoke-RestMethod -Method Post -Uri http://127.0.0.1:8000/mutate-and-analyze -ContentType "application/json" -Body $body | ConvertTo-Json -Depth 10