Scalable mutation & analysis for DNA/RNA/Protein

Store-first sequence editing (StringStore, ChunkedStore), deterministic structural mutations, and a clean FastAPI backend, optimized for correctness, scalability, and reproducibility.


Tip: start the server locally, then hit /docs for interactive endpoints.

Pipeline

Sequence → SequenceStructure → BaseStore (StringStore / ChunkedStore)
         → MutationEngine (invert/dup/translocate)
         → Analysis (GC, k-mers, ORFs)
         → FASTA + events.json + manifest.json + log

O(log k) edits

Deterministic (seed)

Headless-safe

Install

Requires Python 3.11+. Install dependencies from requirements.txt.

pip install -r requirements.txt

# macOS/Linux
export PYTHONPATH=./src
python src/SeqMorph_Main.py
# then open http://127.0.0.1:8000/docs

# Windows PowerShell
$env:PYTHONPATH = "$PWD\src"
python .\src\SeqMorph_Main.py
# then open http://127.0.0.1:8000/docs

Quick run & minimal smoke test

Add a short DNA sequence, then run structural mutations and view the report.

Add sequence

curl -X POST http://127.0.0.1:8000/sequence/add \
  -H "Content-Type: application/json" \
  -d '{"sequence":"ATGACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT"}'

Mutate & analyze

curl -X POST http://127.0.0.1:8000/mutate-and-analyze \
  -H "Content-Type: application/json" \
  -d '{
    "accession_id": "seq_1",
    "struct_rate": 3.0,
    "mean_seg_len": 200,
    "start": 1,
    "seed": 123,
    "save_outputs": true
  }' | python -m json.tool
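
The same two calls can also be made from Python with only the standard library. This is a minimal sketch, assuming the server is running at the local address used above; `build_request`, `post_json`, and `demo` are illustrative helper names, not part of the project.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # local server address from the install steps


def build_request(path: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request for an endpoint (hypothetical helper)."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_json(path: str, payload: dict) -> dict:
    """Send the request and decode the JSON reply (needs a running server)."""
    with urllib.request.urlopen(build_request(path, payload)) as resp:
        return json.load(resp)


def demo() -> None:
    """Mirror the two curl calls: add a sequence, then mutate and analyze."""
    post_json("/sequence/add",
              {"sequence": "ATGACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT"})
    report = post_json("/mutate-and-analyze", {
        "accession_id": "seq_1",
        "struct_rate": 3.0,
        "mean_seg_len": 200,
        "start": 1,
        "seed": 123,
        "save_outputs": True,
    })
    print(json.dumps(report, indent=2))
```

Call `demo()` once the server is up; `build_request` can be inspected without any network access.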

Endpoints

Method	Path	Purpose	Notes
GET	/health	Liveness probe	Returns `{"status":"ok"}`
POST	/sequence/add	Add raw sequence	Auto-detects type (DNA/RNA/Protein)
POST	/sequence/fetch	Fetch by accession	NCBI/UniProt via `SequenceFetcher`
POST	/mutate-and-analyze	Run structural mutations	Returns full event list + analysis report

Analysis features

Each run produces a concise, headless-safe report comparing original vs. mutated sequences. Designed to scale to large inputs.

GC content & composition

Per-sequence GC%, base counts, deltas.
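
As an illustration of what this part of the report measures, here is a minimal sketch of GC fraction, base counts, and deltas. The function names and the output layout are assumptions made for the example, not the project's actual API.

```python
from collections import Counter


def composition(seq: str) -> dict:
    """Return base counts and the GC fraction of a nucleotide string."""
    counts = Counter(seq.upper())
    total = sum(counts.values())
    gc = counts["G"] + counts["C"]
    return {"counts": dict(counts), "gc": gc / total if total else 0.0}


def composition_delta(original: str, mutated: str) -> dict:
    """Compare two sequences: change in GC fraction and per-base counts."""
    a, b = composition(original), composition(mutated)
    bases = sorted(set(a["counts"]) | set(b["counts"]))
    return {
        "gc_delta": b["gc"] - a["gc"],
        "count_deltas": {
            x: b["counts"].get(x, 0) - a["counts"].get(x, 0) for x in bases
        },
    }
```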

K-mer frequencies

Counts for k=1..6 (configurable); top Δ between original and mutated.
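
A minimal sketch of overlapping k-mer counting and a "top Δ" ranking, assuming deltas are ranked by absolute change in count. The helper names and the tie-breaking rule (alphabetical) are choices made for this example only.

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count overlapping k-mers; sequences shorter than k yield no k-mers."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def top_kmer_deltas(original: str, mutated: str, k: int, n: int = 5) -> list:
    """Return up to n k-mers with the largest absolute change in count."""
    a, b = kmer_counts(original, k), kmer_counts(mutated, k)
    deltas = {kmer: b[kmer] - a[kmer] for kmer in set(a) | set(b)}
    # Largest absolute change first; ties broken alphabetically.
    ranked = sorted(deltas.items(), key=lambda kv: (-abs(kv[1]), kv[0]))
    return [{"kmer": kmer, "delta": d} for kmer, d in ranked[:n] if d != 0]
```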

Codon usage & translation

DNA/RNA codon usage and protein translation summaries.
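
For illustration, codon usage and translation can be sketched with the standard genetic code. The function names are hypothetical, RNA is handled by mapping U to T, and unknown codons are rendered as "X"; the project's own handling may differ.

```python
from collections import Counter
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codons(seq: str, frame: int = 0) -> list:
    """Split a DNA/RNA string into complete codons from the given frame."""
    dna = seq.upper().replace("U", "T")
    return [dna[i:i + 3] for i in range(frame, len(dna) - 2, 3)]


def codon_usage(seq: str, frame: int = 0) -> Counter:
    """Count how often each codon occurs in the chosen frame."""
    return Counter(codons(seq, frame))


def translate(seq: str, frame: int = 0) -> str:
    """Translate to protein; '*' marks a stop, 'X' an unrecognized codon."""
    return "".join(CODON_TABLE.get(c, "X") for c in codons(seq, frame))
```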

ORF scan

Start/stop detection; count + longest ORF.
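
A minimal sketch of an ORF scan, assuming forward-strand frames only, ATG as the start codon, and TAA/TAG/TGA as stops. Coordinates here are 0-based with an exclusive end that includes the stop codon; these conventions and the function names are assumptions for the example.

```python
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 1) -> list:
    """Find ATG..stop ORFs in the three forward frames."""
    dna = seq.upper().replace("U", "T")
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOPS:
                # (i - start) // 3 is the number of coding codons before the stop
                if (i - start) // 3 >= min_codons:
                    orfs.append({"start": start, "end": i + 3,
                                 "length": i + 3 - start})
                start = None
    return orfs


def orf_summary(seq: str) -> dict:
    """Report the ORF count and the longest ORF (None if there are none)."""
    orfs = find_orfs(seq)
    longest = max(orfs, key=lambda o: o["length"], default=None)
    return {"count": len(orfs), "longest": longest}
```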

Mutation summary

Event counts (invert/dup/translocate), length change.
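
As a rough sketch, such a summary can be derived from an event list and the two sequence lengths. The event layout (a dict with a "type" key) is an assumption for this example; the actual events.json schema may differ.

```python
from collections import Counter


def mutation_summary(events: list, original_len: int, mutated_len: int) -> dict:
    """Tally events by type and report the net change in sequence length."""
    counts = Counter(e["type"] for e in events)  # "type" key is hypothetical
    return {
        "events": dict(counts),
        "length": {
            "original": original_len,
            "mutated": mutated_len,
            "delta": mutated_len - original_len,
        },
    }
```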

Entropy & complexity

Optional Shannon-entropy windows for structure/complexity shifts.
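
To make the idea concrete, here is a minimal sketch of Shannon entropy (in bits per symbol) over sliding windows. The window and step defaults are arbitrary values chosen for the example, and the function names are illustrative.

```python
import math
from collections import Counter


def shannon_entropy(seq: str) -> float:
    """Shannon entropy in bits per symbol; 0.0 for an empty string."""
    if not seq:
        return 0.0
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in Counter(seq).values())


def windowed_entropy(seq: str, window: int = 100, step: int = 50) -> list:
    """Return (window_start, entropy) pairs for each full window.

    A sequence shorter than the window is scored once as a whole.
    """
    last_start = max(len(seq) - window, 0)
    return [(i, shannon_entropy(seq[i:i + window]))
            for i in range(0, last_start + 1, step)]
```

A uniform run such as "AAAA" scores 0 bits, while an even mix of four bases scores 2 bits, so a drop in a window's value points to a low-complexity region.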

Statistical tests (opt-in)

Chi-square on selected k-mers and simple t-tests for GC% can be enabled for small/medium inputs. For very large k-mer spaces, the report defaults to “top differences” for memory safety.
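
One way to picture the chi-square test is as a Pearson test of homogeneity on a 2 x m table (original vs. mutated counts for m selected k-mers). The sketch below computes only the statistic and degrees of freedom; converting to a p-value would need a chi-square distribution function, which is not shown. Whether the project uses exactly this formulation is an assumption.

```python
def chi_square_kmers(orig_counts: dict, mut_counts: dict, kmers: list) -> tuple:
    """Return (statistic, degrees_of_freedom) for a 2 x m contingency table."""
    rows = [
        [orig_counts.get(k, 0) for k in kmers],
        [mut_counts.get(k, 0) for k in kmers],
    ]
    row_totals = [sum(r) for r in rows]
    col_totals = [a + b for a, b in zip(*rows)]
    grand = sum(row_totals)
    if grand == 0:
        return 0.0, 0
    stat = 0.0
    for row, rt in zip(rows, row_totals):
        for observed, ct in zip(row, col_totals):
            expected = rt * ct / grand
            if expected > 0:  # skip k-mers absent from both sequences
                stat += (observed - expected) ** 2 / expected
    used = sum(1 for ct in col_totals if ct > 0)
    # (rows - 1) * (cols - 1), with rows = 2
    return stat, max(used - 1, 0)
```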

Example report (truncated)

{
  "length": {"original": 40000, "mutated": 41234, "delta": 1234},
  "gc": {"original": 0.49, "mutated": 0.50, "delta": 0.01},
  "kmer": {
    "k": 4,
    "top_deltas": [{"kmer":"CGCG","delta": 42}, {"kmer":"ATGC","delta": -31}]
  },
  "codon_usage": {"AAA": 120, "AAC": 98, "...": "..."},
  "orf_scan": {"count": 12, "longest": {"start": 1234, "end": 5678, "length": 1345}},
  "events": {"invert": 3, "duplicate": 2, "translocate": 1}
}

Fields vary by sequence type and options; χ² is off by default for large k-mer sets.

Design highlights

Store-first sequence model: callers work with a registry; backends implement BaseStore (get/set/insert/delete + invert/dup/translocate).
Deterministic mutations: one RNG per engine with an explicit seed ensures reproducibility.
Fast analysis: GC%, k-mers, entropy, ORFs/translation (RNA supported); runs headless, with no display or GUI required.
Audit-friendly outputs: optional manifest + per-event logs + FASTA & events JSON.
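
The store-first model and seeded mutations can be sketched as follows. This is a simplified illustration, not the project's code: the method signatures, the event layout, and the mutation-sampling scheme are all assumptions, `set` is omitted for brevity, and inversion is shown as plain reversal (a real engine may reverse-complement instead).

```python
import random
from abc import ABC, abstractmethod


class BaseStore(ABC):
    """Hypothetical minimal store interface; the real BaseStore may differ."""

    @abstractmethod
    def get(self, start: int, end: int) -> str: ...

    @abstractmethod
    def insert(self, pos: int, text: str) -> None: ...

    @abstractmethod
    def delete(self, start: int, end: int) -> None: ...

    @abstractmethod
    def __len__(self) -> int: ...

    # Structural operations built from the primitives above.
    def invert(self, start: int, end: int) -> None:
        seg = self.get(start, end)
        self.delete(start, end)
        self.insert(start, seg[::-1])  # plain reversal (assumption)

    def duplicate(self, start: int, end: int) -> None:
        self.insert(end, self.get(start, end))  # tandem copy

    def translocate(self, start: int, end: int, dest: int) -> None:
        """Move [start, end) to index dest of the remaining sequence."""
        seg = self.get(start, end)
        self.delete(start, end)
        self.insert(dest, seg)


class StringStore(BaseStore):
    """Simplest backend: one Python string, O(n) per edit."""

    def __init__(self, seq: str) -> None:
        self._seq = seq

    def get(self, start: int, end: int) -> str:
        return self._seq[start:end]

    def insert(self, pos: int, text: str) -> None:
        self._seq = self._seq[:pos] + text + self._seq[pos:]

    def delete(self, start: int, end: int) -> None:
        self._seq = self._seq[:start] + self._seq[end:]

    def __len__(self) -> int:
        return len(self._seq)


class MutationEngine:
    """One private RNG per engine, so a given seed replays the same events."""

    def __init__(self, seed: int) -> None:
        self.rng = random.Random(seed)

    def mutate(self, store: BaseStore, n_events: int, seg_len: int) -> list:
        events = []
        for _ in range(n_events):
            kind = self.rng.choice(["invert", "duplicate", "translocate"])
            start = self.rng.randrange(0, len(store) - seg_len + 1)
            end = start + seg_len
            if kind == "invert":
                store.invert(start, end)
            elif kind == "duplicate":
                store.duplicate(start, end)
            else:
                dest = self.rng.randrange(0, len(store) - seg_len + 1)
                store.translocate(start, end, dest)
            events.append({"type": kind, "start": start, "end": end})
        return events
```

Because the engine owns its `random.Random` instance instead of using the module-level RNG, two engines created with the same seed produce identical event lists and identical output sequences for the same input.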


DIY smoke test

Quick sanity check without extra files.

# macOS/Linux
SEQ=$(python - <<'PY'
import random; random.seed(1)
print(''.join(random.choice('ACGT') for _ in range(20000)))
PY
)
curl -s -X POST http://127.0.0.1:8000/sequence/add -H "Content-Type: application/json" -d "{\"sequence\":\"$SEQ\"}"
curl -s -X POST http://127.0.0.1:8000/mutate-and-analyze -H "Content-Type: application/json" -d '{"accession_id":"seq_1","struct_rate":3.0,"mean_seg_len":200,"start":1,"seed":123,"save_outputs":true}' | python -m json.tool

# Windows PowerShell (no bash heredocs; build JSON with ConvertTo-Json)
$seq = python -c "import random; random.seed(1); print(''.join(random.choice('ACGT') for _ in range(20000)))"
$add = @{ sequence = $seq } | ConvertTo-Json
Invoke-RestMethod -Method Post -Uri http://127.0.0.1:8000/sequence/add -ContentType "application/json" -Body $add
$run = @{ accession_id = "seq_1"; struct_rate = 3.0; mean_seg_len = 200; start = 1; seed = 123; save_outputs = $true } | ConvertTo-Json
Invoke-RestMethod -Method Post -Uri http://127.0.0.1:8000/mutate-and-analyze -ContentType "application/json" -Body $run | ConvertTo-Json -Depth 10