TITLE: Large Language Models as Learning Tools for Electron Ionization and Spectral Analysis in Organic Chemistry

ABSTRACT: The objective of this project is to train artificial intelligence models on existing electron ionization (EI) mass spectral datasets to recognize fragmentation patterns of organic molecules. By focusing on compounds with molecular masses below 500 g/mol for higher precision, the model will first learn from curated databases. A training "playground" was created in which the AI model was trained on the mass spectral fragmentation of acetone from three sources. The model was then evaluated on its ability to recognize acetone as compared to methyl acetate. It produced comparison scores for the two unknowns: an average similarity of 97.5% for acetone and 88.7% for methyl acetate. The model will be trained on additional molecular fragmentation patterns in the future and will be tested with molecules that were excluded from training but have known EI spectra. This set of experiments will assess whether the model can correctly predict fragmentation patterns. Success would demonstrate the capacity of AI not only to replicate established MS fragmentation but also to model fragmentation patterns for organic chemistry students. Ultimately, this approach could expand current knowledge of organic fragmentation for molecules not previously analyzed with EI-MS, improve spectral prediction, and provide researchers with a new tool to simulate and investigate compounds of interest for scientific discovery.

AUTHOR: David T. Burton
ADVISOR: Dr. Leslie A. Hiatt
INSTITUTION: Austin Peay State University
DATE PRODUCED: 11/18/2025

The following is a guide for replicating this experiment. Everything below the cut line was generated with a paid ChatGPT Plus subscription. The author intends for this guide to be open for academic use. Please use it freely and send any feedback to David Burton at dburton3@students.apsu.edu.

------------------------CUT LINE---beginning of instructions---CUT LINE-----------------------------

Trial 3: ms-simple EI Mass Spectral Matcher — Replication Guide

A. Overview

This guide provides:
- ONE MASTER PROMPT for a VS Code AI Agent (GPT 5.1 CODEX or similar)
- Step-by-step instructions so students and researchers can rebuild Trial 3: a local EI mass spectral library search tool with:
  * a CLI matcher
  * MSP → CSV conversion
  * a real-data pipeline (toluene)
  * HTML reporting
  * a Streamlit UI with demo data

All runs are offline and reproducible.

B. Prerequisites

Before you start:
- Python 3.10 or 3.11 installed
- VS Code installed
- Git installed
- Access to the VS Code AI Agent (GPT 5.1 CODEX or equivalent)
- Basic command-line comfort (PowerShell, Terminal, etc.)

C. MASTER PROMPT (for VS Code Agent)

Paste this entire block into the VS Code Agent in an empty repo.

You are an expert Python developer and mass spectrometry tooling engineer. Your goal is to build a small, offline EI mass spectral matcher called "ms-simple" from scratch in this empty repository. It must implement a local library search engine for electron ionization (EI) mass spectra, with a CLI, an HTML report, and a simple Streamlit UI. Do not use the network; everything runs locally.

=====================================
1) Core library matching
=====================================

Create the following modules:
- ms_similarity.py
- ms_formats.py
- ms_match.py
- tests/ (with pytest tests)
- data/ (for example spectra)
- demo/ (for demo bundles)
- pyproject.toml
- README.md

Implement in ms_similarity.py:
- weighted_cosine(query_vec, lib_vec, max_intensity)
  - Inputs: two dense dicts {int m/z: float intensity}, and a max intensity value.
  - Use weighted cosine similarity where the weights are sqrt(m/z):
      score = (sum_i w_i * A_i * B_i) / (sqrt(sum_i w_i * A_i^2) * sqrt(sum_i w_i * B_i^2)),  where w_i = sqrt(m/z_i)
  - Return a float between 0 and 1.

Implement in ms_formats.py:
- read_csv_spectrum(path) -> list[(mz: float, intensity: float)]
  - Read a CSV with header "m_over_z,intensity".
  - Ignore non-positive intensities.
- preprocess_to_dense(peaks, normalize: bool, noise_floor_pct: float, mz_tol: float) -> (vec: dict[int, float], max_intensity: float)
  - peaks: list of (mz, intensity).
  - Bin m/z values to integer bins with tolerance ±mz_tol.
  - If normalize is True:
    * scale intensities so the base peak becomes 1000.
  - Apply a noise floor AFTER normalization:
    * set intensities below noise_floor_pct % of the base peak to zero.
  - Return:
    * vec: dense dict {int m/z: float intensity}
    * max_intensity: the base peak intensity before the noise floor.
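Reference sketch (illustrative only; not the required implementation). It shows one way the weighted cosine above can be computed once spectra are dense dicts. The real weighted_cosine also receives max_intensity as specified, and you may structure the code differently.

# Illustrative sketch of the weighted cosine above (not the required code).
# Spectra are dense dicts {int m/z: float intensity}; weights are sqrt(m/z).
import math

def weighted_cosine_sketch(query_vec, lib_vec):
    mzs = set(query_vec) | set(lib_vec)
    num = 0.0
    q_norm = 0.0
    l_norm = 0.0
    for mz in mzs:
        w = math.sqrt(mz)
        a = query_vec.get(mz, 0.0)
        b = lib_vec.get(mz, 0.0)
        num += w * a * b
        q_norm += w * a * a
        l_norm += w * b * b
    if q_norm == 0.0 or l_norm == 0.0:
        return 0.0
    return num / (math.sqrt(q_norm) * math.sqrt(l_norm))

# Two toy spectra with nearly identical peak patterns score close to 1.0.
print(round(weighted_cosine_sketch({43: 1000.0, 58: 270.0},
                                   {43: 1000.0, 58: 250.0}), 3))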
=====================================
2) CLI: ms_match.py
=====================================

Implement a CLI interface in ms_match.py with arguments:
- --query PATH (required for matching): CSV query spectrum.
- --library PATH (required for matching): directory containing library CSV files.
- --top N (default 10)
- --normalize (flag)
- --noise_floor FLOAT (default 1.0; percent of base peak)
- --mz_tol FLOAT (default 0.49)
- --unique (flag): collapse duplicate spectra using a vector hash
- --json (flag): output JSON instead of plain text
- --convert_msp INPUT.msp --out OUTPUT.csv (convert the first MSP entry to CSV; used without --query/--library)

Behavior:
- If --convert_msp is given:
  * Read INPUT.msp.
  * Parse the first MSP entry and write its peaks to OUTPUT.csv as m_over_z,intensity.
  * Exit after conversion.
- Otherwise:
  * Read the query CSV via read_csv_spectrum.
  * Read all *.csv library spectra in the given directory.
  * Use preprocess_to_dense on the query and each library spectrum, with the given normalize, noise_floor_pct, mz_tol.
  * Compute weighted_cosine for each library spectrum.
  * Sort results by score descending, then filename ascending.
  * If --unique is set:
    - Compute a normalized vector hash:
      * round nonzero intensities to ~6 decimals,
      * sort by m/z,
      * hash as a string,
      * keep only one entry per hash.
  * Print the top N results.

Output formats:
- If --json is set: print a JSON object:
  { "results": [ {"score": float, "file": "name.csv"}, ... ] }
- Otherwise: print "score, filename" per line.

=====================================
3) MSP parsing and golden query
=====================================

In ms_formats.py:
- parse_msp_entries(text) -> list[dict]
  * Parse multi-entry MSP text.
  * Each dict should hold:
    - "name" and "formula" if present.
    - "peaks": list[(mz: float, intensity: float)].

In ms_match.py:
- Implement --convert_msp as described above.
- Later we may extend it, but keep it simple and robust.

Create a small helper script make_query_from_lib.py:
- Takes an existing library CSV (e.g., toluene.csv).
- Adds small random jitter to intensities (with a fixed random seed).
- Writes query_toluene_golden.csv for testing.
- This simulates a "real" noisy query that should still match toluene.
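Reference sketch (illustrative only) of the jitter step just described. The 2% jitter and the seed value 42 are arbitrary example choices, and the file paths assume the layout in section 4 below; your make_query_from_lib.py may differ.

# Illustrative sketch of make_query_from_lib.py (not the required code).
# Assumes a library CSV with header "m_over_z,intensity"; paths are examples.
import csv
import random

def make_golden_query(lib_csv, out_csv, jitter_pct=2.0, seed=42):
    random.seed(seed)  # fixed seed so the "golden" query is reproducible
    rows = []
    with open(lib_csv, newline="") as fh:
        for row in csv.DictReader(fh):
            mz = float(row["m_over_z"])
            intensity = float(row["intensity"])
            # small multiplicative jitter of +/- jitter_pct percent
            factor = 1.0 + random.uniform(-jitter_pct, jitter_pct) / 100.0
            rows.append((mz, max(intensity * factor, 0.0)))
    with open(out_csv, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["m_over_z", "intensity"])
        writer.writerows(rows)

if __name__ == "__main__":
    make_golden_query("data/library/toluene.csv",
                      "data/queries/query_toluene_golden.csv")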
=====================================
4) Real data pipeline (toluene)
=====================================

Create example data:
- data/library/lib_A.csv
- data/library/lib_B.csv
- data/library/toluene.csv
- data/queries/query_toluene_golden.csv (generated by make_query_from_lib.py with a fixed seed)

Implement test_realdata_pipeline.py:
- Test that:
  * When matching query_toluene_golden.csv vs data/library with normalize=True, noise_floor ~1%, mz_tol ~0.49, unique=True, toluene.csv is the top hit with similarity > 0.9.
- Also include a test test_top1_is_canonical_on_tie():
  * Create a duplicate of toluene.csv so there is a tie.
  * Ensure tie-breaking uses the filename to pick a canonical file deterministically.

=====================================
5) HTML report and console script
=====================================

In ms_match.py:
- Implement write_report(path, query_path, results, query_peaks, top1_peaks):
  * Create a small HTML file with:
    - Title: "Top matches"
    - Run timestamp
    - Table of top N results (Rank, Score, Filename)
    - Two tables for peaks:
      + Query: first 20 peaks (m/z, intensity)
      + Top-1: first 20 peaks
- Add a CLI flag --report PATH:
  * After computing results, if --report is set and results are not empty, call write_report and print "Wrote report: PATH".

Expose cli() in ms_match.py with a main guard:
- def cli(): main()
- if __name__ == "__main__": cli()

In pyproject.toml:
- Set up `[project]` metadata.
- Add a console script:
    [project.scripts]
    ms-simple = "ms_simple.ms_match:cli"

=====================================
6) Streamlit app (app.py)
=====================================

Create app.py that uses Streamlit for a single-page UI.

Functionality:
- Upload a query CSV file (m_over_z,intensity).
- Upload a library ZIP (flat root; one or more CSV spectra).
- Sidebar options:
  * Normalize (checkbox; default True)
  * Noise floor (%) slider from 0.0 to 5.0 (default 1.0)
  * m/z tolerance slider from 0.10 to 1.00 (default 0.49)
  * Unique (collapse duplicates) checkbox (default True)
  * Top N slider from 1 to 50 (default 10)
- On "Run match":
  * Unzip the library into a temp folder.
  * For each CSV file, use read_csv_spectrum and preprocess_to_dense.
  * Use weighted_cosine and the same logic as the CLI to compute scores.
  * Sort, apply unique if requested, and display:
    - A table of Top N matches (Rank, Score, Filename).
    - The first 20 processed peaks for the query and the top-1 hit.
- "Generate report" button:
  * Call write_report(...) to generate an HTML report in a temp location.
  * Provide a Streamlit download_button so the user can save report.html.

=====================================
7) Demo data and README
=====================================

Create demo data:
- demo/library_demo.zip containing:
  - toluene.csv
  - lib_A.csv
  - lib_B.csv
- demo/query_toluene_golden.csv

Update README.md with a "Quick Demo" section:
- pip install -e ".[ui]"
- streamlit run app.py
- In the Streamlit app:
  - Upload demo/query_toluene_golden.csv as the Query.
  - Upload demo/library_demo.zip as the Library.
  - Click "Run match", then "Generate report".

=====================================
8) Tests and quality
=====================================

- Configure pyproject.toml to use pytest.
- Add tests:
  - test_similarity.py for weighted_cosine.
  - test_parsers.py for CSV/MSP parsing.
  - test_realdata_pipeline.py for the toluene pipeline.
  - test_report_smoke.py to ensure --report creates valid HTML.
  - test_app_import.py to ensure app.py is importable.

All tests must pass:
- pytest -q

Finally:
- Run `pip install -e ".[ui]"`.
- Run `pytest -q`.
- Show a brief summary of:
  - Files created.
  - Test results.
  - Example CLI and Streamlit commands.
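End of MASTER PROMPT.

Before moving on to the replication steps, here is a hand-written sketch of the kind of check the tests in prompt section 8 aim to capture. It assumes the weighted_cosine(query_vec, lib_vec, max_intensity) signature from section 1 and a top-level ms_similarity module; the agent-generated test_similarity.py will differ in detail, including the import path if the code ends up inside an ms_simple package.

# Sketch of the kind of check test_similarity.py might contain (illustrative only).
from ms_similarity import weighted_cosine

def test_identical_spectra_score_near_one():
    # Identical spectra should give a weighted cosine at (or very near) 1.0.
    spec = {43: 1000.0, 58: 270.0, 15: 150.0}
    score = weighted_cosine(spec, dict(spec), max_intensity=1000.0)
    assert 0.99 <= score <= 1.0

def test_disjoint_spectra_score_near_zero():
    # Spectra with no shared m/z values have a zero numerator, so the score is ~0.
    a = {43: 1000.0, 58: 270.0}
    b = {77: 1000.0, 105: 400.0}
    assert weighted_cosine(a, b, max_intensity=1000.0) < 0.1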
D. Step-by-Step Replication Instructions

1. Create / open an empty repo in VS Code.
2. Open the AI Agent, paste the MASTER PROMPT, and let it build the project.
3. Create and activate a virtual environment.
   Windows (PowerShell):
     py -3 -m venv .venv
     .venv\Scripts\Activate.ps1
   macOS / Linux:
     python3 -m venv .venv
     source .venv/bin/activate
4. Install the package (and the UI extra):
     pip install -e ".[ui]"
5. Run the tests:
     pytest -q
   All tests should pass.
6. Run the CLI (example; the backslashes are bash-style line continuations, so in PowerShell enter the command on a single line instead):
     ms-simple --query demo/query_toluene_golden.csv \
       --library path/to/unzipped/library_demo \
       --normalize --noise_floor 1 --unique --top 5 --json
7. Run the Streamlit app:
     streamlit run app.py
8. In the browser:
   - Upload demo/query_toluene_golden.csv as the query.
   - Upload demo/library_demo.zip as the library ZIP.
   - Use the default settings (normalize ON, noise floor ~1%, mz_tol ~0.49, unique ON, top N ~10).
   - Click "Run match" to see the Top matches.
   - Click "Generate report" to download report.html.
9. Record in your lab notebook:
   - Python version, date, and environment.
   - Settings: normalize, noise_floor, mz_tol, unique, top N.
   - Top hit and similarity score.
   - Any deviations from expected behavior.

E. How to Put This on Your Poster

You can add a small box like:

  Replicate this experiment
  Scan the QR code or visit: [UNIVERSITY URL TO PDF]
  Follow the MASTER PROMPT + steps to rebuild the ms-simple Trial 3 tool.
  Compare your top-hit scores to the reference values in the guide.

A suggested caption for the QR code: "Scan to rebuild the tool."

--------------------------------CUT LINE---end of instructions---CUT LINE-----------------------------

Please enjoy!
-David T. Burton