=====================
Lexer Plugins
=====================

This project's “lexer” is a lightweight Fortran *parser* that extracts enough
structure to drive directives and cross-references (it is not intended to be a
full compiler frontend).

This page explains:

- where parsing starts in the Sphinx build,
- what your lexer must return,
- how to populate the Fortran object model used by this project,
- how to register your lexer as a plugin.


Where Parsing Starts
====================

Parsing runs automatically during a Sphinx build:

1. Sphinx loads extensions listed in ``conf.py``.
2. This extension registers an event handler on ``builder-inited``.
3. When the builder initializes, the handler:

   - collects Fortran source files from ``fortran_sources`` (and applies
     ``fortran_sources_exclude``),
   - derives doc markers from ``fortran_doc_chars``,
   - resolves the configured lexer name via the lexer registry,
   - calls ``lexer.parse(files, doc_markers=...)``,
   - stores the returned symbols into the ``f`` domain data.

After that, directives (e.g. ``.. f:module::``) render by reading the stored
symbols from the domain.


The Lexer Interface
===================

A lexer is any object implementing the ``FortranLexer`` protocol defined in
``sphinx_fortran_domain.lexers``.

At minimum:

- ``name``: a string identifier
- ``parse(file_paths, *, doc_markers) -> FortranParseResult``

Example skeleton:

.. code-block:: python

   from __future__ import annotations

   from typing import Sequence

   from sphinx_fortran_domain.lexers import (
       FortranLexer,
       FortranParseResult,
       FortranModuleInfo,
   )
   from sphinx_fortran_domain.utils import read_text_utf8


   class MyLexer(FortranLexer):
       name = "my-lexer"

       def parse(self, file_paths: Sequence[str], *, doc_markers: Sequence[str]) -> FortranParseResult:
           modules = {}
           submodules = {}
           programs = {}

           for path in file_paths:
               text = read_text_utf8(path)
               # ... parse text ...
               # modules["mymodule"] = FortranModuleInfo(name="mymodule", doc="...", ...)

           return FortranParseResult(modules=modules, submodules=submodules, programs=programs)

The ``doc_markers`` argument is the *resolved* marker list (examples: ``["!>"]``,
``["!>", "!!"]``). It is your lexer's responsibility to interpret those markers
and extract doc text.


Reusable Helpers (Recommended)
==============================

If you are writing a plugin lexer, you can reuse a few small helper functions
provided by this project instead of re-implementing common logic.

These helpers live in ``sphinx_fortran_domain.utils``:

* ``read_text_utf8(path)`` / ``read_lines_utf8(path)``: consistent UTF-8 reads with ``errors="replace"``.
* ``extract_predoc_before_line(lines, idx, doc_markers=...)``: grab the contiguous doc block immediately *above* a definition line.
* ``is_doc_line(line, doc_markers)`` and ``find_inline_doc(line, doc_markers)``: detect doc markers in whole-line docs and inline comment docs.
* ``strip_inline_comment(line)``: best-effort removal of ``! ...`` trailing comments.

These helpers can also be useful:

* ``doc_markers_from_doc_chars(fortran_doc_chars)``: convert config values like ``['>']`` into concrete markers like ``['!>']``.
* ``collect_fortran_source_files_from_config(confdir, config)``: collect Fortran sources using the standard config attribute names.


Populating the Fortran Object Model
===================================

Your ``parse`` method returns a ``FortranParseResult``:

- ``modules``: mapping ``module_name -> FortranModuleInfo``
- ``submodules``: mapping ``submodule_name -> FortranSubmoduleInfo``
- ``programs``: mapping ``program_name -> FortranProgramInfo``

The domain stores these mappings and directives query them by exact key.

Naming
------

Use stable, human-facing names as keys (typically the Fortran entity name as
written in source). Cross-references like ``:f:mod:`foo``` resolve against these
names.

Locations
---------

Most object types accept an optional ``SourceLocation(path, lineno)``.
Providing locations is recommended because it enables:

- better debugging,
- better program-source extraction in directives,
- future improvements (source links, jump-to-definition, etc.).

Use 1-based line numbers.

Docs
----

Most objects accept a ``doc: str | None``. That string is treated as a
reStructuredText fragment by this project's directives.

Practical guidance:

- Store *plain text* without the leading doc marker.
- Keep line breaks as you want them rendered.
- You may include Sphinx roles/directives (e.g. ``:math:`...``` or ``.. math::``).

This project also supports a lightweight doc convention that is normalized
before parsing as reST (for example, ``## Title`` becomes a rubric). If you
extract docs from source, you can preserve those markers.


Objects You Can Produce
=======================

Below is a simplified overview of the main dataclasses.

Module
------

.. code-block:: python

   from sphinx_fortran_domain.lexers import FortranModuleInfo, SourceLocation

   mod = FortranModuleInfo(
       name="mymodule",
       doc="""Module documentation.""",
       procedures=[...],
       types=[...],
       interfaces=[...],
       location=SourceLocation(path="/abs/or/rel/path.f90", lineno=1),
   )

Procedures (functions/subroutines)
----------------------------------

Procedures are represented as ``FortranProcedure`` objects.

- ``kind`` must be either ``"function"`` or ``"subroutine"``.
- ``arguments`` is a sequence of ``FortranArgument``.
- For functions, optionally set ``result`` (a ``FortranArgument``) to document the
  result variable.

Derived types
-------------

Derived types are represented by ``FortranType``.

- components/attributes: ``FortranComponent``
- type-bound procedures: ``FortranTypeBoundProcedure``

Programs
--------

Programs are represented by ``FortranProgramInfo``.

If you can provide them:

- ``dependencies``: modules referenced via ``use`` statements
- ``procedures``: internal procedures after ``contains``
- ``source``: raw source string for the program unit

All of these are optional; you can start with just ``name``/``doc``/``location``.


Registering a Lexer Plugin
==========================

A plugin is typically a small Python package that exposes a Sphinx extension.
In its ``setup(app)`` function, register a lexer factory:

.. code-block:: python

   from sphinx_fortran_domain.lexers import register_lexer

   def setup(app):
       register_lexer("my-lexer", lambda: MyLexer())
       return {"version": "0.1.0", "parallel_read_safe": True}

Then, in the consuming project's ``conf.py``:

.. code-block:: python

   extensions = [
       "my_fortran_lexer_plugin",  # registers the lexer
       "sphinx_fortran_domain",
   ]

   fortran_lexer = "my-lexer"


Reference Plugin (Concrete Example)
===================================

This section shows a minimal, end-to-end "reference plugin" that you can copy
as a starting point.

It consists of:

* a Python module that registers a lexer via ``register_lexer``
* a project ``conf.py`` that enables the plugin and selects the lexer
* a minimal Fortran source file that the lexer parses

The plugin module
-----------------

Create a module (or package) that Sphinx can import, for example
``reference_plugin.py``:

For a fully working (tested) reference implementation, see
``tests/fixtures/reference_plugin.py`` in this repository.

.. code-block:: python

   from __future__ import annotations

    import re
    from typing import Dict, Sequence

   from sphinx_fortran_domain.lexers import (
       FortranLexer,
       FortranModuleInfo,
       FortranParseResult,
       SourceLocation,
       register_lexer,
   )

   from sphinx_fortran_domain.utils import extract_predoc_before_line, read_lines_utf8


   _RE_MODULE = re.compile(r"^\s*module\s+(?!procedure\b)([A-Za-z_]\w*)\b", re.IGNORECASE)


   class ReferenceLexer(FortranLexer):
       name = "reference"

       def parse(self, file_paths: Sequence[str], *, doc_markers: Sequence[str]) -> FortranParseResult:
           modules: Dict[str, FortranModuleInfo] = {}
           for file_path in file_paths:
               lines = read_lines_utf8(file_path)
               for idx, line in enumerate(lines):
                   m = _RE_MODULE.match(line)
                   if not m:
                       continue
                   name = m.group(1)
                   doc = extract_predoc_before_line(lines, idx, doc_markers=doc_markers)
                   modules[name] = FortranModuleInfo(
                       name=name,
                       doc=doc,
                       location=SourceLocation(path=str(file_path), lineno=idx + 1),
                   )
                   break
           return FortranParseResult(modules=modules, submodules={}, programs={})


   def setup(app):
       # The registry stores a factory; Sphinx will call it when parsing starts.
       register_lexer("reference", lambda: ReferenceLexer())
       return {"version": "0.1.0", "parallel_read_safe": True}


Enable it in a consuming project's conf.py
------------------------------------------

In the consuming project (the docs you are building), enable both the plugin
and this domain:

.. code-block:: python

   extensions = [
       "reference_plugin",      # registers the lexer
       "sphinx_fortran_domain", # uses the registered lexer
   ]

   fortran_lexer = "reference"


Minimal Fortran input
---------------------

With the default doc marker ``!>``, a minimal module might look like:

.. code-block:: fortran

   !> This is a test module.
   module mymod
   end module mymod

When you run Sphinx, parsing starts at ``builder-inited`` and your lexer is
invoked with the collected ``fortran_sources``.