Adult income: JSON input with mixed feature types
=================================================

Most real tabular models have categorical features. A positional JSON
array of floats falls apart the moment one of those fields is
``"Bachelors"`` instead of a number; you have to map strings to integers
first. Tutorial 02 handled numeric normalisation; this tutorial handles
the harder case: a model that mixes string categoricals with numerics,
and a client that wants to send a plain JSON object without knowing the
encoding tables.

Edgeflow handles this with **named-input mode**: the client sends JSON
with named fields, the server applies the encoding tables stored in
``schema.json`` to produce a flat float tensor, and the model sees the
same shape it saw during training.

You will:

1. Train an XGBoost classifier on the UCI Adult Income dataset, with
   ``OrdinalEncoder`` for categoricals.
2. Push the column transformer to edgeflow so its encoding tables
   become part of the deployment.
3. Query the model with a JSON object, as a real API client would.

Prerequisites
-------------

- Edgeflow running via docker compose (see tutorial 01).
- Python 3.12+ and ``uv``.

1. Train and deploy
-------------------

.. code-block:: bash

   curl -O https://raw.githubusercontent.com/jordandelbar/edgeflow/main/examples/03-adult-income/train.py
   uv run train.py

The script:

- Pulls the UCI Adult Income CSV directly from
  ``archive.ics.uci.edu``.
- Splits train/test, builds a ``ColumnTransformer`` with
  ``OrdinalEncoder`` for the 8 categorical columns and passthrough for
  the 6 numerical columns.
- Trains an ``XGBClassifier``.
- Calls ``edgeflow.log_model`` with both the ONNX model **and** the
  column transformer, so its encoding tables are written into
  ``schema.json``.

The column transformer is the hinge of named-input mode:

.. literalinclude:: ../../../examples/03-adult-income/train.py
   :language: python
   :start-after: # [docs:start:column-transformer]
   :end-before: # [docs:end:column-transformer]
   :dedent:

Edgeflow introspects this object to derive the per-field encoding
tables. The ``log_model`` call passes it alongside the ONNX bytes:

.. literalinclude:: ../../../examples/03-adult-income/train.py
   :language: python
   :start-after: # [docs:start:log-model]
   :end-before: # [docs:end:log-model]
   :dedent:

Expected output:

.. code-block:: text

   model type: xgboost
   fetching adult income dataset from https://archive.ics.uci.edu/...
   dataset: 32,561 rows, 14 features
   class balance: 24.1% >50K
   training xgboost...
   F1: 0.7095  AUC-ROC: 0.9285
   pushing to edgeflow at http://localhost:5000...

2. Send a JSON request
----------------------

.. code-block:: bash

   curl -s -X POST http://localhost:8080/infer \
     -H "content-type: application/json" \
     -d '{
       "workclass": "Private",
       "education": "Bachelors",
       "marital-status": "Married-civ-spouse",
       "occupation": "Exec-managerial",
       "relationship": "Husband",
       "race": "White",
       "sex": "Male",
       "native-country": "United-States",
       "age": 45,
       "fnlwgt": 200000,
       "education-num": 13,
       "capital-gain": 0,
       "capital-loss": 0,
       "hours-per-week": 40
     }'

You get back the predicted label (``>50K`` or ``<=50K``) along with the
class probabilities.
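
The same request can be built from Python. The sketch below uses only the
standard library and assumes the endpoint and field names from the ``curl``
call above. The helper names ``build_infer_request`` and ``send`` are
hypothetical, and the response is returned as parsed JSON without assuming
any particular key names.

.. code-block:: python

   import json
   import urllib.request

   # Assumed endpoint: same host, port, and path as the curl example.
   INFER_URL = "http://localhost:8080/infer"

   # One named-input record: 8 string categoricals and 6 numerics.
   record = {
       "workclass": "Private",
       "education": "Bachelors",
       "marital-status": "Married-civ-spouse",
       "occupation": "Exec-managerial",
       "relationship": "Husband",
       "race": "White",
       "sex": "Male",
       "native-country": "United-States",
       "age": 45,
       "fnlwgt": 200000,
       "education-num": 13,
       "capital-gain": 0,
       "capital-loss": 0,
       "hours-per-week": 40,
   }


   def build_infer_request(payload, url=INFER_URL):
       """Wrap a named-input record in a JSON POST request (hypothetical helper)."""
       body = json.dumps(payload).encode("utf-8")
       return urllib.request.Request(
           url,
           data=body,
           headers={"content-type": "application/json"},
           method="POST",
       )


   def send(req, timeout=10):
       """Send the request and return the decoded JSON response."""
       with urllib.request.urlopen(req, timeout=timeout) as response:
           return json.loads(response.read())


   request = build_infer_request(record)

With the deployment from step 1 running, ``send(request)`` performs the call;
building the request separately keeps the payload easy to inspect or log
before anything goes over the wire.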

What just happened?
-------------------

When you called ``log_model``, edgeflow introspected the
``ColumnTransformer`` and wrote each field's dtype and encoding into
a ``schema.json`` artifact bundled with the ONNX bytes: an ordinal
map for categoricals, passthrough for numerics.

When the inference pod loaded that artifact, the schema told it to
expect JSON objects keyed by field name (named-input mode) rather
than the positional float array tutorials 01 and 02 used. On each
request the server parses the JSON, looks up each categorical value
in its encoding table, and assembles a flat ``f32`` tensor in the
order the model expects, all before the ONNX session sees a single
byte.
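
The per-request encoding step can be sketched in a few lines of Python. This
is an illustration, not edgeflow's server code: the ``SCHEMA`` layout, the
``encode_record`` helper, and the abbreviated category tables are all
assumptions made for the example, and the real ``schema.json`` format may
differ.

.. code-block:: python

   from array import array

   # Hypothetical, abbreviated schema: fields in the order the model expects.
   # A dict is an ordinal table for a categorical field; None marks a numeric
   # field that is passed through unchanged.
   SCHEMA = [
       ("workclass", {"Private": 0, "Self-emp-inc": 1}),
       ("education", {"Bachelors": 0, "HS-grad": 1}),
       ("age", None),
       ("hours-per-week", None),
   ]

   # Code used for categories missing from a table (see "Unknown categories").
   UNKNOWN = -1.0


   def encode_record(record, schema=SCHEMA):
       """Turn a named JSON object into a flat row of 32-bit floats."""
       values = array("f")  # typecode "f" stores 32-bit floats
       for name, table in schema:
           # A missing field raises KeyError in this sketch; how edgeflow
           # itself treats missing fields is not covered here.
           raw = record[name]
           if table is None:
               values.append(float(raw))
           else:
               values.append(float(table.get(raw, UNKNOWN)))
       return values


   row = encode_record(
       {"workclass": "Private", "education": "HS-grad", "age": 45, "hours-per-week": 40}
   )
   print(list(row))  # [0.0, 1.0, 45.0, 40.0]

The key order in the incoming JSON object does not matter here: the schema,
not the client, decides where each value lands in the tensor.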

Unknown categories
------------------

The encoder is configured with ``unknown_value=-1``. Send
``"workclass": "ImaginaryJob"`` and the request still succeeds; the
model just sees ``-1`` for that feature. This matters in production:
real client data contains values that never appeared in the training
set, and degrading gracefully beats returning a 500 error.
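
The fallback comes from scikit-learn's ``OrdinalEncoder`` and can be
reproduced without edgeflow. A minimal sketch, assuming a toy three-value
``workclass`` column rather than the real training data:

.. code-block:: python

   from sklearn.preprocessing import OrdinalEncoder

   # scikit-learn accepts unknown_value only together with
   # handle_unknown="use_encoded_value".
   encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

   # Toy training column (assumption): categories are sorted, so
   # "Private" -> 0, "Self-emp-inc" -> 1, "State-gov" -> 2.
   encoder.fit([["Private"], ["Self-emp-inc"], ["State-gov"]])

   # A known category keeps its ordinal code; an unseen one becomes -1.
   encoded = encoder.transform([["Private"], ["ImaginaryJob"]])
   print(encoded.ravel().tolist())  # [0.0, -1.0]

Because ``-1`` never occurs as a code for a known category, the model can
distinguish an unknown value from every category it was trained on.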

Next steps
----------

- :doc:`05-k3d-yolo` - image inputs (raw JPEG/PNG bytes), WASM
  pre-transform that decodes and resizes, postprocess that runs NMS.