Adult income: JSON input with mixed feature types#

Most real tabular models have categorical features. A positional JSON array of floats breaks down as soon as one of those fields is "Bachelors" rather than a number, because something has to map strings to integers first. Tutorial 02 handled numeric normalisation. This tutorial handles the harder case: a model that mixes string categoricals with numerics, and a client that wants to send a plain JSON object without knowing the encoding tables.

Edgeflow handles this with named-input mode: the client sends JSON with named fields, the server applies the encoding tables stored in schema.json to produce a flat float tensor, and the model sees the same shape it saw during training.

You will:

  1. Train an XGBoost classifier on the UCI Adult Income dataset, with OrdinalEncoder for categoricals.

  2. Push the column transformer to edgeflow so its encoding tables become part of the deployment.

  3. Hit the model with a JSON request, as a real API client would.

Prerequisites#

  • Edgeflow running via docker compose (see tutorial 01).

  • Python 3.12+ and uv.

1. Train and deploy#

curl -O https://raw.githubusercontent.com/jordandelbar/edgeflow/main/examples/03-adult-income/train.py
uv run train.py

The script:

  • Pulls the UCI Adult Income CSV directly from archive.ics.uci.edu.

  • Splits the data into train and test sets, then builds a ColumnTransformer with OrdinalEncoder for the 8 categorical columns and passthrough for the 6 numerical columns.

  • Trains an XGBClassifier.

  • Calls edgeflow.log_model with both the ONNX model and the column transformer, so the transformer's encoding tables are written into schema.json.

The column transformer is the hinge of named-input mode:

preprocessor = ColumnTransformer(
    [
        (
            "cat",
            OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
            categorical_cols,
        ),
        ("num", "passthrough", numerical_cols),
    ]
)

Edgeflow introspects this object to derive the per-field encoding tables. The log_model call passes it alongside the ONNX bytes:

edgeflow.log_model(
    model_bytes=clf_to_onnx(clf),
    postprocess=edgeflow.Pipeline(
        [edgeflow.ClassifierOutput(labels=["<=50K", ">50K"])]
    ),
    column_transformer=preprocessor,
)
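To make "encoding tables" concrete, here is a minimal plain-Python sketch of what OrdinalEncoder does with its default settings: it sorts each column's distinct training values and assigns them consecutive integers. The helper name and the toy column are illustrative assumptions; this is not code from train.py or from edgeflow.

```python
# Pure-Python illustration of how an ordinal table is derived.
# fit_ordinal_table is a hypothetical helper, not an edgeflow or scikit-learn API.


def fit_ordinal_table(values):
    """Map each distinct category to an integer code, in sorted order."""
    return {category: code for code, category in enumerate(sorted(set(values)))}


# Toy training column (made-up sample, not the real dataset).
education_train = ["HS-grad", "Bachelors", "Masters", "HS-grad", "Bachelors"]
education_table = fit_ordinal_table(education_train)

print(education_table)  # {'Bachelors': 0, 'HS-grad': 1, 'Masters': 2}
```

One such table per categorical column is what the client would otherwise have to know in order to send raw floats.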

Expected output:

model type: xgboost
fetching adult income dataset from https://archive.ics.uci.edu/...
dataset: 32,561 rows, 14 features
class balance: 24.1% >50K
training xgboost...
F1: 0.7095  AUC-ROC: 0.9285
pushing to edgeflow at http://localhost:5000...

2. Send a JSON request#

curl -s -X POST http://localhost:8080/infer \
  -H "content-type: application/json" \
  -d '{
    "workclass": "Private",
    "education": "Bachelors",
    "marital-status": "Married-civ-spouse",
    "occupation": "Exec-managerial",
    "relationship": "Husband",
    "race": "White",
    "sex": "Male",
    "native-country": "United-States",
    "age": 45,
    "fnlwgt": 200000,
    "education-num": 13,
    "capital-gain": 0,
    "capital-loss": 0,
    "hours-per-week": 40
  }'

You get back the predicted label (>50K or <=50K) along with the class probabilities.
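The same request can be sent from Python. The sketch below uses only the standard library and the endpoint from the curl command. The helper names are illustrative, and the response is returned as parsed JSON without assuming its exact field names, since the tutorial does not spell them out.

```python
import json
import urllib.request

# Same endpoint as the curl example.
INFER_URL = "http://localhost:8080/infer"


def build_request(features, url=INFER_URL):
    """Wrap a dict of named features in a JSON POST request."""
    body = json.dumps(features).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"content-type": "application/json"},
        method="POST",
    )


def predict(features, url=INFER_URL, timeout=10.0):
    """Send the request and return the decoded JSON response unchanged."""
    with urllib.request.urlopen(build_request(features, url), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


person = {
    "workclass": "Private",
    "education": "Bachelors",
    "marital-status": "Married-civ-spouse",
    "occupation": "Exec-managerial",
    "relationship": "Husband",
    "race": "White",
    "sex": "Male",
    "native-country": "United-States",
    "age": 45,
    "fnlwgt": 200000,
    "education-num": 13,
    "capital-gain": 0,
    "capital-loss": 0,
    "hours-per-week": 40,
}

request = build_request(person)
# With edgeflow running locally, call: print(predict(person))
```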

What just happened?#

When you called log_model, edgeflow introspected the ColumnTransformer and wrote each field’s dtype and encoding into a schema.json artifact bundled with the ONNX bytes: an ordinal map for categoricals, passthrough for numerics.
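The tutorial does not show the contents of schema.json, so the following is only a guess at its general shape: one entry per field, in the order the ColumnTransformer emits columns (categoricals first, then numerics, matching the order of its transformers). Every key name here ("input_mode", "fields", "encoding", and so on) is a hypothetical placeholder, not the real edgeflow format.

```python
import json

# Hypothetical schema layout; the real schema.json written by edgeflow may differ.


def build_schema(categorical_tables, numerical_cols, unknown_value=-1):
    """Describe each input field: ordinal map for categoricals, passthrough for numerics."""
    fields = []
    for name, table in categorical_tables.items():
        fields.append(
            {
                "name": name,
                "dtype": "string",
                "encoding": "ordinal",
                "map": table,
                "unknown_value": unknown_value,
            }
        )
    for name in numerical_cols:
        fields.append({"name": name, "dtype": "float", "encoding": "passthrough"})
    return {"input_mode": "named", "fields": fields}


# Truncated example: one categorical and two numeric columns.
tables = {"sex": {"Female": 0, "Male": 1}}
schema = build_schema(tables, ["age", "hours-per-week"])
print(json.dumps(schema, indent=2))
```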

When the inference pod loaded that artifact, the schema told it to expect JSON objects keyed by field name (named-input mode) rather than the positional float array that tutorials 01 and 02 used. On each request the server parses the JSON, looks up each categorical value in its encoding table, and assembles a flat f32 tensor in the order the model expects, all before the ONNX session sees a single byte.
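The per-request steps (parse, look up, assemble) can be sketched in a few lines of Python. This is an illustration of the idea under assumed, truncated encoding tables, not edgeflow's actual server code; FIELDS and encode_request are hypothetical names.

```python
import json
from array import array

# Hypothetical, truncated field list in model order:
# (field name, ordinal table), or None for passthrough numerics.
FIELDS = [
    ("sex", {"Female": 0, "Male": 1}),
    ("education", {"Bachelors": 0, "HS-grad": 1, "Masters": 2}),
    ("age", None),
    ("hours-per-week", None),
]


def encode_request(raw_body, fields=FIELDS, unknown_value=-1):
    """Turn a JSON object of named fields into a flat f32 row in model order."""
    record = json.loads(raw_body)
    row = array("f")  # typecode 'f' stores C floats (f32)
    for name, table in fields:
        value = record[name]
        if table is None:
            row.append(float(value))  # numeric: pass through
        else:
            row.append(float(table.get(value, unknown_value)))  # categorical: look up
    return row


# Key order in the JSON body does not matter; the field list fixes the tensor order.
body = '{"age": 45, "education": "Bachelors", "sex": "Male", "hours-per-week": 40}'
tensor = encode_request(body)
print(list(tensor))  # [1.0, 0.0, 45.0, 40.0]
```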

Unknown categories#

The encoder is configured with handle_unknown="use_encoded_value" and unknown_value=-1. Send "workclass": "ImaginaryJob" and the request still succeeds; the model simply sees -1 for that feature. This matters in production: real client data contains values the model never saw during training, and degrading gracefully beats returning a 500 error.
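The difference between the two behaviours comes down to how the table lookup treats a missing key. The sketch below contrasts a strict lookup, which mirrors OrdinalEncoder's default handle_unknown="error", with the fallback configured in this tutorial. The table is a truncated, hypothetical example and the function names are illustrative.

```python
# Truncated, hypothetical table for the "workclass" column.
WORKCLASS = {"Private": 0, "Self-emp-inc": 1, "State-gov": 2}


def encode_strict(table, value):
    """Reject unseen categories: raises KeyError, which would surface as an error response."""
    return table[value]


def encode_with_fallback(table, value, unknown_value=-1):
    """Accept unseen categories by substituting the configured sentinel."""
    return table.get(value, unknown_value)


try:
    encode_strict(WORKCLASS, "ImaginaryJob")
    strict_outcome = "accepted"
except KeyError:
    strict_outcome = "rejected"

fallback_code = encode_with_fallback(WORKCLASS, "ImaginaryJob")
print(strict_outcome, fallback_code)  # rejected -1
```

Since -1 never occurs as a code for a known category, the model can in principle learn to treat it as its own "unknown" bucket.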

Next steps#