Track ML and deep learning training runs

03/27/2024

The MLflow tracking component lets you log source properties, parameters, metrics, tags, and artifacts related to training a machine learning or deep learning model. To get started with MLflow, try one of the MLflow quickstart tutorials.

MLflow tracking with experiments and runs

MLflow tracking is based on two concepts, experiments and runs:

Note

Starting March 27, 2024, MLflow imposes a quota limit on the total number of parameters, tags, and metric steps for all existing and new runs, and on the total number of runs for all existing and new experiments. See Resource limits. If you hit the runs-per-experiment quota, Databricks recommends that you delete runs you no longer need by using the delete runs API in Python. If you hit other quota limits, Databricks recommends adjusting your logging strategy to stay under the limit. If you require an increase to a limit, reach out to your Databricks account team with a brief explanation of your use case, why the suggested mitigation approaches do not work, and the new limit you are requesting.
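For the runs-per-experiment quota, one possible cleanup approach is sketched below. It is an illustration, not the official procedure: the helper name `delete_runs_older_than` and the age-based rule are assumptions, and the `client` argument is expected to behave like `mlflow.tracking.MlflowClient` (its `search_runs` and `delete_run` methods are the only calls used).

```python
from datetime import datetime, timedelta, timezone


def delete_runs_older_than(client, experiment_id, max_age_days, now=None):
    """Delete runs in one experiment that started more than `max_age_days` ago.

    `client` is assumed to behave like `mlflow.tracking.MlflowClient`.
    Returns the IDs of the runs that were deleted.
    """
    now = now or datetime.now(timezone.utc)
    cutoff_ms = int((now - timedelta(days=max_age_days)).timestamp() * 1000)

    deleted = []
    # Run start times are reported in milliseconds since the Unix epoch.
    # Note: this reads a single page of search results only.
    for run in client.search_runs(experiment_ids=[experiment_id]):
        if run.info.start_time < cutoff_ms:
            client.delete_run(run.info.run_id)
            deleted.append(run.info.run_id)
    return deleted
```

With MLflow installed, a call might look like `delete_runs_older_than(MlflowClient(), "<experiment-id>", max_age_days=90)`. Because a single `search_runs` call returns a bounded page of results, a large experiment may need several passes.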

An MLflow experiment is the primary unit of organization and access control for MLflow runs; all MLflow runs belong to an experiment. Experiments let you visualize, search for, and compare runs, as well as download run artifacts and metadata for analysis in other tools.
An MLflow run corresponds to a single execution of model code.
Organize training runs with MLflow experiments
Manage training code with MLflow runs

The MLflow Tracking API logs parameters, metrics, tags, and artifacts from a model run. The Tracking API communicates with an MLflow tracking server. When you use Databricks, a Databricks-hosted tracking server logs the data. The hosted MLflow tracking server has Python, Java, and R APIs.
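As a rough illustration of the four kinds of data named above, the following sketch logs parameters, a metric with steps, tags, and a small JSON artifact in a single run. The function name `log_training_run` and the values in the example call are hypothetical. The function takes the tracking module as an argument (pass the imported `mlflow` module) so that the sketch can be exercised without a tracking server.

```python
def log_training_run(tracking, params, losses, tags):
    """Log one run's parameters, per-step metric values, tags, and an artifact.

    `tracking` is assumed to be the imported `mlflow` module, or an object
    that exposes the same functions. Returns the ID of the created run.
    """
    with tracking.start_run() as run:
        # Parameters: inputs to the training code, logged once per run.
        tracking.log_params(params)
        # Metrics: numeric values that can be recorded at each step.
        for step, loss in enumerate(losses):
            tracking.log_metric("loss", loss, step=step)
        # Tags: free-form labels attached to the run.
        tracking.set_tags(tags)
        # Artifacts: files stored with the run, here the params as JSON.
        tracking.log_dict(params, "config.json")
        return run.info.run_id


# Example call, assuming MLflow is installed and a tracking server is set up:
#   import mlflow
#   run_id = log_training_run(
#       mlflow,
#       params={"learning_rate": 0.01, "epochs": 3},
#       losses=[0.9, 0.6, 0.4],
#       tags={"team": "example"},
#   )
```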

Note

MLflow is preinstalled on Databricks Runtime ML clusters. To use MLflow on a cluster that runs standard Databricks Runtime (not Databricks Runtime ML), you must install the mlflow library yourself. For instructions on installing a library on a cluster, see Install a library on a cluster. The specific packages to install for MLflow are:

For Python, select Library Source PyPI and enter mlflow in the Package field.
For R, select Library Source CRAN and enter mlflow in the Package field.
For Scala, install these two packages:
- Select Library Source Maven and enter org.mlflow:mlflow-client:1.11.0 in the Coordinates field.
- Select Library Source PyPI and enter mlflow in the Package field.

Where MLflow runs are logged

All MLflow runs are logged to the active experiment, which you can set in any of the following ways:

Use the mlflow.set_experiment() command.
Use the experiment_id parameter in the mlflow.start_run() command.
Set one of the MLflow environment variables MLFLOW_EXPERIMENT_NAME or MLFLOW_EXPERIMENT_ID.

If no active experiment is set, runs are logged to the notebook experiment.
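The three options above can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names are hypothetical, and `tracking` stands for the imported `mlflow` module (it is passed in so the sketch can run without a tracking server). The environment variable is assumed to be set before any run is started.

```python
import os


def run_in_named_experiment(tracking, experiment_path):
    """Option 1: make `experiment_path` the active experiment, then run."""
    tracking.set_experiment(experiment_path)
    with tracking.start_run() as run:
        return run.info.experiment_id


def run_in_experiment_id(tracking, experiment_id):
    """Option 2: pass the experiment ID directly when starting the run."""
    with tracking.start_run(experiment_id=experiment_id) as run:
        return run.info.experiment_id


def select_experiment_via_environment(experiment_name):
    """Option 3: name the experiment through an MLflow environment variable."""
    os.environ["MLFLOW_EXPERIMENT_NAME"] = experiment_name


# Example calls, assuming MLflow is installed (paths and IDs are placeholders):
#   import mlflow
#   run_in_named_experiment(mlflow, "/your-experiment")
#   run_in_experiment_id(mlflow, "<experiment-id>")
#   select_experiment_via_environment("/your-experiment")
```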

To log your experiment results to a remotely hosted MLflow Tracking server in a workspace other than the one in which you are running your experiment, set the tracking URI to reference the remote workspace with mlflow.set_tracking_uri(), and set the path to your experiment in the remote workspace by using mlflow.set_experiment().

import mlflow

mlflow.set_tracking_uri(<uri-of-remote-workspace>)  # placeholder: replace with the remote workspace URI
mlflow.set_experiment("path to experiment in remote workspace")

If you are running experiments locally and want to log experiment results to the Databricks MLflow Tracking server, provide your Databricks workspace instance (DATABRICKS_HOST) and Databricks personal access token (DATABRICKS_TOKEN). Next, you can set the tracking URI to reference the workspace with mlflow.set_tracking_uri(), and set the path to your experiment by using mlflow.set_experiment(). See Perform Azure Databricks personal access token authentication for details on where to find values for the DATABRICKS_HOST and DATABRICKS_TOKEN environment variables.

The following code example demonstrates setting these values:


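import os

import mlflow

os.environ["DATABRICKS_HOST"] = "https://dbc-1234567890123456.cloud.databricks.com" # set to your server URI
os.environ["DATABRICKS_TOKEN"] = "dapixxxxxxxxxxxxx" # set to your personal access token

# "databricks" points MLflow at the workspace identified by the two variables above.
mlflow.set_tracking_uri("databricks")
mlflow.set_experiment("/your-experiment")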

Logging example notebook

This notebook shows how to log runs to a notebook experiment and to a workspace experiment. Only MLflow runs initiated within a notebook can be logged to the notebook experiment. MLflow runs launched from any notebook or from the APIs can be logged to a workspace experiment. For information about viewing logged runs, see View notebook experiment and View workspace experiment.

Log MLflow runs notebook

You can use MLflow Python, Java or Scala, and R APIs to start runs and record run data. For details, see the MLflow example notebooks.

Access the MLflow tracking server from outside Azure Databricks

You can also write to and read from the tracking server from outside Azure Databricks, for example using the MLflow CLI. See Access the MLflow tracking server from outside Azure Databricks.

Analyze MLflow runs programmatically

You can access MLflow run data programmatically using the following two DataFrame APIs:

The MLflow Python client search_runs API returns a pandas DataFrame.
The MLflow experiment data source returns an Apache Spark DataFrame.

This example demonstrates how to use the MLflow Python client to build a dashboard that visualizes changes in evaluation metrics over time, tracks the number of runs started by a specific user, and measures the total number of runs across all users:

Build dashboards with the MLflow Search API
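Separately from the notebook above, the following is a minimal sketch of the same three summaries computed from the pandas DataFrame that `mlflow.search_runs` returns. The function name `summarize_runs` is hypothetical. The column names `start_time`, `metrics.<name>`, and `tags.mlflow.user` follow the layout of that DataFrame; the metric name and user in any call are up to you.

```python
import pandas as pd


def summarize_runs(runs: pd.DataFrame, metric: str, user: str) -> dict:
    """Summarize a DataFrame shaped like the result of `mlflow.search_runs`."""
    metric_column = f"metrics.{metric}"

    # Metric values ordered by run start time, skipping runs without the metric.
    metric_over_time = (
        runs.dropna(subset=[metric_column])
        .sort_values("start_time")
        .loc[:, ["start_time", metric_column]]
    )
    # Number of runs started by one user (MLflow records it as a tag).
    runs_by_user = int((runs["tags.mlflow.user"] == user).sum())

    return {
        "metric_over_time": metric_over_time,
        "runs_by_user": runs_by_user,
        "total_runs": len(runs),
    }


# Example call, assuming MLflow is installed (the experiment ID is a placeholder):
#   import mlflow
#   runs = mlflow.search_runs(experiment_ids=["<experiment-id>"])
#   summary = summarize_runs(runs, metric="rmse", user="someone@example.com")
```

The `metric_over_time` frame can be passed directly to a plotting call, while the two counts are plain integers suitable for dashboard counters.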

Why model training metrics and outputs may vary

Many of the algorithms used in ML have a random element, such as sampling or random initial conditions within the algorithm itself. When you train a model using one of these algorithms, the results might not be the same with each run, even if you start the run with the same conditions. Many libraries offer a seeding mechanism to fix the initial conditions for these stochastic elements. However, there may be other sources of variation that seeds do not control. Some algorithms are sensitive to the order of the data, and distributed ML algorithms may also be affected by how the data is partitioned. Generally, this variation is neither significant nor important to the model development process.
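To make the effect of seeding concrete, here is a small sketch that uses Python's standard `random` module as a stand-in for an ML library's random number source. The function `simulated_training_score` is hypothetical and does no real training; it only shows that fixing the seed fixes the random initial conditions.

```python
import random


def simulated_training_score(seed=None):
    """Stand-in for a stochastic training routine with random initial weights.

    Passing a seed fixes the draw; leaving it as None lets Python choose one.
    """
    rng = random.Random(seed)
    initial_weights = [rng.gauss(0.0, 1.0) for _ in range(5)]
    return sum(w * w for w in initial_weights)


# Same seed, same initial conditions, same result.
seeded_a = simulated_training_score(seed=42)
seeded_b = simulated_training_score(seed=42)
print(seeded_a == seeded_b)

# No seed: the two results will almost certainly differ.
unseeded_a = simulated_training_score()
unseeded_b = simulated_training_score()
print(unseeded_a == unseeded_b)
```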

To control variation caused by differences in ordering and partitioning, use the PySpark functions repartition and sortWithinPartitions.
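A minimal sketch of that combination follows. The helper name `fix_data_layout` and its arguments are hypothetical; `df` is assumed to be a PySpark DataFrame, and only its `repartition` and `sortWithinPartitions` methods are used.

```python
def fix_data_layout(df, num_partitions, key_column):
    """Return `df` with a repeatable partition assignment and row order.

    `df` is assumed to be a PySpark DataFrame.
    """
    # Repartitioning by a column assigns rows to partitions by hashing the key,
    # so the same key lands in the same partition for a fixed partition count.
    repartitioned = df.repartition(num_partitions, key_column)
    # Sorting inside each partition fixes the order in which rows are read.
    return repartitioned.sortWithinPartitions(key_column)


# Example call, assuming a Spark session and a DataFrame with an "id" column:
#   stable_df = fix_data_layout(training_df, num_partitions=8, key_column="id")
```

If the key column contains duplicates, rows that share a key can still appear in either order, so a unique key is the safer choice.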

MLflow tracking examples

The following notebooks demonstrate how to train several types of models and track the training data in MLflow, and how to store tracking data in Delta Lake.