Getting started

Train a powerful Machine Learning model in a single line of code with Falcon!

Falcon is a simple and lightweight AutoML library designed for people who want to train a model on a custom dataset in an instant even without specific data-science knowledge. Simply give Falcon your dataset and specify which feature you want the ML model to predict. Falcon will do the rest!

Falcon allows the trained models to be immediately used in production by saving them in the widely used ONNX format. No need to write custom code to save complicated models to ONNX anymore!

Installation

Stable release from PyPI

pip install falcon-ml

Latest version from GitHub

pip install git+https://github.com/OKUA1/falcon

Installing some of the dependencies on Apple Silicon Macs might not work. The workaround is to create an x86 environment using Conda:

conda create -n falcon_env
conda activate falcon_env
conda config --env --set subdir osx-64
conda install python=3.9
pip3 install falcon-ml

Usage

Currently, Falcon supports only tabular datasets and two tasks: 'tabular_classification' and 'tabular_regression'.

The easiest way to use the library is by using the highest level API as shown below:

from falcon import AutoML

AutoML(task = 'tabular_classification', train_data = 'titanic.csv')

This single line of code will read and prepare the dataset, scale/encode the features, encode the labels, train the model and save it as an ONNX file for future inference.

It is also possible to explicitly specify the feature and target columns (otherwise the last column is used as the target and all others as features) and the test data used for the evaluation report (otherwise 25% of the training set is held out).

from falcon import AutoML

manager = AutoML(task = 'tabular_classification', train_data = 'titanic.csv', test_data = 'titanic_test.csv', features = ['sex', 'gender', 'class', 'age'], target = 'survived')
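
The defaults described above (last column as target, 25% of the training rows held out) can be illustrated with a short, library-independent sketch. The apply_defaults helper, its parameters and the toy rows are hypothetical and not part of Falcon's API. In particular, the random shuffle is an assumption, since the text does not say how Falcon selects the held-out rows.

```python
import random


def apply_defaults(columns, rows, features=None, target=None,
                   test_fraction=0.25, seed=42):
    """Illustrate the documented defaults (not Falcon's actual implementation)."""
    # Default target: the last column of the dataset.
    if target is None:
        target = columns[-1]
    # Default features: every column except the target.
    if features is None:
        features = [name for name in columns if name != target]
    # Hold out a fraction of the rows for evaluation (assumed: random selection).
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    test_idx = set(indices[:round(len(rows) * test_fraction)])
    train = [row for i, row in enumerate(rows) if i not in test_idx]
    test = [row for i, row in enumerate(rows) if i in test_idx]
    return features, target, train, test


columns = ["sex", "class", "age", "survived"]
rows = [
    ["male", 3, 22, 0], ["female", 1, 38, 1], ["female", 3, 26, 1],
    ["female", 1, 35, 1], ["male", 3, 35, 0], ["male", 1, 54, 0],
    ["male", 3, 2, 0], ["female", 3, 27, 1],
]
features, target, train, test = apply_defaults(columns, rows)
print(features, target, len(train), len(test))
```

With eight rows, two end up in the evaluation subset and six remain for training. Passing features, target or a separate test set explicitly, as in the AutoML call above, replaces the corresponding default.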

It is also possible to provide train/test data as a pandas dataframe, a numpy array, or a tuple containing X and y. To do that, simply pass the required object as an argument. This is useful when custom data preparation is needed or when the data comes from a non-conventional source.

from falcon import AutoML
import pandas as pd

df = pd.read_csv('titanic.csv')
X_test = pd.read_csv('X_test.csv')
y_test = pd.read_csv('y_test.csv')

manager = AutoML(task = 'tabular_classification', train_data = df, test_data = (X_test, y_test), features = ['sex', 'gender', 'class', 'age'], target = 'survived')

While the AutoML function enables extremely fast experimentation, it does not provide much control over the training steps and might not be flexible enough for more advanced users. As an alternative, it is possible to use the relevant TaskManager class, either directly or through the initialize helper function.

from falcon import initialize
import pandas as pd

test_df = pd.read_csv('titanic_test.csv')

manager = initialize(task='tabular_classification', data='titanic.csv')
manager.train(make_eval_subset = True)
manager.performance_summary(test_df)

When using the initialize function, it is also possible to provide a custom configuration or even a custom pipeline. For more details, please check the API reference section.

Demo datasets

You can try out Falcon using one of the built-in demo datasets.

from falcon import AutoML
from falcon.datasets import load_churn_dataset, load_insurance_dataset # churn -> classification; insurance -> regression

df = load_churn_dataset()

AutoML(task = 'tabular_classification', train_data = df)

Making predictions with trained models

There are two ways to make a prediction using a trained model. If the input/unlabeled data is available right away, the same manager object that was used for training the model can be used. Note that the input data must have the same structure as the training set (the same number, order and type of features). The model assumes this but does not explicitly check it at runtime. The recommended approach is to provide the data as a numpy array.

from falcon import AutoML
import pandas as pd

df = pd.read_csv('training_data.csv')
manager = AutoML(task = 'tabular_classification', train_data = df)

unlabeled_data = pd.read_csv('unlabeled_data.csv').to_numpy()
predictions = manager.predict(unlabeled_data)
print(predictions)
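
Because the structure of the input is assumed rather than checked, it can be worth validating it before calling predict. The sketch below is one possible approach using only the standard library. The align_rows helper and the schema dictionary are hypothetical names introduced for illustration and are not part of Falcon.

```python
import csv
import io


def align_rows(records, schema):
    """Reorder and cast each record to match the training feature layout.

    `schema` maps feature names (in training order) to a type converter.
    """
    expected = list(schema)
    aligned = []
    for i, record in enumerate(records):
        # Number of features: reject rows with missing or unexpected columns.
        if set(record) != set(expected):
            raise ValueError(
                f"row {i}: expected columns {expected}, got {list(record)}"
            )
        # Order and type: emit values in training order, cast to training types.
        aligned.append([schema[name](record[name]) for name in expected])
    return aligned


# Assumed training layout: three features in this order with these types.
schema = {"sex": str, "class": int, "age": float}

# Incoming data with the columns in a different order.
raw = "age,sex,class\n22,male,3\n38,female,1\n"
records = list(csv.DictReader(io.StringIO(raw)))

rows = align_rows(records, schema)
print(rows)
```

The aligned rows can then be converted to a numpy array before being passed to manager.predict. A row with a missing or extra column raises a ValueError instead of silently producing a misaligned prediction.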

While reusing the manager object is straightforward, in real-world applications new/unlabeled data is not always available right away. Therefore, it is desirable to train a model once and reuse it later.

One of the key features of Falcon is native ONNX support. ONNX (Open Neural Network Exchange) is an open standard for representing machine learning models. This means that once the model is exported to ONNX, it can be run on any platform with an available ONNX implementation. For example, Microsoft ONNX Runtime (ORT) is available for Python, C, C++, Java, JavaScript and multiple other languages, which allows the model to run virtually everywhere. There are also alternative implementations, but there is a high chance they do not support all the required operators.

To simplify the interaction with ONNX Runtime, Falcon provides a run_model function that takes the path to the ONNX model and the input data as a numpy array, and returns the predictions.

from falcon import run_model
import pandas as pd

unlabeled_data = pd.read_csv('unlabeled_data.csv').to_numpy() # ONLY NUMPY ARRAYS ARE ACCEPTED AS INPUT !!!

predictions = run_model(model_path = "/path/to/model.onnx", X = unlabeled_data)

print(predictions)
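
Since the input structure is not validated at runtime, it may help to inspect what an exported model actually declares before feeding it data. The following is a minimal sketch that uses ONNX Runtime directly rather than Falcon. The describe_onnx_inputs helper is a hypothetical name, and the sketch assumes the onnxruntime package is installed.

```python
def describe_onnx_inputs(model_path):
    """Return (name, shape, type) for every input declared by an ONNX model."""
    # Imported lazily so this module can be loaded without ONNX Runtime installed.
    import onnxruntime as ort

    session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
    # Each entry describes one graph input: its name, declared shape and element type.
    return [(node.name, node.shape, node.type) for node in session.get_inputs()]
```

Calling describe_onnx_inputs("/path/to/model.onnx") returns one tuple per declared input. Comparing that list with the columns of the array about to be passed to run_model is a quick sanity check.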

Below is a complete example of model training and inference using the built-in datasets.

############################################ training.py ###########################################################
from falcon import AutoML
from falcon.datasets import load_churn_dataset

df = load_churn_dataset(mode = "training")
AutoML(task = "tabular_classification", train_data = df)
# the ONNX model name is printed once training is done; use it instead of <FILENAME> during inference

############################################ inference.py ##########################################################
from falcon import run_model
from falcon.datasets import load_churn_dataset

X = load_churn_dataset(mode = "inference") # for this example we are reusing training dataset but without labels
predictions = run_model(model_path = "<FILENAME>.onnx", X = X)
print(predictions)
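
When the printed model name is not at hand, one option is to look up the most recently written model file. The sketch below assumes that the model is saved with an .onnx extension in a known directory (for example the working directory of the training script). That location is an assumption, and latest_onnx_model is a hypothetical helper, not part of Falcon.

```python
from pathlib import Path


def latest_onnx_model(directory="."):
    """Return the most recently modified .onnx file in `directory`."""
    # Sort candidates by modification time, oldest first.
    candidates = sorted(
        Path(directory).glob("*.onnx"), key=lambda path: path.stat().st_mtime
    )
    if not candidates:
        raise FileNotFoundError(
            f"no .onnx files found in {Path(directory).resolve()}"
        )
    return candidates[-1]
```

The returned path can be converted with str() and passed as model_path to run_model in place of the placeholder file name.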

Manually selecting a configuration

All of the examples in the previous sections demonstrated how to train Falcon models using the default configuration. However, several configurations are available, and it is easy to switch between them by providing a single additional argument.

For the tabular classification task, Falcon uses a SuperLearner by default, and the sub-configuration (e.g. the list of base estimators) is chosen automatically based on the dataset size. To request the 'mini' sub-configuration of the learner explicitly, add config = 'SuperLearner.mini'.

AutoML(task = "tabular_classification", train_data = df, config = 'SuperLearner.mini') # SuperLearner.mini config is used

Similarly, instead of SuperLearner, which builds a stacking ensemble of base estimators, it is possible to use OptunaLearner, which uses a single model and performs hyperparameter optimization using the Optuna framework.

AutoML(task = "tabular_classification", train_data = df, config = 'OptunaLearner') # OptunaLearner config is used

All the available configurations can be found here.