Getting started
Train a powerful Machine Learning model in a single line of code with Falcon!
Falcon is a simple and lightweight AutoML library designed for people who want to train a model on a custom dataset in an instant even without specific data-science knowledge. Simply give Falcon your dataset and specify which feature you want the ML model to predict. Falcon will do the rest!
Falcon allows the trained models to be immediately used in production by saving them in the widely used ONNX format. No need to write custom code to save complicated models to ONNX anymore!
Installation
Stable release from PyPI
pip install falcon-ml
Latest version from GitHub
pip install git+https://github.com/OKUA1/falcon
Installing some of the dependencies on Apple Silicon Macs might not work. The workaround is to create an x86 environment using Conda:
conda create -n falcon_env
conda activate falcon_env
conda config --env --set subdir osx-64
conda install python=3.9
pip3 install falcon-ml
Usage
Currently, Falcon supports only tabular datasets and two tasks: 'tabular_classification' and 'tabular_regression'.
The easiest way to use the library is through the highest-level API, as shown below:
from falcon import AutoML
AutoML(task = 'tabular_classification', train_data = 'titanic.csv')
This single line of code will read and prepare the dataset, scale/encode the features, encode the labels, train the model, and save it as an ONNX file for future inference.
Additionally, it is possible to explicitly specify the feature/target columns (otherwise the last column is used as the target and all others as features) and the test data used for the evaluation report (otherwise 25% of the training set is held out).
from falcon import AutoML
manager = AutoML(task = 'tabular_classification', train_data = 'titanic.csv', test_data = 'titanic_test.csv', features = ['sex', 'gender', 'class', 'age'], target = 'survived')
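The defaults described above (last column as target, 25% of the rows held out for evaluation) can be illustrated with a small standalone sketch. This is not Falcon's implementation: the helper name split_with_defaults, the shuffling step, and the fixed seed are assumptions made purely for illustration.

```python
import random


def split_with_defaults(header, rows, test_fraction=0.25, seed=0):
    """Hypothetical helper mimicking the documented defaults (not Falcon's code)."""
    # Default column roles: the last column is the target, the rest are features.
    features, target = header[:-1], header[-1]

    # Shuffle a copy so the caller's data is left untouched (assumed behavior).
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)

    # Hold out `test_fraction` of the rows for the evaluation report.
    n_test = int(len(shuffled) * test_fraction)
    test, train = shuffled[:n_test], shuffled[n_test:]
    return features, target, train, test


header = ["sex", "class", "age", "survived"]
rows = [
    ["male", 3, 22, 0],
    ["female", 1, 38, 1],
    ["female", 3, 26, 1],
    ["male", 1, 54, 0],
] * 2  # 8 toy rows

features, target, train, test = split_with_defaults(header, rows)
print(features, target, len(train), len(test))
```

Passing features, target, or test_data explicitly, as in the call above, replaces these defaults.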
It is also possible to provide train/test data as a pandas dataframe, a numpy array, or a tuple containing X and y. To do so, simply pass the required object as an argument. This can be useful when custom data preparation is needed or when the data comes from a non-conventional source.
from falcon import AutoML
import pandas as pd
df = pd.read_csv('titanic.csv')
X_test = pd.read_csv('X_test.csv')
y_test = pd.read_csv('y_test.csv')
manager = AutoML(task = 'tabular_classification', train_data = df, test_data = (X_test, y_test), features = ['sex', 'gender', 'class', 'age'], target = 'survived')
While the AutoML function enables extremely fast experimentation, it does not provide enough control over the training steps and might not be flexible enough for more advanced users. As an alternative, it is possible to use the relevant TaskManager class, either directly or through the initialize helper function.
from falcon import initialize
import pandas as pd
test_df = pd.read_csv('titanic_test.csv')
manager = initialize(task='tabular_classification', data='titanic.csv')
manager.train(make_eval_subset = True)
manager.performance_summary(test_df)
When using the initialize function, it is also possible to provide a custom configuration or even a custom pipeline. For more details, please check the API reference section.
Demo datasets
You can try out Falcon using one of the built-in demo datasets.
from falcon import AutoML
from falcon.datasets import load_churn_dataset, load_insurance_dataset # churn -> classification; insurance -> regression
df = load_churn_dataset()
AutoML(task = 'tabular_classification', train_data = df)
Making predictions with trained models
There are two ways to make a prediction using a trained model. If the input/unlabeled data is available right away, the same manager object that was used for training the model can be used. Note that the input data must have the same structure as the training set (the same number, order, and type of features). The model assumes this, but it is not explicitly checked at runtime. The recommended approach is to provide the data as a numpy array.
from falcon import AutoML
import pandas as pd
df = pd.read_csv('training_data.csv')
manager = AutoML(task = 'tabular_classification', train_data = df)
unlabeled_data = pd.read_csv('unlabeled_data.csv').to_numpy()
predictions = manager.predict(unlabeled_data)
print(predictions)
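Because the structure of the input is assumed rather than checked, it can be worth validating it yourself before calling predict. The sketch below shows one possible approach. check_same_structure is a hypothetical helper, not part of Falcon, and describing each feature as a (name, type) pair is an assumption made for illustration.

```python
def check_same_structure(train_schema, new_schema):
    """Raise ValueError if two schemas differ in number, order, or type of features.

    Each schema is a list of (column_name, type_name) pairs.
    Hypothetical helper, not part of Falcon.
    """
    if len(train_schema) != len(new_schema):
        raise ValueError(
            f"expected {len(train_schema)} features, got {len(new_schema)}"
        )
    # Comparing position by position catches both reordered and retyped columns.
    for position, (expected, actual) in enumerate(zip(train_schema, new_schema)):
        if expected != actual:
            raise ValueError(
                f"feature {position}: expected {expected}, got {actual}"
            )


train_schema = [("sex", "str"), ("class", "int"), ("age", "float")]

# Same columns, same order, same types: passes silently.
check_same_structure(train_schema, [("sex", "str"), ("class", "int"), ("age", "float")])

# Two columns swapped: the mismatch is reported before any prediction is made.
try:
    check_same_structure(train_schema, [("class", "int"), ("sex", "str"), ("age", "float")])
except ValueError as error:
    print(error)
```

With pandas, such a schema could be built from the feature columns of a dataframe, for example with list(zip(df.columns, df.dtypes.astype(str))), before the data is converted to a numpy array.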
While this solution is straightforward, in real-world applications the new/unlabeled data is not always available right away. Therefore, it is desirable to train a model once and reuse it in the future.
One of the key features of Falcon is native ONNX support. ONNX (Open Neural Network Exchange) is an open standard for representing machine learning algorithms. This means that once the model is exported to ONNX, it can be run on any platform for which an ONNX implementation is available. For example, Microsoft ONNX Runtime (ORT) is available for Python, C, C++, Java, JavaScript, and multiple other languages, which allows the model to run virtually everywhere. There are also alternative implementations, but there is a high chance they do not support all the required operators.
To simplify the interaction with ONNX Runtime, Falcon provides a run_model function that takes the path to the ONNX model and the input data as a numpy array, and returns the predictions.
from falcon import run_model
import pandas as pd
unlabeled_data = pd.read_csv('unlabeled_data.csv').to_numpy() # only numpy arrays are accepted as input
predictions = run_model(model_path = "/path/to/model.onnx", X = unlabeled_data)
print(predictions)
Below is the complete example of model training and inference using the built-in datasets.
############################################ training.py ###########################################################
from falcon import AutoML
from falcon.datasets import load_churn_dataset
df = load_churn_dataset(mode = "training")
AutoML(task = "tabular_classification", train_data = df)
# the ONNX model name will be printed after the training is done; use it instead of <FILENAME> during inference
############################################ inference.py ##########################################################
from falcon import run_model
from falcon.datasets import load_churn_dataset
X = load_churn_dataset(mode = "inference") # for this example we are reusing training dataset but without labels
predictions = run_model(model_path = "<FILENAME>.onnx", X = X)
print(predictions)
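If the printed model name is no longer at hand, the most recently written model file can be looked up instead. The sketch below is a convenience under stated assumptions, not part of Falcon: newest_onnx_file is a hypothetical helper, and it assumes that the model was saved with an .onnx extension in a directory you know (the current working directory by default).

```python
from pathlib import Path


def newest_onnx_file(directory="."):
    """Return the most recently modified .onnx file in `directory`, or None."""
    candidates = sorted(
        Path(directory).glob("*.onnx"),
        key=lambda path: path.stat().st_mtime,  # oldest first, newest last
    )
    return candidates[-1] if candidates else None


model_path = newest_onnx_file()
if model_path is None:
    print("no .onnx file found; run training.py first")
else:
    print(f"use model_path = {str(model_path)!r} in run_model")
```

The returned path can be converted with str(...) and passed as model_path to run_model in place of "<FILENAME>.onnx".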
Manually selecting a configuration
All of the examples in the previous sections demonstrated how to train Falcon models using the default configuration. However, several configurations are available, and it is easy to switch between them by providing a single additional argument.
For the tabular classification task, Falcon uses a SuperLearner by default, and the sub-configuration (e.g. the list of base estimators) is chosen automatically based on the dataset size. To request the 'mini' sub-configuration of the learner explicitly, add config = 'SuperLearner.mini'.
AutoML(task = "tabular_classification", train_data = df, config = 'SuperLearner.mini') # SuperLearner.mini config is used
Similarly, instead of SuperLearner, which builds a stacking ensemble of base estimators, it is possible to use OptunaLearner, which uses a single model and performs hyperparameter optimization using the Optuna framework.
AutoML(task = "tabular_classification", train_data = df, config = 'OptunaLearner') # OptunaLearner config is used
All the available configurations can be found here.