TabularTaskManager
- class falcon.tabular.TabularTaskManager(task: str, data: Union[str, ndarray[Any, dtype[ScalarType]], DataFrame, Tuple], pipeline: Optional[Type[Pipeline]] = None, pipeline_options: Optional[Dict] = None, extra_pipeline_options: Optional[Dict] = None, features: Optional[Union[List[str], List[int]]] = None, target: Optional[Union[str, int]] = None, **options: Any)
Default task manager for tabular data.
- __init__(task: str, data: Union[str, ndarray[Any, dtype[ScalarType]], DataFrame, Tuple], pipeline: Optional[Type[Pipeline]] = None, pipeline_options: Optional[Dict] = None, extra_pipeline_options: Optional[Dict] = None, features: Optional[Union[List[str], List[int]]] = None, target: Optional[Union[str, int]] = None, **options: Any) None
- Parameters
task (str) – tabular_classification or tabular_regression
data (Union[str, npt.NDArray, pd.DataFrame, Tuple]) – path to data file or pandas dataframe or numpy array or tuple (X,y)
pipeline (Optional[Type[Pipeline]]) – class to be used as pipeline, by default None. If None, SimpleTabularPipeline will be used
pipeline_options (Optional[Dict], optional) – arguments to be passed to the pipeline, by default None. These options will overwrite the ones from default_pipeline_options attribute.
extra_pipeline_options (Optional[Dict], optional) – arguments to be passed to the pipeline, by default None. These options will be passed in addition to the ones from default_pipeline_options attribute. This argument is ignored if pipeline_options is not None
features (Optional[ft.ColumnsList], optional) – names or indices of columns to be used as features, by default None. If None, all columns except the last one will be used. If target argument is not None, features should be passed explicitly as well
target (Optional[Union[str, int]], optional) – name or index of column to be used as target, by default None. If None, the last column will be used as target. If features argument is not None, target should be specified explicitly as well
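The default feature/target resolution described above can be sketched in plain Python (a simplified illustration, not falcon's internal code; the helper name is hypothetical):

```python
def resolve_features_target(columns, features=None, target=None):
    # Documented default: all columns except the last are features,
    # and the last column is the target.
    if features is None and target is None:
        return list(columns[:-1]), columns[-1]
    # The docs require that features and target are given together.
    if features is None or target is None:
        raise ValueError("features and target must be specified together")
    return features, target

# With no arguments, "label" (the last column) becomes the target.
X_cols, y_col = resolve_features_target(["age", "income", "label"])
```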
- _create_pipeline(pipeline: Optional[Type[Pipeline]], options: Optional[Dict]) None
Initializes the pipeline.
- Parameters
pipeline (Optional[Type[Pipeline]]) – pipeline class
options (Optional[Dict]) – pipeline options
- _prepare_data(data: Union[str, ndarray[Any, dtype[ScalarType]], DataFrame, Tuple], training: bool = True) Tuple[ndarray[Any, dtype[ScalarType]], ndarray[Any, dtype[ScalarType]], List[ColumnTypes]]
Initial data preparation: 1) optional: read data from the specified location; 2) split into features and targets. By default it is assumed that the last column is the target; 3) clean data; 4) determine numerical and categorical features (create categorical mask).
- Parameters
data (Union[str, npt.NDArray, pd.DataFrame, Tuple]) – path to data file or pandas dataframe or numpy array or Tuple(X,y)
- Returns
tuple of features, target and type mask for features
- Return type
Tuple[npt.NDArray, npt.NDArray, List[ColumnTypes]]
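Step 4 of _prepare_data (building the categorical mask) can be approximated with pandas dtype inspection. This is a hedged sketch of the idea, not the actual ColumnTypes detection logic:

```python
import pandas as pd

def categorical_mask(df: pd.DataFrame) -> list:
    # True where a column is treated as categorical (non-numeric dtype),
    # False where it is numerical; a simplified stand-in for the
    # List[ColumnTypes] returned by _prepare_data.
    return [not pd.api.types.is_numeric_dtype(df[col]) for col in df.columns]

df = pd.DataFrame({"age": [21, 35], "city": ["NY", "LA"]})
mask = categorical_mask(df)  # "age" is numerical, "city" is categorical
```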
- property default_pipeline_options: Dict
Default options for pipeline.
- evaluate(test_data: Union[str, ndarray[Any, dtype[ScalarType]], DataFrame, Tuple], silent: bool = False) Dict
Performs the evaluation on the given dataset and prints a report.
- Parameters
test_data (Union[str, npt.NDArray, pd.DataFrame, Tuple]) – dataset to be used for evaluation
silent (bool) – controls whether the metrics are printed on screen, by default False
- performance_summary(test_data: Optional[Union[str, ndarray[Any, dtype[ScalarType]], DataFrame, Tuple]]) dict
Prints a performance summary of the model. The summary always includes metrics calculated on the train set. If a train/eval split was made during training, the summary also includes metrics calculated on the eval set. If a test set is provided as an argument, metrics calculated on the test set are included as well.
- Parameters
test_data (Optional[Union[str, npt.NDArray, pd.DataFrame, Tuple]]) – data to be used as test set, by default None
- Returns
metrics for each subset
- Return type
dict
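The per-subset structure of the returned dict can be mimicked with a small helper (all names here are hypothetical, and accuracy stands in for whichever task metrics falcon computes):

```python
def accuracy(y_true, y_pred):
    # Fraction of matching predictions; an example metric only.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def summary_for(splits):
    # splits maps a subset name ("train", "eval", "test") to
    # (y_true, y_pred); mirrors the documented behaviour of reporting
    # metrics for every subset that is actually available.
    return {name: {"accuracy": accuracy(yt, yp)} for name, (yt, yp) in splits.items()}
```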
- predict(data: Union[str, ndarray[Any, dtype[ScalarType]], DataFrame]) ndarray[Any, dtype[ScalarType]]
Performs prediction on new data.
- Parameters
data (Union[str, npt.NDArray, pd.DataFrame]) – path to data file or pandas dataframe or numpy array
- Returns
predictions
- Return type
npt.NDArray
- predict_stored_subset(subset: str = 'train') ndarray[Any, dtype[ScalarType]]
Makes a prediction on a stored subset (train or eval).
- Parameters
subset (str, optional) – subset to predict on (train or eval), by default ‘train’
- Returns
predicted values
- Return type
npt.NDArray
- save_model(filename: Optional[str] = None, **kwargs: Any) ModelProto
Serializes and saves the model.
- Parameters
filename (Optional[str], optional) – filename for the model file, by default None. If filename is not specified, the model is not saved on disk and only returned as a bytes object
- Returns
ONNX ModelProto of the model
- Return type
ModelProto
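The filename behaviour documented above can be illustrated as follows (a sketch of the documented contract only; the serialized payload is a placeholder, not a real ONNX model):

```python
def save_model_sketch(serialized_model: bytes, filename=None):
    # When filename is None the model is only returned; otherwise it is
    # additionally written to disk, matching the documented behaviour.
    if filename is not None:
        with open(filename, "wb") as f:
            f.write(serialized_model)
    return serialized_model
```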
- train(make_eval_subset: bool = True, pre_eval: bool = False, **kwargs: Any) TabularTaskManager
Invokes the training procedure of the underlying pipeline and prints the expected model performance if available.
- Parameters
pre_eval (bool) – if True, model performance is first estimated via 10-fold cross-validation for small datasets or a 25% test split for large datasets, by default False. Setting pre_eval = True is not recommended: it pre-evaluates the pipeline as a whole, which contains many random elements, so the results might not be reproducible
make_eval_subset (bool) – controls whether a dedicated eval set should be allocated for the performance report, by default True. If True, pre_eval is forced to False
- Returns
self
- Return type
TabularTaskManager
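The dataset-size-dependent pre-evaluation scheme described for pre_eval can be sketched as below; note the size threshold is a hypothetical value chosen for illustration, not falcon's actual cutoff:

```python
def pre_eval_strategy(n_rows: int, small_dataset_threshold: int = 10_000) -> dict:
    # Small datasets: 10-fold cross-validation; large datasets: a 25%
    # holdout split, as described in the pre_eval documentation.
    # The threshold itself is an assumption.
    if n_rows < small_dataset_threshold:
        return {"method": "cv", "folds": 10}
    return {"method": "holdout", "test_fraction": 0.25}
```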