TabularTaskManager

class falcon.tabular.TabularTaskManager(task: str, data: Union[str, ndarray[Any, dtype[ScalarType]], DataFrame, Tuple], pipeline: Optional[Type[Pipeline]] = None, pipeline_options: Optional[Dict] = None, extra_pipeline_options: Optional[Dict] = None, features: Optional[Union[List[str], List[int]]] = None, target: Optional[Union[str, int]] = None, **options: Any)

Default task manager for tabular data.

__init__(task: str, data: Union[str, ndarray[Any, dtype[ScalarType]], DataFrame, Tuple], pipeline: Optional[Type[Pipeline]] = None, pipeline_options: Optional[Dict] = None, extra_pipeline_options: Optional[Dict] = None, features: Optional[Union[List[str], List[int]]] = None, target: Optional[Union[str, int]] = None, **options: Any) → None
Parameters
  • task (str) – tabular_classification or tabular_regression

  • data (Union[str, npt.NDArray, pd.DataFrame, Tuple]) – path to a data file, pandas dataframe, numpy array, or (X, y) tuple

  • pipeline (Optional[Type[Pipeline]]) – class to be used as the pipeline, by default None. If None, SimpleTabularPipeline will be used

  • pipeline_options (Optional[Dict], optional) – arguments to be passed to the pipeline, by default None. These options overwrite the ones from the default_pipeline_options attribute.

  • extra_pipeline_options (Optional[Dict], optional) – arguments to be passed to the pipeline, by default None. These options are passed in addition to the ones from the default_pipeline_options attribute. This argument is ignored if pipeline_options is not None

  • features (Optional[ft.ColumnsList], optional) – names or indices of the columns to be used as features, by default None. If None, all columns except the last one are used. If the target argument is not None, features must be passed explicitly as well

  • target (Optional[Union[str, int]], optional) – name or index of the column to be used as the target, by default None. If None, the last column is used as the target. If the features argument is not None, target must be specified explicitly as well
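
A minimal usage sketch; the file path and column names below are placeholders for illustration:

    from falcon.tabular import TabularTaskManager

    # "train.csv" and the column names are hypothetical.
    manager = TabularTaskManager(
        task="tabular_classification",
        data="train.csv",
        features=["age", "income", "plan"],
        target="churn",
    )
    manager.train()  # fit the pipeline before predicting or evaluating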

_create_pipeline(pipeline: Optional[Type[Pipeline]], options: Optional[Dict]) → None

Initializes the pipeline.

Parameters
  • pipeline (Optional[Type[Pipeline]]) – pipeline class

  • options (Optional[Dict]) – pipeline options

_prepare_data(data: Union[str, ndarray[Any, dtype[ScalarType]], DataFrame, Tuple], training: bool = True) → Tuple[ndarray[Any, dtype[ScalarType]], ndarray[Any, dtype[ScalarType]], List[ColumnTypes]]

Initial data preparation:
  1. optional: read the data from the specified location;
  2. split into features and targets (by default, the last column is assumed to be the target);
  3. clean the data;
  4. determine numerical and categorical features (create a categorical mask).

Parameters

data (Union[str, npt.NDArray, pd.DataFrame, Tuple]) – path to a data file, pandas dataframe, numpy array, or (X, y) tuple

Returns

tuple of features, target, and a type mask for the features

Return type

Tuple[npt.NDArray, npt.NDArray, List[ColumnTypes]]
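
For illustration only (the method is internal and normally invoked by the manager itself), the returned tuple can be unpacked as follows; "train.csv" is a placeholder path and `manager` is the instance from the constructor example above:

    X, y, mask = manager._prepare_data("train.csv", training=True)
    print(X.shape, y.shape)  # prepared feature matrix and target vector
    print(mask)              # per-column ColumnTypes (numerical vs. categorical)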

property default_pipeline: Type[Pipeline]

Default pipeline class.

property default_pipeline_options: Dict

Default options for pipeline.
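
These two properties can be inspected to see what the manager falls back on when no pipeline or options are passed explicitly; a short sketch, continuing the constructor example above:

    print(manager.default_pipeline)          # SimpleTabularPipeline unless overridden
    print(manager.default_pipeline_options)  # dict of default pipeline arguments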

evaluate(test_data: Union[str, ndarray[Any, dtype[ScalarType]], DataFrame, Tuple], silent: bool = False) → Dict

Performs the evaluation on the given dataset and prints a report.

Parameters
  • test_data (Union[str, npt.NDArray, pd.DataFrame, Tuple]) – dataset to be used for evaluation

  • silent (bool) – controls whether the metrics are printed on screen, by default False
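
A short sketch, continuing the constructor example above; "test.csv" is a placeholder path:

    metrics = manager.evaluate("test.csv")                # prints the report
    metrics = manager.evaluate("test.csv", silent=True)   # returns the metrics without printing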

performance_summary(test_data: Optional[Union[str, ndarray[Any, dtype[ScalarType]], DataFrame, Tuple]]) → dict

Prints a performance summary of the model. The summary always includes metrics calculated on the train set. If a train/eval split was made during training, the summary also includes metrics calculated on the eval set. If a test set is provided as an argument, the summary additionally includes metrics calculated on the test set.

Parameters

test_data (Optional[Union[str, npt.NDArray, pd.DataFrame, Tuple]]) – data to be used as the test set, by default None

Returns

metrics for each subset

Return type

dict
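
A short sketch, continuing the constructor example above; "test.csv" is a placeholder path, and the returned dict maps each subset to its metrics as described under Returns:

    summary = manager.performance_summary("test.csv")
    for subset, metrics in summary.items():
        print(subset, metrics)  # e.g. train / eval / test metrics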

predict(data: Union[str, ndarray[Any, dtype[ScalarType]], DataFrame]) → ndarray[Any, dtype[ScalarType]]

Performs prediction on new data.

Parameters

data (Union[str, npt.NDArray, pd.DataFrame]) – path to a data file, pandas dataframe, or numpy array

Returns

predictions

Return type

npt.NDArray
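
A short sketch, continuing the constructor example above; "new_data.csv" is a placeholder path:

    preds = manager.predict("new_data.csv")
    print(preds[:5])  # first few predictions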

predict_stored_subset(subset: str = 'train') → ndarray[Any, dtype[ScalarType]]

Makes a prediction on a stored subset (train or eval).

Parameters

subset (str, optional) – subset to predict on (train or eval), by default ‘train’

Returns

predicted values

Return type

npt.NDArray
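
A short sketch, continuing the constructor example above and assuming an eval subset was allocated during training:

    train_preds = manager.predict_stored_subset("train")
    eval_preds = manager.predict_stored_subset("eval")  # requires an eval split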

save_model(filename: Optional[str] = None, **kwargs: Any) → ModelProto

Serializes and saves the model.

Parameters

filename (Optional[str], optional) – filename for the model file, by default None. If no filename is specified, the model is not saved to disk and is only returned

Returns

ONNX ModelProto of the model

Return type

ModelProto
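
A short sketch, continuing the constructor example above; "model.onnx" is a placeholder filename:

    proto = manager.save_model("model.onnx")  # serializes and writes the model to disk
    proto = manager.save_model()              # returned only, nothing written to disk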

train(make_eval_subset: bool = True, pre_eval: bool = False, **kwargs: Any) → TabularTaskManager

Invokes the training procedure of the underlying pipeline and prints the expected model performance, if available.

Parameters
  • make_eval_subset (bool) – controls whether a dedicated eval set should be allocated for the performance report, by default True. If True, the value of pre_eval is overwritten to False

  • pre_eval (bool) – if True, first estimates model performance via 10-fold CV for small datasets or a 25% test split for large datasets, by default False. Setting pre_eval = True is not recommended: it pre-evaluates the pipeline as a whole, which contains many random elements, so the results might not be reproducible

Returns

self

Return type

TabularTaskManager
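
A short end-to-end sketch tying the methods above together; the file paths are placeholders:

    manager = TabularTaskManager(task="tabular_regression", data="train.csv")
    manager.train(make_eval_subset=True)      # fits the pipeline, reports on the eval split
    manager.performance_summary("test.csv")   # train / eval / test metrics
    manager.save_model("model.onnx")          # export as ONNX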