TabularTaskManager
- class falcon.tabular.TabularTaskManager(task: str, data: Union[str, ndarray[Any, dtype[ScalarType]], DataFrame, Tuple], pipeline: Optional[Type[Pipeline]] = None, pipeline_options: Optional[Dict] = None, extra_pipeline_options: Optional[Dict] = None, features: Optional[Union[List[str], List[int]]] = None, target: Optional[Union[str, int]] = None, **options: Any)
Default task manager for tabular data.
- __init__(task: str, data: Union[str, ndarray[Any, dtype[ScalarType]], DataFrame, Tuple], pipeline: Optional[Type[Pipeline]] = None, pipeline_options: Optional[Dict] = None, extra_pipeline_options: Optional[Dict] = None, features: Optional[Union[List[str], List[int]]] = None, target: Optional[Union[str, int]] = None, **options: Any) None
- Parameters
task (str) – tabular_classification or tabular_regression
data (Union[str, npt.NDArray, pd.DataFrame, Tuple]) – path to data file or pandas dataframe or numpy array or tuple (X,y)
pipeline (Optional[Type[Pipeline]]) – class to be used as pipeline, by default None. If None, SimpleTabularPipeline will be used
pipeline_options (Optional[Dict], optional) – arguments to be passed to the pipeline, by default None. These options will overwrite the ones from default_pipeline_options attribute.
extra_pipeline_options (Optional[Dict], optional) – arguments to be passed to the pipeline, by default None. These options will be passed in addition to the ones from default_pipeline_options attribute. This argument is ignored if pipeline_options is not None
features (Optional[ft.ColumnsList], optional) – names or indices of columns to be used as features, by default None. If None, all columns except the last one will be used. If target argument is not None, features should be passed explicitly as well
target (Optional[Union[str, int]], optional) – name or index of column to be used as target, by default None. If None, the last column will be used as target. If features argument is not None, target should be specified explicitly as well
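The default feature/target resolution described above can be sketched in plain Python (a simplified illustration, not falcon's internal code; the helper name is hypothetical):

```python
def resolve_features_target(columns, features=None, target=None):
    # Documented default: all columns except the last are features,
    # and the last column is the target.
    if features is None and target is None:
        return list(columns[:-1]), columns[-1]
    # The docs require that features and target are given together.
    if features is None or target is None:
        raise ValueError("features and target must be specified together")
    return features, target

# With no arguments, "label" (the last column) becomes the target.
X_cols, y_col = resolve_features_target(["age", "income", "label"])
```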
- _create_pipeline(pipeline: Optional[Type[Pipeline]], options: Optional[Dict]) None
Initializes the pipeline.
- Parameters
pipeline (Optional[Type[Pipeline]]) – pipeline class
options (Optional[Dict]) – pipeline options
- _prepare_data(data: Union[str, ndarray[Any, dtype[ScalarType]], DataFrame, Tuple], training: bool = True) Tuple[ndarray[Any, dtype[ScalarType]], ndarray[Any, dtype[ScalarType]], List[ColumnTypes]]
Initial data preparation: 1) optional: read data from the specified location; 2) split into features and targets. By default it is assumed that the last column is the target; 3) clean data; 4) determine numerical and categorical features (create categorical mask).
- Parameters
data (Union[str, npt.NDArray, pd.DataFrame, Tuple]) – path to data file or pandas dataframe or numpy array or Tuple(X,y)
- Returns
tuple of features, target and type mask for features
- Return type
Tuple[npt.NDArray, npt.NDArray, List[ColumnTypes]]
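Step 4 of _prepare_data (building the categorical mask) can be approximated with pandas dtype inspection. This is a hedged sketch of the idea, not the actual ColumnTypes detection logic:

```python
import pandas as pd

def categorical_mask(df: pd.DataFrame) -> list:
    # True where a column is treated as categorical (non-numeric dtype),
    # False where it is numerical; a simplified stand-in for the
    # List[ColumnTypes] returned by _prepare_data.
    return [not pd.api.types.is_numeric_dtype(df[col]) for col in df.columns]

df = pd.DataFrame({"age": [21, 35], "city": ["NY", "LA"]})
mask = categorical_mask(df)  # "age" is numerical, "city" is categorical
```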
- property default_pipeline_options: Dict
Default options for pipeline.
- evaluate(test_data: Union[str, ndarray[Any, dtype[ScalarType]], DataFrame, Tuple], silent: bool = False) Dict
Performs the evaluation on the given dataset and prints a report.
- Parameters
test_data (Union[str, npt.NDArray, pd.DataFrame, Tuple]) – dataset to be used for evaluation
silent (bool) – controls whether the metrics are printed on screen, by default False
- performance_summary(test_data: Optional[Union[str, ndarray[Any, dtype[ScalarType]], DataFrame, Tuple]]) dict
Prints a performance summary of the model. The summary always includes metrics calculated on the train set. If a train/eval split was made during training, the summary also includes metrics calculated on the eval set. If a test set is provided as an argument, metrics calculated on the test set are included as well.
- Parameters
test_data (Optional[Union[str, npt.NDArray, pd.DataFrame, Tuple]]) – data to be used as test set, by default None
- Returns
metrics for each subset
- Return type
dict
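The per-subset structure of the returned dict can be mimicked with a small helper (all names here are hypothetical, and accuracy stands in for whichever task metrics falcon computes):

```python
def accuracy(y_true, y_pred):
    # Fraction of matching predictions; an example metric only.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def summary_for(splits):
    # splits maps a subset name ("train", "eval", "test") to
    # (y_true, y_pred); mirrors the documented behaviour of reporting
    # metrics for every subset that is actually available.
    return {name: {"accuracy": accuracy(yt, yp)} for name, (yt, yp) in splits.items()}
```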
- predict(data: Union[str, ndarray[Any, dtype[ScalarType]], DataFrame]) ndarray[Any, dtype[ScalarType]]
Performs prediction on new data.
- Parameters
data (Union[str, npt.NDArray, pd.DataFrame]) – path to data file or pandas dataframe or numpy array
- Returns
predictions
- Return type
npt.NDArray
- predict_stored_subset(subset: str = 'train') ndarray[Any, dtype[ScalarType]]
Makes a prediction on a stored subset (train or eval).
- Parameters
subset (str, optional) – subset to predict on (train or eval), by default ‘train’
- Returns
predicted values
- Return type
npt.NDArray
- save_model(filename: Optional[str] = None, **kwargs: Any) ModelProto
Serializes and saves the model.
- Parameters
filename (Optional[str], optional) – filename for the model file, by default None. If filename is not specified, the model is not saved on disk and only returned as a bytes object
- Returns
ONNX ModelProto of the model
- Return type
ModelProto
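The filename behaviour documented above can be illustrated as follows (a sketch of the documented contract only; the serialized payload is a placeholder, not a real ONNX model):

```python
def save_model_sketch(serialized_model: bytes, filename=None):
    # When filename is None the model is only returned; otherwise it is
    # additionally written to disk, matching the documented behaviour.
    if filename is not None:
        with open(filename, "wb") as f:
            f.write(serialized_model)
    return serialized_model
```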
- train(make_eval_subset: bool = True, pre_eval: bool = False, **kwargs: Any) TabularTaskManager
Invokes the training procedure of the underlying pipeline and prints the expected model performance if available.
- Parameters
pre_eval (bool) – if True, model performance is first estimated via 10-fold cross-validation for small datasets or a 25% test split for large datasets, by default False. Setting pre_eval = True is not recommended: it pre-evaluates the pipeline as a whole, which contains many random elements, so the results might not be reproducible
make_eval_subset (bool) – controls whether a dedicated eval set should be allocated for the performance report, by default True. If True, pre_eval is forced to False
- Returns
self
- Return type
TabularTaskManager
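The dataset-size-dependent pre-evaluation scheme described for pre_eval can be sketched as below; note the size threshold is a hypothetical value chosen for illustration, not falcon's actual cutoff:

```python
def pre_eval_strategy(n_rows: int, small_dataset_threshold: int = 10_000) -> dict:
    # Small datasets: 10-fold cross-validation; large datasets: a 25%
    # holdout split, as described in the pre_eval documentation.
    # The threshold itself is an assumption.
    if n_rows < small_dataset_threshold:
        return {"method": "cv", "folds": 10}
    return {"method": "holdout", "test_fraction": 0.25}
```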