metacluster.utils package¶

metacluster.utils.cluster module¶

metacluster.utils.cluster.compute_Wk(data: ndarray, classification_result: ndarray)[source]¶

Computes Wk, the within-cluster dispersion, for a given clustering result.

Parameters

data (np.ndarray) – Array containing all the data.
classification_result (np.ndarray) – Array containing the clustering result for every data point.

Returns

Wk

Return type

float
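As a rough illustration of what such a dispersion value looks like, the sketch below sums the squared distances from each point to the mean of its cluster, which is the usual simplified form of Wk. The helper name `within_cluster_dispersion` is hypothetical, and this is an assumption about the quantity, not necessarily how compute_Wk is implemented.

```python
import numpy as np


def within_cluster_dispersion(data, labels):
    """Hypothetical helper: sum of squared distances to each cluster mean."""
    data = np.asarray(data, dtype=float)
    labels = np.asarray(labels)
    wk = 0.0
    for label in np.unique(labels):
        members = data[labels == label]      # points assigned to this cluster
        centroid = members.mean(axis=0)      # cluster mean
        wk += np.sum((members - centroid) ** 2)
    return float(wk)


X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 4.0]])
y = np.array([0, 0, 1, 1])
wk_value = within_cluster_dispersion(X, y)
print(wk_value)
```

Tighter clusters give a smaller value, which is why Wk shrinks as the number of clusters grows.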

metacluster.utils.cluster.compute_all_methods(X, list_clusters=None, **kwargs)[source]¶

metacluster.utils.cluster.compute_gap_statistic(X, refs=None, B=10, list_K=None, N_init=10)[source]¶

Computes the gap statistic. The function first generates B reference samples, each the same size as the original dataset, with values drawn from a uniform distribution over the range of each feature of the original dataset. It then uses a simplified formula to compute the D of each cluster and, from that, Wk. The tested K values should form an increasing list; 1 to 10 is usually sufficient. B is the number of reference samples used to run the gap statistic; the recommended value is 10, and it should not be decreased.

Parameters

X (np.ndarray) – The original data.
refs (np.ndarray or None) – Reference data to compare against, if you already have some. If None, the function generates the reference samples automatically.
B (int) – The number of reference samples used to run the gap statistic. The recommended value is 10, and it should not be decreased.
list_K (list) – The K values to test.
N_init (int) – The number of initial starting points for each K-means run in scikit-learn, used to obtain a stable clustering result each time.

Returns

gaps (np.ndarray) – The gap statistic results.
s_k (float) – The baseline value to subtract; see the reference paper for its detailed meaning.
K (list) – All the tested K values.
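The procedure described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the library's code: the function name `gap_statistic_sketch` and the `seed` argument are hypothetical, `KMeans.inertia_` is used as a stand-in for Wk, and s_k is returned once per tested K.

```python
import numpy as np
from sklearn.cluster import KMeans


def gap_statistic_sketch(X, B=10, list_K=None, n_init=10, seed=0):
    """Hypothetical sketch of the gap statistic with uniform reference samples."""
    X = np.asarray(X, dtype=float)
    list_K = list(list_K) if list_K is not None else list(range(1, 11))
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    # B reference samples, same shape as X, uniform over each feature's range
    refs = [rng.uniform(lo, hi, size=X.shape) for _ in range(B)]

    def log_wk(data, k):
        # inertia_ is the sum of squared distances to the closest centroid
        model = KMeans(n_clusters=k, n_init=n_init, random_state=seed).fit(data)
        return np.log(model.inertia_)

    gaps, s_k = [], []
    for k in list_K:
        ref_logs = np.array([log_wk(ref, k) for ref in refs])
        gaps.append(ref_logs.mean() - log_wk(X, k))            # gap for this K
        s_k.append(ref_logs.std() * np.sqrt(1.0 + 1.0 / B))    # spread of the references
    return np.array(gaps), np.array(s_k), list_K


rng = np.random.default_rng(1)
X_demo = np.vstack([rng.normal(0.0, 0.5, size=(40, 2)),
                    rng.normal(6.0, 0.5, size=(40, 2))])
gaps, s_k, tested_K = gap_statistic_sketch(X_demo, list_K=[1, 2, 3, 4])
print(gaps, s_k, tested_K)
```

A larger gap means the real data clusters noticeably better than structureless uniform data at that K.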

metacluster.utils.cluster.get_all_clustering_metrics()[source]¶

metacluster.utils.cluster.get_clusters_all_majority(X, list_clusters=None, **kwargs)[source]¶

metacluster.utils.cluster.get_clusters_all_max(X, list_clusters=None, **kwargs)[source]¶

metacluster.utils.cluster.get_clusters_all_mean(X, list_clusters=None, **kwargs)[source]¶

metacluster.utils.cluster.get_clusters_all_min(X, list_clusters=None, **kwargs)[source]¶

metacluster.utils.cluster.get_clusters_by_bic(X, list_clusters=None, **kwargs)[source]¶

metacluster.utils.cluster.get_clusters_by_calinski_harabasz(X, list_clusters=None, **kwargs)[source]¶

metacluster.utils.cluster.get_clusters_by_davies_bouldin(X, list_clusters=None, **kwargs)[source]¶

metacluster.utils.cluster.get_clusters_by_elbow(X, list_clusters=None, **kwargs)[source]¶

First, apply K-means clustering to the dataset for a range of different values of K, where K is the number of clusters. For example, you might try K=1,2,3,…,10.
For each value of K, compute the Sum of Squared Errors (SSE), which is the sum of the squared distances between each data point and its assigned centroid. The SSE can be obtained from the KMeans object’s inertia_ attribute.
Plot the SSE for each value of K. You should see that the SSE decreases as K increases, because as K increases, the centroids are closer to the data points. However, at some point, increasing K further will not improve the SSE as much. The idea of the elbow method is to identify the value of K at which the SSE starts to level off or decrease less rapidly, forming an “elbow” in the plot. This value of K is considered the optimal number of clusters.
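The steps above can be sketched with scikit-learn as shown below. The synthetic data and the second-difference rule for locating the elbow are assumptions made for illustration; the library may pick the elbow differently.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three well-separated synthetic blobs (illustrative data only)
centers = [(0.0, 0.0), (5.0, 0.0), (0.0, 5.0)]
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in centers])

list_K = list(range(1, 8))
# SSE for each K, read from the fitted model's inertia_ attribute
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
       for k in list_K]

# One simple heuristic: the elbow is where the SSE curve bends the most,
# i.e. where its second difference is largest.
second_diff = np.diff(sse, n=2)
elbow_k = list_K[int(np.argmax(second_diff)) + 1]
print(list(zip(list_K, sse)), elbow_k)
```

Plotting `sse` against `list_K` gives the curve described above; the heuristic only automates reading the bend off that plot.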

metacluster.utils.cluster.get_clusters_by_gap_statistic(X, list_clusters=None, B=10, N_init=10, **kwargs)[source]¶

metacluster.utils.cluster.get_clusters_by_silhouette_score(X, list_clusters=None, **kwargs)[source]¶

metacluster.utils.data_loader module¶

class metacluster.utils.data_loader.Data(X, y=None, name='Unknown')[source]¶

Bases: object

The structure of our supported Data class

Parameters

X (np.ndarray) – The features of your data
y (np.ndarray, Optional, default=None) – The labels of your data. For clustering problems, this can be None.

SUPPORT = {'scaler': ['StandardScaler', 'MinMaxScaler', 'MaxAbsScaler', 'RobustScaler', 'Normalizer']}¶

get_name()[source]¶

static scale(X, method='MinMaxScaler', **kwargs)[source]¶

set_train_test(X_train=None, y_train=None, X_test=None, y_test=None)[source]¶

Use this function to set your own X_train, y_train, X_test, and y_test if you don't want to use our split function.

Parameters

X_train (np.ndarray) –
y_train (np.ndarray) –
X_test (np.ndarray) –
y_test (np.ndarray) –

split_train_test(test_size=0.2, train_size=None, random_state=41, shuffle=True, stratify=None, inplace=True)[source]¶

A wrapper around the train_test_split function in the scikit-learn library.
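For reference, the underlying scikit-learn call with the same default arguments looks roughly like this. The toy arrays are made up for illustration, and how the Data class stores the result (for example via the inplace flag) is not shown here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # 10 samples, 2 features (toy data)
y = np.array([0, 1] * 5)

# Same defaults as the signature above: test_size=0.2, random_state=41, shuffle=True
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, train_size=None, random_state=41,
    shuffle=True, stratify=None,
)
print(X_train.shape, X_test.shape)
```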

metacluster.utils.data_loader.get_dataset(dataset_name)[source]¶

Helper function to retrieve the data

Parameters: dataset_name (str) – Name of the dataset
Returns: data – An instance of the Data class that holds the X and y variables.
Return type: Data

metacluster.utils.encoder module¶

class metacluster.utils.encoder.LabelEncoder[source]¶

Bases: object

Encode categorical features as integer labels.

fit(y)[source]¶

Fit label encoder to a given set of labels.

Parameters: y (array-like) – Labels to encode.

fit_transform(y)[source]¶

Fit label encoder and return encoded labels.

Parameters: y (array-like of shape (n_samples,)) – Target values.
Returns: y – Encoded labels.
Return type: array-like of shape (n_samples,)

inverse_transform(y)[source]¶

Transform integer labels to original labels.

Parameters: y (array-like) – Encoded integer labels.
Returns: original_labels (array-like) – Original labels.

transform(y)[source]¶

Transform labels to encoded integer labels.

Parameters: y (array-like) – Labels to encode.
Returns: encoded_labels (array-like) – Encoded integer labels.
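To make the fit / transform / inverse_transform round trip concrete, here is a minimal stand-in encoder. The class `SimpleLabelEncoder` is hypothetical and only mirrors the method names documented above; the library's own implementation may differ, for example in how it orders the labels.

```python
import numpy as np


class SimpleLabelEncoder:
    """Hypothetical minimal encoder, not the library's implementation."""

    def fit(self, y):
        # Sorted unique labels define the integer codes 0..n_classes-1
        self.classes_ = sorted(set(y))
        self._index = {label: i for i, label in enumerate(self.classes_)}
        return self

    def transform(self, y):
        # Raises KeyError for labels that were not seen during fit
        return np.array([self._index[label] for label in y])

    def fit_transform(self, y):
        return self.fit(y).transform(y)

    def inverse_transform(self, y):
        return np.array([self.classes_[int(i)] for i in y])


enc = SimpleLabelEncoder()
codes = enc.fit_transform(["cat", "dog", "cat", "bird"])
restored = enc.inverse_transform(codes)
print(codes, restored)
```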

metacluster.utils.io_util module¶

metacluster.utils.io_util.write_dict_to_csv(data: dict, save_path=None, file_name=None)[source]¶

Write a list of dictionaries to a CSV file.

Parameters

data (list) – A list of dictionaries.
save_path (str) – Path to save the file
file_name (str) – The name of the output CSV file.

Returns

None
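A minimal sketch of writing a list of dictionaries to CSV with the standard library is shown below. The helper `write_rows_to_csv` is hypothetical, and details such as appending ".csv" to the file name and creating the directory are assumptions, not confirmed behavior of write_dict_to_csv.

```python
import csv
import tempfile
from pathlib import Path


def write_rows_to_csv(rows, save_path, file_name):
    """Hypothetical helper: write dictionaries sharing the same keys to a CSV file."""
    out_dir = Path(save_path)
    out_dir.mkdir(parents=True, exist_ok=True)   # assumption: create the folder if missing
    out_file = out_dir / f"{file_name}.csv"      # assumption: ".csv" is appended
    with open(out_file, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()                     # header row taken from the first dict
        writer.writerows(rows)
    return out_file


with tempfile.TemporaryDirectory() as tmp:
    path = write_rows_to_csv([{"optimizer": "GWO", "DBI": 1.18145}], tmp, "result")
    lines = path.read_text().splitlines()
print(lines)
```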

metacluster.utils.mealpy_util module¶

class metacluster.utils.mealpy_util.KCentersClusteringProblem(bounds=None, minmax=None, data=None, obj_name=None, **kwargs)[source]¶

Bases: Problem

get_metrics(solution=None, list_metric=None, list_paras=None)[source]¶

static get_y_pred(X, solution)[source]¶

obj_func(solution)[source]¶

Objective function

Parameters: solution (numpy.ndarray) – Solution.
Returns: Function value of the solution.
Return type: float

class metacluster.utils.mealpy_util.KMeansParametersProblem(bounds=None, minmax='min', X=None, obj_name=None, seed=None, **kwargs)[source]¶

Bases: Problem

get_model(solution) → KMeans[source]¶

obj_func(solution)[source]¶

Objective function

Parameters: solution (numpy.ndarray) – Solution.
Returns: Function value of the solution.
Return type: float

metacluster.utils.validator module¶

metacluster.utils.validator.check_bool(name: str, value: bool, bound=(True, False))[source]¶

metacluster.utils.validator.check_float(name: str, value: int, bound=None)[source]¶

metacluster.utils.validator.check_int(name: str, value: int, bound=None)[source]¶

metacluster.utils.validator.check_str(name: str, value: str, bound=None)[source]¶

metacluster.utils.validator.check_tuple_float(name: str, values: tuple, bounds=None)[source]¶

metacluster.utils.validator.check_tuple_int(name: str, values: tuple, bounds=None)[source]¶

metacluster.utils.validator.is_in_bound(value, bound)[source]¶

metacluster.utils.validator.is_str_in_list(value: str, my_list: list)[source]¶

metacluster.utils.visualize_util module¶

metacluster.utils.visualize_util.export_boxplot_figures(df, figure_size=(500, 600), xlabel='Optimizer', ylabel=None, title='Boxplot of comparison models', show_legend=True, show_mean_only=False, exts=('.png', '.pdf'), file_name='boxplot', save_path='history')[source]¶

Parameters

df (pd.DataFrame) –

The format of the df parameter:

optimizer    DBI
FBIO         1.18145
FBIO         1.1815
GWO          1.18145
GWO          1.18153
FBIO         1.18147
FBIO         1.18145
GWO          1.18137

figure_size (list, tuple, np.ndarray, None; default=(500, 600)) – The size of the saved figures, given as (width, height) in pixels (100px to 1500px). None means the size is set automatically.
xlabel (str; default="Optimizer") – The x-axis label of the boxplot figures.
ylabel (str; default=None) – The y-axis label of the boxplot figures.
title (str; default="Boxplot of comparison models") – The title of the figures. It should be the same for all objectives, since the y-axis label already distinguishes them.
show_legend (bool; default=True) – Whether to show the legend. This option can be turned on or off for boxplots, but not for convergence charts.
show_mean_only (bool; default=False) – If True, show only the mean value; otherwise show the mean, std, and median of each box.
exts (list, tuple, np.ndarray; default=(".png", ".pdf")) – The file extensions of the saved figures. Several formats serve different uses, such as LaTeX (needs ".pdf") and Word (needs ".png").
file_name (str; default="boxplot") – The prefix for filenames that will be saved.
save_path (str; default="history") – The path to save the figure
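A DataFrame in the long format shown above can be built as follows. The values are the ones from the example; the per-optimizer summary is only an illustration of the statistics a boxplot conveys and is not part of export_boxplot_figures.

```python
import pandas as pd

# One row per trial: the optimizer name and the metric value it reached
df = pd.DataFrame({
    "optimizer": ["FBIO", "FBIO", "GWO", "GWO", "FBIO", "FBIO", "GWO"],
    "DBI": [1.18145, 1.1815, 1.18145, 1.18153, 1.18147, 1.18145, 1.18137],
})

# Per-optimizer summary of the same values
summary = df.groupby("optimizer")["DBI"].agg(["mean", "median", "std"])
print(summary)
```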

metacluster.utils.visualize_util.export_convergence_figures(df, figure_size=(500, 600), xlabel='Epoch', ylabel='Fitness value', title='Convergence chart of comparison models', legend_name='Optimizer', exts=('.png', '.pdf'), file_name='convergence', save_path='history')[source]¶

Parameters

df (pd.DataFrame) –

The format of the df parameter:

FBIO           GWO
62.62501039    62.72457583
62.62085777    62.71386468
62.62085777    62.71386468
62.62085777    62.71386468
62.62085777    62.66383109
62.62085777    62.66310589

figure_size (list, tuple, np.ndarray, None; default=(500, 600)) – The size of the saved figures, given as (width, height) in pixels (100px to 1500px). None means the size is set automatically.
xlabel (str; default="Epoch") – The x-axis label of the convergence figures.
ylabel (str; default="Fitness value") – The y-axis label of the convergence figures.
title (str; default="Convergence chart of comparison models") – The title of the figures. It should be the same for all objectives, since the y-axis label already distinguishes them.
legend_name (str; default="Optimizer") – The name of the legend.
exts (list, tuple, np.ndarray; default=(".png", ".pdf")) – The file extensions of the saved figures. Several formats serve different uses, such as LaTeX (needs ".pdf") and Word (needs ".png").
file_name (str; default="convergence") – The prefix for filenames that will be saved.
save_path (str; default="history") – The path to save the figure
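The wide format shown above, with one column per optimizer, can be built as in the sketch below. Treating each row as one epoch and labeling the index "epoch" are assumptions for illustration; the values are the ones from the example.

```python
import pandas as pd

# One column per optimizer; each row is assumed to be the fitness at one epoch
df = pd.DataFrame({
    "FBIO": [62.62501039, 62.62085777, 62.62085777,
             62.62085777, 62.62085777, 62.62085777],
    "GWO": [62.72457583, 62.71386468, 62.71386468,
            62.71386468, 62.66383109, 62.66310589],
})
df.index = pd.RangeIndex(start=1, stop=len(df) + 1, name="epoch")

# A convergence curve for a minimized fitness should never go up
is_non_increasing = (df.diff().dropna() <= 0).all()
print(is_non_increasing)
```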