metacluster.utils package¶
metacluster.utils.cluster module¶
- metacluster.utils.cluster.compute_Wk(data: ndarray, classification_result: ndarray)[source]¶
Computes the within-cluster dispersion Wk for a given clustering result.
- Parameters
data (np.ndarray) – All the data points.
classification_result (np.ndarray) – The cluster label assigned to each data point.
- Returns
Wk
- Return type
float
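As a rough illustration of what such a dispersion measure looks like, the sketch below assumes Wk is the pooled sum of squared distances from each point to the mean of its cluster. The helper name `within_cluster_dispersion` is hypothetical, and the library's exact formula may differ.

```python
import numpy as np

def within_cluster_dispersion(data, labels):
    """Sum of squared distances from each point to its cluster mean (assumed definition)."""
    data = np.asarray(data, dtype=float)
    labels = np.asarray(labels)
    wk = 0.0
    for label in np.unique(labels):
        members = data[labels == label]      # rows assigned to this cluster
        centroid = members.mean(axis=0)      # per-feature mean of the cluster
        wk += float(((members - centroid) ** 2).sum())
    return wk

# Two tight clusters of two points each.
data = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [10.0, 12.0]])
labels = np.array([0, 0, 1, 1])
wk = within_cluster_dispersion(data, labels)
```

Each cluster contributes 2.0 here, so the total dispersion is 4.0.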
- metacluster.utils.cluster.compute_gap_statistic(X, refs=None, B=10, list_K=None, N_init=10)[source]¶
Computes the gap statistic. The function first generates B reference samples, each the same size as the original dataset, with values drawn from a uniform distribution over the range of each feature of the original dataset. It then uses a simplified formula to compute the dispersion D of each cluster, and from that the Wk. list_K should be an increasing list of K values; 1 to 10 is usually sufficient. B is the number of reference samples used to run the gap statistic; 10 is recommended, and it should not be decreased to a smaller value.
- Parameters
X (np.array) – The original data.
refs (np.ndarray or None) – The reference data to compare against, if you already have some. If None, the function generates the reference samples automatically.
B (int) – The number of reference samples used to run the gap statistic. 10 is recommended, and it should not be decreased to a smaller value.
list_K (list) – The range of K values to test.
N_init (int) – The number of initial starting points for each K-means run under sklearn, used to get a stable clustering result each time.
- Returns
gaps (np.array) – The gap-statistic results.
s_k (float) – The baseline value to subtract; see the reference paper for its detailed meaning.
K (list) – All the tested K values.
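The two numeric steps described above, drawing a uniform reference sample and turning dispersions into a gap value, can be sketched as follows. This is a minimal sketch assuming the commonly used gap-statistic definition (mean log reference dispersion minus log observed dispersion, with s_k as the standard deviation scaled by sqrt(1 + 1/B)). The helper names `uniform_reference` and `gap_and_sk` are hypothetical, and the Wk inputs are assumed to come from clustering the original data and each reference sample with the same K.

```python
import numpy as np

def uniform_reference(X, rng):
    """Draw one reference sample, uniform over each feature's observed range."""
    X = np.asarray(X, dtype=float)
    return rng.uniform(X.min(axis=0), X.max(axis=0), size=X.shape)

def gap_and_sk(wk, ref_wks):
    """Gap value and s_k for one K, given Wk on the data and on B reference samples."""
    log_ref = np.log(np.asarray(ref_wks, dtype=float))
    gap = log_ref.mean() - np.log(wk)
    # B is the number of reference dispersions supplied.
    s_k = log_ref.std() * np.sqrt(1.0 + 1.0 / log_ref.size)
    return float(gap), float(s_k)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
ref = uniform_reference(X, rng)                       # same shape as X
gap, s_k = gap_and_sk(wk=10.0, ref_wks=[20.0, 20.0, 20.0])
```

With these inputs the gap equals log(2), and s_k is zero because the reference dispersions are identical.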
- metacluster.utils.cluster.get_clusters_by_calinski_harabasz(X, list_clusters=None, **kwargs)[source]¶
- metacluster.utils.cluster.get_clusters_by_elbow(X, list_clusters=None, **kwargs)[source]¶
First, apply K-means clustering to the dataset for a range of different values of K, where K is the number of clusters. For example, you might try K=1,2,3,…,10.
For each value of K, compute the Sum of Squared Errors (SSE), which is the sum of the squared distances between each data point and its assigned centroid. The SSE can be obtained from the KMeans object’s inertia_ attribute.
Plot the SSE for each value of K. You should see that the SSE decreases as K increases, because as K increases, the centroids are closer to the data points. However, at some point, increasing K further will not improve the SSE as much. The idea of the elbow method is to identify the value of K at which the SSE starts to level off or decrease less rapidly, forming an “elbow” in the plot. This value of K is considered the optimal number of clusters.
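The "elbow" can also be located numerically instead of by eye. The sketch below uses one common heuristic, chosen here for illustration and not necessarily what `get_clusters_by_elbow` does: pick the K whose point on the normalised SSE curve lies farthest from the straight line joining the first and last points. The function name `elbow_by_chord_distance` is hypothetical. The SSE values are assumed to come from the `inertia_` attribute of fitted KMeans objects and must not all be equal.

```python
import numpy as np

def elbow_by_chord_distance(ks, sse):
    """Return the K farthest from the chord joining the ends of the SSE curve."""
    ks = np.asarray(ks, dtype=float)
    sse = np.asarray(sse, dtype=float)
    # Normalise both axes to [0, 1] so neither dominates the distance.
    x = (ks - ks[0]) / (ks[-1] - ks[0])
    y = (sse - sse.min()) / (sse.max() - sse.min())
    # Perpendicular distance of each point to the line through the end points.
    dx, dy = x[-1] - x[0], y[-1] - y[0]
    dist = np.abs(dy * (x - x[0]) - dx * (y - y[0])) / np.hypot(dx, dy)
    return int(ks[np.argmax(dist)])

ks = [1, 2, 3, 4, 5, 6]
sse = [100.0, 40.0, 15.0, 12.0, 10.0, 9.0]   # illustrative values only
best_k = elbow_by_chord_distance(ks, sse)
```

For these illustrative values the curve flattens after K=3, which is the value the heuristic returns.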
metacluster.utils.data_loader module¶
- class metacluster.utils.data_loader.Data(X, y=None, name='Unknown')[source]¶
Bases: object
The structure of our supported Data class
- Parameters
X (np.ndarray) – The features of your data
y (np.ndarray, Optional, default=None) – The labels of your data. For clustering problems, this can be None.
- SUPPORT = {'scaler': ['StandardScaler', 'MinMaxScaler', 'MaxAbsScaler', 'RobustScaler', 'Normalizer']}¶
metacluster.utils.encoder module¶
- class metacluster.utils.encoder.LabelEncoder[source]¶
Bases: object
Encode categorical features as integer labels.
- fit_transform(y)[source]¶
Fit label encoder and return encoded labels.
- Parameters
y (array-like of shape (n_samples,)) – Target values.
- Returns
y – Encoded labels.
- Return type
array-like of shape (n_samples,)
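To show what encoding labels as integers means in practice, here is a small stand-alone sketch built on `numpy.unique`. It is not the library's implementation. The helper name `fit_transform_labels` is hypothetical, and assigning codes in sorted order of the distinct values is an assumption.

```python
import numpy as np

def fit_transform_labels(y):
    """Map each distinct value in y to an integer code (sorted order assumed)."""
    classes, encoded = np.unique(np.asarray(y), return_inverse=True)
    return classes, encoded

classes, encoded = fit_transform_labels(["b", "a", "b", "c"])
# classes holds the distinct labels; encoded holds one integer per input item.
```

Indexing `classes[encoded]` recovers the original labels, which is the usual way to invert such an encoding.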
metacluster.utils.io_util module¶
metacluster.utils.mealpy_util module¶
- class metacluster.utils.mealpy_util.KCentersClusteringProblem(bounds=None, minmax=None, data=None, obj_name=None, **kwargs)[source]¶
Bases:
Problem
metacluster.utils.validator module¶
metacluster.utils.visualize_util module¶
- metacluster.utils.visualize_util.export_boxplot_figures(df, figure_size=(500, 600), xlabel='Optimizer', ylabel=None, title='Boxplot of comparison models', show_legend=True, show_mean_only=False, exts=('.png', '.pdf'), file_name='boxplot', save_path='history')[source]¶
- Parameters
df (pd.DataFrame) –
- The format of df parameter:
optimizer  DBI
FBIO       1.18145
FBIO       1.1815
GWO        1.18145
GWO        1.18153
FBIO       1.18147
FBIO       1.18145
GWO        1.18137
figure_size (list, tuple, np.ndarray, None; default=(500, 600)) – The size of the saved figures. None means the size is set automatically. Otherwise, pass the (width, height) of the figure in pixels (100px to 1500px).
xlabel (str; default="Optimizer") – The label for x coordinate of boxplot figures.
ylabel (str; default=None) – The label for y coordinate of boxplot figures.
title (str; default="Boxplot of comparison models") – The title of the figures. It should be the same for all objectives, since the y-axis label already differs between them.
show_legend (bool; default=True) – Whether to show the legend. This option can be turned on or off for boxplots, but not for convergence charts.
show_mean_only (bool; default=False) – If True, show only the mean value; otherwise show the mean, std, and median of each box.
exts (list, tuple, np.ndarray; default=(".png", ".pdf")) – List of file extensions for the saved figures. Several formats can be saved at once for different uses, such as LaTeX (needs ".pdf") and Word (needs ".png").
file_name (str; default="boxplot") – The prefix for filenames that will be saved.
save_path (str; default="history") – The path to save the figure
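A DataFrame in the long format shown above, with one row per trial, can be assembled as in the sketch below. It assumes pandas is installed and reuses a few values from the example table. The commented-out call only illustrates where the frame would be passed and is not executed here.

```python
import pandas as pd

# One row per (optimizer, trial): the optimizer name and the metric value.
df = pd.DataFrame({
    "optimizer": ["FBIO", "FBIO", "GWO", "GWO"],
    "DBI": [1.18145, 1.1815, 1.18145, 1.18153],
})

# Hypothetical usage, assuming metacluster is installed:
# export_boxplot_figures(df, ylabel="DBI", file_name="boxplot", save_path="history")
```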
- metacluster.utils.visualize_util.export_convergence_figures(df, figure_size=(500, 600), xlabel='Epoch', ylabel='Fitness value', title='Convergence chart of comparison models', legend_name='Optimizer', exts=('.png', '.pdf'), file_name='convergence', save_path='history')[source]¶
- Parameters
df (pd.DataFrame) –
- The format of df parameter:
FBIO         GWO
62.62501039  62.72457583
62.62085777  62.71386468
62.62085777  62.71386468
62.62085777  62.71386468
62.62085777  62.66383109
62.62085777  62.66310589
figure_size (list, tuple, np.ndarray, None; default=(500, 600)) – The size of the saved figures. None means the size is set automatically. Otherwise, pass the (width, height) of the figure in pixels (100px to 1500px).
xlabel (str; default="Epoch") – The label for the x-axis of convergence figures.
ylabel (str; default="Fitness value") – The label for the y-axis of convergence figures.
title (str; default="Convergence chart of comparison models") – The title of the figures. It should be the same for all objectives, since the y-axis label already differs between them.
legend_name (str; default="Optimizer") – Set the name for the legend.
exts (list, tuple, np.ndarray; default=(".png", ".pdf")) – List of file extensions for the saved figures. Several formats can be saved at once for different uses, such as LaTeX (needs ".pdf") and Word (needs ".png").
file_name (str; default="convergence") – The prefix for filenames that will be saved.
save_path (str; default="history") – The path to save the figure
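The wide format expected here has one column per optimizer and one row per epoch. The sketch below assumes pandas is installed and that each optimizer's fitness history is available as a list of equal length. The `histories` dictionary is a hypothetical name, filled with the first rows of the example table.

```python
import pandas as pd

# Per-optimizer fitness history, one value per epoch.
histories = {
    "FBIO": [62.62501039, 62.62085777, 62.62085777],
    "GWO": [62.72457583, 62.71386468, 62.71386468],
}
df = pd.DataFrame(histories)   # columns are optimizers, rows are epochs

# Hypothetical usage, assuming metacluster is installed:
# export_convergence_figures(df, file_name="convergence", save_path="history")
```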