An explicit and structured process is proposed for the conservation and enhancement of cultural heritage through artificial intelligence and computer-driven art design. The framework combines a deep convolutional architecture, the Feature Pyramid Network (FPN), with a bio-inspired optimization method, the Modified Builder Optimization Algorithm (MBOA). Three closely related artifact objectives are addressed in cultural heritage: (1) the classification and documentation of different artistic styles, (2) the digital re-creation of damaged or incomplete artifacts, and (3) the creation of new visual instances that are both stylistically and culturally grounded in the past.
The proposed methodology addresses important problems in heritage preservation, including the diversity of artistic form and expression across cultures, the degradation of physical objects, and the need for computationally intelligent systems that can learn, respond to, and re-express cultural semantics. By combining multi-scale feature learning with a metaheuristic-driven model optimization approach, the proposed framework offers both semantic richness and architectural efficiency while covering a wide range of art forms across domains.
The proposed methodology consists of five interdependent components that form a cohesive pipeline: (1) Dataset and Preprocessing, ensuring high-quality, diverse, and representative input; (2) Feature Pyramid Network (FPN) Architecture, allowing hierarchical feature extraction at different spatial scales; (3) Modified Builder Optimization Algorithm (MBOA), a novel optimization strategy that tunes the FPN hyperparameters to improve learning dynamics; (4) Task-Specific Models, applying the tuned FPN to classification, restoration, and generative tasks; and (5) Training and Optimization Workflow, coordinating the interaction of MBOA and deep learning to achieve the best model convergence and generalization. Figure 2 illustrates the pipeline of the proposed methodology.

The pipeline of the proposed methodology.
This modular-yet-integrated design paradigm enables shared representation learning across tasks while preserving task-specific specialization, leading to both computational efficiency and cultural fidelity in learning. Using FPN as a backbone makes it possible to learn both fine-grained textural features (e.g., brushstrokes in Impressionist paintings or geometric patterns in Islamic ornamental art) and global compositional structures (e.g., symmetry in Byzantine artifacts or spatial depth in Renaissance compositions).
At the same time, MBOA-enabled learning improves the model’s adaptability within the high-dimensional hyperparameter space, covering learning rate schedules, filter dimensions, regularization strengths, and network depth, thus avoiding sole reliance on manual tuning or gradient-based methods, which often converge to suboptimal local minima.
Each aspect of the methodology is described in the following subsections and substantiated with mathematical frameworks, architectural diagrams, and parameters for reproducibility and scientific rigor. As a result, the framework presented is not just a technical tool but an AI framework with cultural sensitivity, one that has the ability to learn from past, recover that which is lost, and create new forms of artistic expression based on heritage.
This holistic, transformative methodology is a step forward in the application of AI to cultural preservation, providing an ethical, scalable, and transparent model for continued research, practice, and application in museums, archives, and education.
Dataset and preprocessing
This research builds on the WikiArt dataset, a publicly available large-scale archive of digitized fine art. The database contains over 80,000 high-resolution images spanning 27 artistic styles, produced by over 1,000 artists from a variety of cultural and historical backgrounds. Each artwork is tagged with metadata, including style, genre, artist, and date of production, which enables supervised learning for classification and style-aware generation. Examples from the WikiArt data, including Metaphysical art, Early Dynastic, Middle Kingdom, Magic realism, Neo-baroque, and Nastaliq styles, can be seen in Fig. 3.

Some samples of the WikiArt dataset: (A) Metaphysical art, (B) Early Dynastic, (C) Middle Kingdom, (D) Magic realism, (E) Neo-baroque, (F) Nastaliq.
To ensure uniformity and compatibility with the deep learning model, all images follow a defined preprocessing pipeline. First, images are resized to a standard dimension of \(512\times 512\) pixels, balancing computation time against preservation of fine artistic detail such as brushstrokes and textures. Bicubic interpolation is used for resizing to mitigate artifacts. Next, pixel values are rescaled to the range [0,1] by dividing by 255, followed by mean subtraction and standard-deviation normalization using ImageNet statistics:
$${X}_{norm}=\frac{X-\mu }{\sigma }$$
(1)
where, \(\mu =[\text{0.485,0.456,0.406}]\), and \(\sigma =[\text{0.229,0.224,0.225}]\)
To further enhance model generalization and robustness and to avoid overfitting, data augmentation is applied during training: random horizontal flips, rotations up to \({15}^{\circ }\), limited color jitter (brightness, contrast, saturation ±10%), and limited scaling (±10%). No augmentation is applied during validation or testing. Artistic styles are mapped to categorical class labels using one-hot encoding, so the models perform a multi-class classification task. Figure 4 shows the preprocessing pipeline for the cultural heritage data.

Cultural heritage AI: preprocessing pipeline.
The dataset is partitioned into training (70%), validation (15%), and test (15%) subsets, ensuring that no artist or artwork appears across multiple splits to prevent data leakage. This rigorous preprocessing ensures that the input data is both representative of global artistic diversity and suitable for high-fidelity deep learning tasks.
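The normalization of Eq. (1) and the artist-disjoint 70/15/15 split described above can be sketched as follows. This is an illustrative implementation, not the authors' exact code; the `split_by_artist` helper and its seed are assumptions introduced here to show how artist-level leakage can be prevented.

```python
import numpy as np

# ImageNet channel statistics used in Eq. (1).
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])

def normalize(image_uint8: np.ndarray) -> np.ndarray:
    """Rescale a (H, W, 3) uint8 image to [0, 1], then apply Eq. (1)."""
    x = image_uint8.astype(np.float32) / 255.0
    return (x - IMAGENET_MEAN) / IMAGENET_STD

def split_by_artist(artists, seed=0):
    """Artist-disjoint 70/15/15 split: no artist appears in two subsets.
    (Hypothetical helper; the paper does not specify its splitting code.)"""
    rng = np.random.default_rng(seed)
    unique = rng.permutation(sorted(set(artists)))
    n = len(unique)
    train = set(unique[: int(0.70 * n)])
    val = set(unique[int(0.70 * n): int(0.85 * n)])
    test = set(unique[int(0.85 * n):])
    return train, val, test
```

Splitting by artist identity rather than by image is what prevents the data leakage mentioned above: two paintings by the same artist can be near-duplicates in style.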
In order to guarantee dataset quality and the integrity of the processed information, the following criteria were strictly checked:

1. Data distribution consistency: The WikiArt database was examined for balance across its 27 artistic categories; statistical tests showed a maximum class-imbalance ratio of 1:3.2 (e.g., Renaissance vs. Ukiyo-e), which was mitigated through stratified sampling in the train/validation/test splits9.

2. Image quality metrics: To reduce artifacts, all 512 × 512 images were resized using bicubic interpolation, and PSNR was computed after resizing to verify that fine textures were not lost.

3. Completeness and validity: Metadata annotations (style, artist, date) were 99.2% complete and were cross-checked against art-historical databases to ensure stylistic correctness.

4. Augmentation robustness: The training set was enhanced with random flips, rotations, and color jitter; ablation tests confirmed that feature consistency was preserved (SSIM > 0.92 between augmented and original images).

5. Noise and corruption checking: Synthetic degradation tests using Gaussian noise and pixel masking confirmed the robustness of the dataset, i.e., no intrinsic bias toward low-quality samples10.
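The PSNR check in item 2 above can be sketched as follows. This is a generic PSNR definition, not the authors' exact quality-control script; the chosen peak value of 255 assumes 8-bit images.

```python
import numpy as np

def psnr(original: np.ndarray, processed: np.ndarray, peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    mse = np.mean((original.astype(np.float64) - processed.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(peak ** 2 / mse)
```

In a pipeline like the one described, one would resize an image down and back up, then flag samples whose PSNR against the original falls below a chosen threshold as having lost fine texture.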
ImageNet statistics are employed for dataset normalization because they have proven effective in transfer learning and improve model generalization across a variety of visual domains.
ImageNet, a massive dataset of more than a million images in 1,000 categories, is a widely used benchmark for training deep learning models, including convolutional neural networks (CNNs). The normalization process subtracts the ImageNet mean and divides by the ImageNet standard deviation, centering the data and scaling it appropriately. This approach has been shown to accelerate convergence and to improve performance on new datasets not otherwise linked to ImageNet.
In cultural heritage preservation, where the variety of artistic styles and the intricacy of visual characteristics pose a substantial challenge, this normalization is better suited than simple rescaling alone.
Empirical validation of this choice comes from related work in art classification and restoration, where ImageNet-based normalization has improved model performance and reduced training times. By adopting this established normalization scheme, the proposed framework leverages accumulated knowledge and best practices in the field, giving the subsequent feature extraction and model training steps a solid foundation.
Feature pyramid network (FPN) architecture
The first module is the modified feature pyramid network, which produces initial feature maps at various scales and levels. The second module is the deep semantic embedding module, which increases the capacity to produce features that are semantically appropriate and high in spatial resolution.
The third module is a two-branch deep feature fusion system, comprising a top branch and a down branch, specifically designed to separate feature levels and enable their fusion. The fused deep features then serve as input to the fourth module for classification. Figure 5 illustrates the general structure of the proposed approach and its four main modules.

The overall framework of the proposed approach.
Enhanced feature pyramid network
Current CNN-based approaches are often formulated end-to-end and seek to learn an overall image-level representation from the raw image. It is widely accepted, however, that neurons in higher layers respond to global aspects of the image, while neurons in lower (shallow) layers respond to local features. This suggests that local object-level features extracted from lower layers are important for improving the performance of the approach.
To address this shortcoming, a pyramid network design has been proposed that can gather local object-level features as well as global image-level features. The proposed network model is based on the standard FPN (Feature Pyramid Network) model.
The FPN learns a feature pyramid across levels and resolution scales. However, the standard FPN uses bilinear interpolation or nearest-neighbor upsampling to create higher-resolution feature maps, which loses high-frequency components of the original higher-resolution features, introduces discontinuities, and distorts object edges, impacting the accuracy of the generated feature maps. This motivates a more effective upsampling method: deconvolution.
Deconvolution is a valuable operation in several areas, including motion deblurring, super-resolution, and semantic segmentation; it helps recover details lost in the convolutional layers of the FPN while also suppressing noise and blur. The pyramid-like architecture proposed for this model is labeled the enhanced feature pyramid network (EFPN).
Like the FPN, it has a bottom-up pathway, lateral connections, and a top-down pathway. It is important to note that in the top-down pathway the spatial resolution is increased using deconvolution. A full description of the EFPN follows. Figure 6 illustrates the design of the improved model.

Representation of the proposed enhanced feature pyramid network.
The bottom-up pathway is defined per level of the backbone, with each level generating feature maps of the same spatial resolution. Since the feature maps generated from the various levels must be multi-level and multi-scale, ResNet34, with five levels, was used as the backbone.
The training dataset is denoted \(Q=[\left({I}_{n}, {L}_{n}\right), n=1, 2, . . . , N]\), where \(N\) is the number of samples, \({I}_{n}\) the input, and \({L}_{n}\) the class label of \({I}_{n}\). Each \({I}_{n}\) was passed through ResNet34, and the output of each level's final residual block was recorded.
The feature map was generated with the following:
$${Y}_{p,q}=\mathcal{F}\left(X,\omega \right)=\sum_{i,j\in {\Omega }_{k}}{\omega }_{i+\frac{K-1}{2},\, j+\frac{K-1}{2}}^{T}\,{X}_{p+i,\, q+j}$$
(2)
where the feature map is \({Y}_{p,q}\in {R}^{c}\), \(\mathcal{F}\) denotes the convolutional layer, \((p,q)\) is the spatial coordinate, and the output tensor \(Y\) and input tensor \(X\) correspond to the final convolutional layers of ResNet34, with \(Y,X\in {R}^{H\times W\times C}\), where \(H\) and \(W\) encode the spatial dimensions and \(C\) the number of channels. \(\omega \in {R}^{K\times K\times C}\) is the convolution kernel with \(C\) channels, and the local neighborhood is defined as follows:
$${\Omega }_{k}=\left[\left(i,j\right):i=\left[-\frac{K-1}{2}, . . .,\frac{K-1}{2}\right], j=\left[-\frac{K-1}{2}, . . . ,\frac{K-1}{2}\right]\right]$$
(3)
where, there is a local neighborhood using \({\Omega }_{k}\), and \(K\) is an odd integer.
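A naive reference implementation of Eqs. (2)-(3) makes the indexing explicit. This is an illustrative sketch, not the paper's code: it computes a single-kernel (one output channel) convolution with zero padding so that the output keeps the \(H\times W\) spatial size.

```python
import numpy as np

def conv_single_kernel(X: np.ndarray, omega: np.ndarray) -> np.ndarray:
    """Eq. (2) for one kernel: X is (H, W, C), omega is (K, K, C), K odd.
    Each output Y[p, q] sums kernel-weighted input vectors over the
    neighborhood Omega_k of Eq. (3); zero padding handles the borders."""
    H, W, C = X.shape
    K = omega.shape[0]
    h = (K - 1) // 2
    Xp = np.pad(X, ((h, h), (h, h), (0, 0)))  # zero-pad spatial dims
    Y = np.zeros((H, W))
    for p in range(H):
        for q in range(W):
            for i in range(-h, h + 1):        # i, j range over Omega_k
                for j in range(-h, h + 1):
                    # omega^T X: dot product over the C channels
                    Y[p, q] += omega[i + h, j + h] @ Xp[p + i + h, q + j + h]
    return Y
```

Real frameworks compute this with many kernels at once and without Python loops; the loop form is only meant to mirror the summation in Eq. (2) term by term.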
The stage outputs of ResNet34 are used as the initial bottom-up channels for \({I}_{n}\), denoted \({F}_{i}\), where \(i\) indicates stage \(i\) of ResNet34.
In the lateral connections, a \(1\times 1\) convolution layer is used to reduce the channel (feature map) dimension of the bottom-up maps \({F}_{i}\), as follows:
$${L}_{i}= {\mathcal{F}}_{1}\left({F}_{i}, {\omega }_{1}\right), i=2, 3, 4, 5$$
(4)
where \({\mathcal{F}}_{1}\left({F}_{i}, {\omega }_{1}\right)\) denotes convolution with a \(1\times 1\) kernel \({\omega }_{1}\). Through these lateral connections, enriched features are conveyed from the bottom-up maps to the top-down maps; the outputs of stages conv2_3, conv3_4, conv4_6, and conv5_3 serve as the initial bottom-up channels for \({I}_{n}\), denoted \({F}_{i}\epsilon {R}^{{H}_{i}\times {W}_{i}\times {C}_{i}}\), where \(i\) indicates stage \(i\) of ResNet34.
In the top-down path, recognizing that the feature maps with the strongest semantics are spatially the least detailed, a deconvolution processing block has been designed: a deconvolutional layer followed by a BN (Batch Normalization) layer and a ReLU (Rectified Linear Unit), i.e., Deconv-BN-ReLU. The purpose of this block is to take a coarser-resolution feature map and increase its spatial resolution by a factor of 2. The deconvolution operation is denoted as:
$${T}_{i}=\mathcal{G}\left({P}_{i}, {\varphi }_{i}\right), i=3, 4, 5$$
(5)
Here, \(\mathcal{G}\left(\cdot , {\varphi }_{i}\right)\) denotes deconvolution, where \({\varphi }_{i}\) is a kernel of size \(3\times 3\). The upsampled (higher-resolution) feature map is then combined element-wise with the corresponding bottom-up feature map. To reduce the pixel aliasing effect of upsampling, a \(3\times 3\) convolution is applied to the output of the element-wise sum. The resulting EFPN feature maps are denoted \({P}_{i}\), where \(i\) ranges from 3 to 5.
$${P}_{i}={\mathcal{F}}_{2}\left({T}_{i+1}\oplus {L}_{i}, {\omega }_{2}\right)$$
(6)
where \(\oplus\) denotes element-wise addition, and \({\mathcal{F}}_{2}\left(\cdot , {\omega }_{2}\right)\) denotes a convolution with a \(3\times 3\) kernel and parameters \({\omega }_{2}\). Note that within the top-down path of the EFPN, \({F}_{5}\) produces \({P}_{5}\) via a \(1\times 1\) convolution. The outputs of the EFPN are the sets \([{P}_{2}, {P}_{3}, {P}_{4}, {P}_{5}]\) corresponding to \([{F}_{2}, {F}_{3}, {F}_{4}, {F}_{5}]\).
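The top-down step of Eqs. (5)-(6) can be sketched numerically under simplifying assumptions: single-channel maps, a \(2\times 2\) stride-2 deconvolution kernel (so the output resolution doubles exactly), a scalar stand-in for the \(1\times 1\) lateral convolution, and the final \(3\times 3\) smoothing omitted. This is a minimal illustration of the data flow, not the EFPN implementation.

```python
import numpy as np

def deconv_2x(P: np.ndarray, phi: np.ndarray) -> np.ndarray:
    """Transposed convolution with a 2x2 kernel phi and stride 2:
    each input pixel spreads into a 2x2 output patch, doubling H and W
    (the role of G(., phi_i) in Eq. (5))."""
    H, W = P.shape
    T = np.zeros((2 * H, 2 * W))
    for p in range(H):
        for q in range(W):
            T[2 * p: 2 * p + 2, 2 * q: 2 * q + 2] += P[p, q] * phi
    return T

def topdown_merge(P_coarse, F_fine, phi, w_lateral=1.0):
    """Eq. (6) without the 3x3 smoothing: upsample the coarser map and
    add the (scalar-weighted) lateral map element-wise."""
    return deconv_2x(P_coarse, phi) + w_lateral * F_fine
```

Unlike bilinear or nearest-neighbor upsampling, the kernel `phi` here is learnable in the real network, which is what lets deconvolution recover detail rather than merely interpolate.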
Deep semantic embedding
As pooling and down-sampling proceed along the bottom-up path, the resolution of the upper-level channels diminishes. This loss of spatial detail limits their capability to delineate small-scale boundaries. The architecture therefore combines the high responses of bottom features with the strong activations of upper features: high instance responses are more useful for localizing objects accurately, while strong semantic activations provide a better understanding of the scene.
Accordingly, a simple, lightweight DSE (Deep Semantic Embedding) module was created that merges features from different levels. The DSE module efficiently transfers spatial information to the target map without traversing numerous layers of the architecture. By merging the rich spatial detail of lower-level features with the strong semantics of higher-level features, the DSE exploits both pieces of complementary information and obtains more valid features.
For DSE the two-stream inputs were represented by \(P=[{P}_{i}, {P}_{j}]\), where \(j\) is between 3 and 5 and \({P}_{j}\in {R}^{{H}_{j}\times {W}_{j}\times {C}_{j}}\) is related to \({j}^{th}\) feature maps. The architecture of DSE has been illustrated using Fig. 7.

Representation of the proposed deep semantic embedding module.
For more accurate representation, two convolutional layers with \(1\times 1\) filters and two with \(3\times 3\) filters enable communication and fusion of cross-level features at adjacent levels. The higher-level features are then upsampled using a deconvolution operator so that they share the same scale as the lower-level features. This can be expressed as:
$${S}_{j-1}={\mathcal{F}}_{2}\left({P}_{j-1}, {\omega }_{2}\right)$$
(7)
where, \({\mathcal{F}}_{2}\left(., {\omega }_{2}\right)\) demonstrates convolution with the size of \(3\times 3\) using \({\omega }_{2}\).
$${S}_{j}=\mathcal{G}\left({\mathcal{F}}_{1}\left({P}_{j}, {\omega }_{1}\right), {\varphi }_{j}\right)$$
(8)
where \({\mathcal{F}}_{1}\left(\cdot , {\omega }_{1}\right)\) represents convolution with a \(1\times 1\) kernel \({\omega }_{1}\), and \(\mathcal{G}\left(\cdot, {\varphi }_{j}\right)\) represents the deconvolution layer of the DSE. The rich semantic information of the upsampled features is then fused with the lower-level features through element-wise addition. A \(3\times 3\) convolution is then applied to the combined feature map to minimize aliasing effects in the final feature maps \({D}_{j-1}\), where \(j\) ranges from 3 to 5. The final outputs of the DSE are calculated as:
$${D}_{j-1}={\mathcal{F}}_{3}\left({S}_{j}\oplus {S}_{j-1}, {\omega }_{3}\right)$$
(9)
where, \({\mathcal{F}}_{3}\left(., {\omega }_{3}\right)\) depicts the convolution with the size of \(3\times 3\) using \({\omega }_{3}\).
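The DSE data flow of Eqs. (7)-(9) can be made concrete with a schematic sketch. The convolutions here are deliberately abstracted (identity for Eq. (7), nearest-neighbor doubling standing in for the deconvolution of Eq. (8), a box filter standing in for the \(3\times 3\) smoothing of Eq. (9)) so that only the project-upsample-add-smooth structure is shown; none of these stand-ins are the paper's actual learned layers.

```python
import numpy as np

def upsample_2x(P: np.ndarray) -> np.ndarray:
    """Stand-in for the deconvolution G(., phi_j) of Eq. (8): doubles H, W."""
    return np.repeat(np.repeat(P, 2, axis=0), 2, axis=1)

def smooth_3x3(S: np.ndarray) -> np.ndarray:
    """Stand-in for F_3(., omega_3) of Eq. (9): zero-padded 3x3 box filter."""
    Sp = np.pad(S, 1)
    H, W = S.shape
    return np.array([[Sp[i:i + 3, j:j + 3].mean() for j in range(W)]
                     for i in range(H)])

def dse(P_low: np.ndarray, P_high: np.ndarray) -> np.ndarray:
    """Eqs. (7)-(9): fuse a fine lower-level map with an upsampled
    coarser higher-level map, then smooth to reduce aliasing."""
    S_low = P_low                       # Eq. (7), conv treated as identity
    S_high = upsample_2x(P_high)        # Eq. (8)
    return smooth_3x3(S_high + S_low)   # Eq. (9)
```

The key property illustrated is that the higher-level map reaches the lower level in a single step, rather than percolating through many intermediate layers.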
Two-branch deep feature fusion
In an FPN (Feature Pyramid Network), a proposal is assigned to a single pyramid level for recognition based on the size of the object, because object detection focuses on assigning a class to each object. Though simple and effective for detection, this strategy performs poorly for some classification objectives, where the goal is to recognize combinations of different discriminative features.
The idea comes from the fact that when experts manually classify samples, they rely on semantic cues from the global appearance of the sample as well as its local features; hence both global-level and local-level features are necessary representations for discriminating such elements.
Specifically, the higher-level feature maps, produced by global receptive fields, offer rich semantic features; if lower-level features can access them, their capacity to absorb meaningful contextual information for prediction improves. Conversely, the lower-level feature maps, produced by local receptive fields, capture fine-grained details that help identify samples. These features can compensate for the loss of spatial information in higher-level features, which is favorable for classification tasks.
Based on this assessment, a two-branch deep feature fusion methodology has been developed to incorporate features from the lower and higher levels more effectively.
The architecture of two-branch deep feature fusion
The convolutional architecture has two branches that consider lower-level and higher-level feature maps, respectively. To handle different levels simultaneously and enlarge the receptive fields for extracting multi-level contextual information efficiently, without increasing computational cost, the proposed model utilizes both standard and atrous convolution.
Atrous (or dilated) convolution has been proven to be an effective strategy for problems that require dense prediction. To mitigate the potential risk of network degradation from increased depth, skip connections have been included in the model. Furthermore, the last components of the model will consist of two Global Average Pooling (GAP) layers that have been used to create the representation features. A high-level view of the proposed two-branch deep feature fusion model has been shown in Fig. 8.

A high-level depiction of the proposed two-branch deep feature fusion module.
Top branch
This branch takes the high-level feature maps \({D}_{4}\) produced by the DSE. It contains a global average pooling layer and two residual blocks. The details of the residual blocks in this branch are outlined in Table 2; each block sequentially stacks \(1\times 1\), \(3\times 3\), and \(3\times 3\) convolutional layers to learn deeper invariant features. A batch normalization layer and a ReLU layer follow every convolutional layer to provide nonlinear transformations. The internal paths are merged through an element-wise addition layer to form the output of the residual block.
The outcome of a residual block has been demonstrated in the following way:
$$Y = {\Psi }\left( {{\mathcal{F}}_{4} \left( {X, \omega_{4} } \right) \oplus {\mathcal{F}}\left( {X, \left( {\omega_{i} } \right)} \right)} \right)$$
(10)
where
$$\mathcal{F}\left(X, \left({\omega }_{i}\right)\right)={\mathcal{F}}_{3}\left({\mathcal{F}}_{2}\left({\mathcal{F}}_{1}\left(X, {\omega }_{1}\right), {\omega }_{2}\right), {\omega }_{3}\right)$$
(11)
where \(Y\) and \(X\) are the output and input of the residual block. \({\mathcal{F}}_{1}\), \({\mathcal{F}}_{2}\), \({\mathcal{F}}_{3}\), and \({\mathcal{F}}_{4}\) denote convolutional layers of sizes \(1\times 1\), \(3\times 3\), \(3\times 3\), and \(1\times 1\); \({\omega }_{1}\), \({\omega }_{2}\), \({\omega }_{3}\), and \({\omega }_{4}\) are their parameters; and \(\Psi\) denotes the ReLU function given in Eq. (12). The ReLU and BN layers have been omitted from Eq. (10) to simplify the notation.
$$\Psi \left(x\right)=\text{max}\left(x,0\right)$$
(12)
Now, the outcomes of the two residual blocks can be written in the following way:
$${Y}_{t}^{1}=\Psi \left({\mathcal{F}}_{4}^{1}\left({D}_{4}, {\omega }_{4}^{1}\right)\oplus {\mathcal{F}}_{3}^{1}\left({\mathcal{F}}_{2}^{1}\left({\mathcal{F}}_{1}^{1}\left({D}_{4},{\omega }_{1}^{1}\right), {\omega }_{2}^{1}\right), {\omega }_{3}^{1}\right)\right)$$
(13)
$${Y}_{t}^{2}=\Psi \left({\mathcal{F}}_{4}^{2}\left({Y}_{t}^{1}, {\omega }_{4}^{2}\right)\oplus {\mathcal{F}}_{3}^{2}\left({\mathcal{F}}_{2}^{2}\left({\mathcal{F}}_{1}^{2}\left({Y}_{t}^{1},{\omega }_{1}^{2}\right), {\omega }_{2}^{2}\right), {\omega }_{3}^{2}\right)\right)$$
(14)
where \({Y}_{t}^{1}\) and \({Y}_{t}^{2}\) are the outputs of the first and second residual blocks of the top branch, respectively. Note that the upper-level feature maps \({D}_{4}\) serve as input in Eq. (13), and the output of the first block, \({Y}_{t}^{1}\), serves as input in Eq. (14).
After the two residual blocks, GAP is applied to strengthen the relationships between feature maps and categories, yielding deep features. The features produced by the top branch can be expressed as
$$Branc{h}_{t}=\sigma \left({Y}_{t}^{2}\right)=\frac{1}{H\times W}\sum_{i=1}^{H}\sum_{j=1}^{W}{Y}_{i,j}^{l}$$
(15)
where GAP is demonstrated by \(\sigma\), and the feature map is given by \({Y}^{l}\in {R}^{H\times W}\) that has \(W\) width and \(H\) height for channel \(l\) of \({Y}_{t}^{2}\).
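The ReLU of Eq. (12), the residual combination of Eq. (10), and the GAP of Eq. (15) can be sketched as follows. The convolution stacks are abstracted to hypothetical scalar weightings (`w_short`, `w_path` are assumptions introduced here), so only the shortcut-plus-path-then-ReLU structure and the per-channel pooling are faithful to the equations.

```python
import numpy as np

def relu(x: np.ndarray) -> np.ndarray:
    """Eq. (12): elementwise max(x, 0)."""
    return np.maximum(x, 0.0)

def residual_block(X: np.ndarray, w_short=1.0, w_path=0.5) -> np.ndarray:
    """Eq. (10) with the 1x1/3x3 convolution stacks replaced by scalar
    scalings (illustrative only): shortcut + transformed path, then ReLU."""
    return relu(w_short * X + w_path * X)

def gap(Y: np.ndarray) -> np.ndarray:
    """Eq. (15): global average pooling, (H, W, C) -> (C,),
    averaging each channel l over its H x W spatial grid."""
    return Y.mean(axis=(0, 1))
```

GAP collapses each feature map to a single scalar, which is why it ties channels directly to categories: every channel becomes one coordinate of the branch's output vector.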
Down branch
The down branch, in contrast to the top branch, replaces the convolutional layers in the residual blocks with atrous convolutions in order to handle inputs at different scales. Thanks to atrous convolution, the down branch can expand its receptive fields, providing sample-defining and contextual information for classification; the details of the residual blocks in the down branch are given in Table 3.
A standard convolution aggregates over all pixels within its kernel. In the down branch, atrous convolution enlarges the receptive field of the output units without increasing the kernel size. The receptive field can be computed as:
$$M=K+\left(r-1\right)\left(K-1\right)=rK-r+1$$
(16)
where \(r\) is the atrous rate, the kernel size is \(K\times K\), and the resulting receptive field is of size \(M\times M\). Generally, \(U\) and \(V\) are the input and output tensors of the atrous convolution, both in \({R}^{H\times W\times C}\), where \(H\) and \(W\) are the spatial dimensions and \(C\) the number of channels. Each channel of \({V}_{p,q}\in {R}^{c}\) at location \((p , q)\) is computed as follows:
$${V}_{p,q}=\sum_{i,j\epsilon {\Omega }_{\text{M}}}{\mu }_{i+\frac{M-1}{2},\, j+\frac{M-1}{2}}^{T}\,{U}_{p+i,\, q+j}$$
(17)
where
$${\Omega }_{\text{M}}=\left[\left(i,j\right):i=\left[-\frac{M-1}{2}, . . .,\frac{M-1}{2}\right], j=\left[-\frac{M-1}{2}, . . . ,\frac{M-1}{2}\right]\right]$$
(18)
where \({\Omega }_{\text{M}}\) is the local neighborhood and the vector \(\mu\) contains the parameters of the atrous convolution.
On top of the atrous convolutions, the output of a residual block in the down branch is computed as follows:
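Eq. (16) can be verified with a few lines of arithmetic. This sketch simply evaluates the formula; for example, a \(3\times 3\) kernel with atrous rate \(r=2\) covers a \(5\times 5\) field at no extra parameter cost.

```python
def atrous_receptive_field(K: int, r: int) -> int:
    """Eq. (16): effective receptive field M of a KxK kernel with
    atrous rate r. Expanding: K + (r-1)(K-1) = rK - r + 1."""
    return K + (r - 1) * (K - 1)
```

At \(r=1\) the formula reduces to \(M=K\), i.e., a standard convolution, which is why the top and down branches differ only in this dilation factor.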
$$V=\Psi \left({\mathcal{H}}_{4}\left(U, {\mu }_{4}\right)\oplus \mathcal{H}\left(U, \left[{\mu }_{i}\right]\right)\right)$$
(19)
where
$$\mathcal{H}\left(U, \left[{\mu }_{i}\right]\right)={\mathcal{H}}_{3}\left({\mathcal{H}}_{2}\left({\mathcal{H}}_{1}\left(U, {\mu }_{1}\right), {\mu }_{2}\right), {\mu }_{3}\right)$$
(20)
where \(V\) and \(U\) are the output and input of the residual blocks in the down branch. \({\mathcal{H}}_{1}\), \({\mathcal{H}}_{2}\), \({\mathcal{H}}_{3}\), and \({\mathcal{H}}_{4}\) denote atrous convolutions with filter sizes of \(1\times 1\), \(3\times 3\), \(3\times 3\), and \(1\times 1\), and \({\mu }_{1}\), \({\mu }_{2}\), \({\mu }_{3}\), and \({\mu }_{4}\) are the corresponding parameters.
The outputs of the two residual blocks are computed by the following equations.
$${V}_{d}^{1}=\Psi \left({\mathcal{H}}_{4}^{1}\left({D}_{2}, {\mu }_{4}^{1}\right)\oplus {\mathcal{H}}_{3}^{1}\left({\mathcal{H}}_{2}^{1}\left({\mathcal{H}}_{1}^{1}\left({D}_{2}, {\mu }_{1}^{1}\right), {\mu }_{2}^{1}\right), {\mu }_{3}^{1}\right)\right)$$
(21)
$${V}_{d}^{2}=\Psi \left({\mathcal{H}}_{4}^{2}\left({V}_{d}^{1}, {\mu }_{4}^{2}\right)\oplus {\mathcal{H}}_{3}^{2}\left({\mathcal{H}}_{2}^{2}\left({\mathcal{H}}_{1}^{2}\left({V}_{d}^{1}, {\mu }_{1}^{2}\right), {\mu }_{2}^{2}\right), {\mu }_{3}^{2}\right)\right)$$
(22)
where \({V}_{d}^{1}\) and \({V}_{d}^{2}\) are the outputs of the first and second blocks in the down branch, respectively. It should be emphasized that the lower-level feature maps \({D}_{2}\) are the input to Eq. (21), and \({V}_{d}^{1}\) is the input to Eq. (22).
GAP is then applied to \({V}_{d}^{2}\) to obtain the deep features of the down branch. The output of the down branch can be expressed as follows:
$$Branc{h}_{d}=\sigma \left({V}_{d}^{2}\right)=\frac{1}{H\times W}{\sum }_{i=1}^{H}{\sum }_{j=1}^{W}{V}_{i,j}^{l}$$
(23)
where the feature map was represented in the \({V}^{l}\in {R}^{H\times W}\) that has width W and height H in relation to channel \(l\) of \({V}_{d}^{2}\).
Eventually, the two-branch deep feature fusion (TDFF) module fuses the deep features of the two branches, \(Branc{h}_{t}\) and \(Branc{h}_{d}\), through a serial feature fusion process that combines the two types of distinctive features into a more informative and meaningful representation.
Classification
The fused deep semantic features are passed to the classification module, which contains a fully connected layer followed by a softmax layer used to predict the class label of the input.
Given the output of the fully connected layer (FCL), the softmax function is computed as follows:
$${\theta }_{i}=\frac{\text{exp}({z}_{i})}{\sum_{j=1}^{m}\text{exp}({z}_{j})}$$
(24)
$$\theta =\text{max}({\theta }_{1}, {\theta }_{2}, . . . , {\theta }_{m})$$
(25)
where \(Z=[{z}_{i}, i=1, 2, . . . , m]\) represents the output of the FCL, \(m\) is the number of class labels, \({\theta }_{i}\) is the probability that the input belongs to class \(i\), and the final output class label is represented by \(\theta\). In addition, the cross-entropy loss function used during classification is given below:
$$Loss=-\frac{1}{N}\sum_{n=1}^{N}\sum_{j=1}^{m}1\left[{y}^{n}=j\right]log{\theta }_{j}$$
(26)
where \({y}^{n}\) represents the true label of sample \(n\), \(m\) denotes the number of classes, \(N\) denotes the mini-batch size, and \(1[\cdot]\) is an indicator function: if \({y}^{n}=j\), then \(1\left[{y}^{n}=j\right]\) equals 1; otherwise it equals 0.
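Eqs. (24)-(26) can be sketched directly. This is a generic numpy implementation of softmax and the mini-batch cross-entropy loss, with the standard max-shift added for numerical stability (a common practice not stated in the equations themselves).

```python
import numpy as np

def softmax(z) -> np.ndarray:
    """Eq. (24): theta_i = exp(z_i) / sum_j exp(z_j).
    Subtracting max(z) first avoids overflow without changing the result."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(logits_batch, labels) -> float:
    """Eq. (26): averaged over the mini-batch of size N, the indicator
    1[y^n = j] picks out -log theta_j only for the true class."""
    N = len(labels)
    total = 0.0
    for z, y in zip(logits_batch, labels):
        theta = softmax(z)
        total -= np.log(theta[y])
    return total / N
```

The predicted label of Eq. (25) is then simply `np.argmax(softmax(z))` for a given logit vector `z`.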
Modified builder optimization algorithm
The theoretical foundations of the recently established BOA (Builder Optimization Algorithm) are explained in the present section. The methodical techniques that builders employ during construction serve as the inspiration for the BOA, whose main idea is founded on two principal construction techniques.
One is making major structural changes in compliance with the design specifications; the other is carrying out painstaking adjustments to improve appearance and address complex details. BOA's theoretical underpinnings are accordingly modeled mathematically in two stages: global and local search.
Inspiration of BOA
Building construction is a methodical process that requires careful planning and execution at several stages. Constructors use a systematic procedure to transform raw materials into a finished structure. The present process starts with a building shaping phase and ends with a refinement phase for aesthetics, details, and optimizations. Inspired by the current approach, BOA simulates these two crucial procedures in an optimization environment.
Every population solution in the BOA algorithm represents an ongoing structural design. The iterative improvement of these structures drives the optimization process, with each structure becoming more and more refined through modifications that adhere to an ideal design. The current process’s mathematical model aims to allow BOA to balance local exploitation (small adjustments for optimization) and global exploration (large structural modifications).
Algorithm initialization
BOA is a population-based metaheuristic in which each member of the population represents a potential building structure in the search space. The configuration of every structure is determined by a set of design parameters, so each structure is represented mathematically as a vector whose elements stand for distinct design parameters.
To ensure diversity in the initial solutions, the position of every structure in BOA is first initialized at random inside the search space. To distribute structures effectively within the feasible region, this initialization follows Eq. (27).
$${x}_{i,j}=l{b}_{j}+r\cdot \left(u{b}_{j}-l{b}_{j}\right)$$
(27)
here, \({X}_{i}\) denotes structure (candidate solution) \(i\), \(m\) the number of design parameters, \({x}_{i,j}\) the \(j\)th design parameter of structure \(i\), and \(N\) the total number of structures. The stochastic quantity \(r\) is uniformly distributed between zero and one, while \(u{b}_{j}\) and \(l{b}_{j}\) are, in turn, the upper and lower bounds of parameter \(j\). Together, the structures of BOA form a population matrix of decision variables.
$$X={\left[\begin{array}{c}{X}_{1}\\ \vdots \\ {X}_{i}\\ \vdots \\ {X}_{N}\end{array}\right]}_{N\times m}={\left[\begin{array}{ccccc}{x}_{1,1}& \cdots & {x}_{1,j}& \cdots & {x}_{1,m}\\ \vdots & \ddots & \vdots & & \vdots \\ {x}_{i,1}& \cdots & {x}_{i,j}& \cdots & {x}_{i,m}\\ \vdots & & \vdots & \ddots & \vdots \\ {x}_{N,1}& \cdots & {x}_{N,j}& \cdots & {x}_{N,m}\end{array}\right]}_{N\times m}$$
(28)
where, \(X\) is the BOA population matrix. Each candidate structure is evaluated by an objective function that gauges its degree of alignment with the ideal design. As shown in Eq. (29), the objective function values for the whole population are stored in a vector.
$$F={\left[\begin{array}{c}{F}_{1}\\ \vdots \\ {F}_{i}\\ \vdots \\ {F}_{N}\end{array}\right]}_{N\times 1}={\left[\begin{array}{c}F\left({X}_{1}\right)\\ \vdots \\ F\left({X}_{i}\right)\\ \vdots \\ F\left({X}_{N}\right)\end{array}\right]}_{N\times 1}$$
(29)
here, \(F\) is the vector of cost function values, with \({F}_{i}\) the cost function of structure \(i\). The structure exhibiting the best cost function value is recognized as the most efficient structure. During the optimization process, the positions of structures are iteratively refined over two principal stages, detailed in the following parts.
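The initialization of Eq. (27) and the evaluation of Eq. (29) can be sketched as follows; the sphere function stands in for an arbitrary cost \(F\), and all names and bounds are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_population(N, m, lb, ub):
    # Eq. (27): x_{i,j} = lb_j + r * (ub_j - lb_j), r ~ U(0, 1),
    # giving the N x m population matrix X of Eq. (28).
    r = rng.random((N, m))
    return lb + r * (ub - lb)

def evaluate(X, F):
    # Eq. (29): stack the cost of every candidate structure.
    return np.array([F(x) for x in X])

sphere = lambda x: float(np.sum(x ** 2))   # stand-in objective
lb, ub = np.full(4, -5.0), np.full(4, 5.0)
X = init_population(30, 4, lb, ub)
costs = evaluate(X, sphere)
best = X[np.argmin(costs)]                 # most efficient structure
```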
Phase 1: Extensive structural modifications (Exploration Phase)
Builders first focus on building a structure’s main framework during the construction phase. Major changes are made during this phase, and the design blueprint is followed in the installation of the main components, including the floors, walls, and beams. The current stage of BOA is known as the exploration stage, where candidate structures are significantly altered to explore different search space regions. The design configurations that exhibit superior objective function values serve as benchmarks for important modifications for every structure in the population. Equation (30) is used to determine the range of potential changes for each structure:
$$C{M}_{i}=\left\{{X}_{k}\mid {F}_{k}<{F}_{i},\, k\ne i\right\}$$
(30)
here, \(C{M}_{i}\) denotes the set of candidate modifications for structure \(i\), and \({X}_{k}\) is a structure configuration with a better (lower) cost function value. A modification reference is selected at random from this candidate set, and the position of the structure is updated by a mathematical model derived from the major structural alterations made during construction, Eq. (31). According to Eq. (32), the adjustment is accepted only if the updated configuration improves the objective function.
$${x}_{i,j}^{P1}={x}_{i,j}+I\cdot \text{cos}\left(\frac{\pi }{2}r\right)\cdot \left(S{M}_{i,j}-I\cdot {x}_{i,j}\right)$$
(31)
$${X}_{i}=\left\{\begin{array}{cc}{X}_{i}^{P1},& {F}_{i}^{P1}<{F}_{i}\\ {X}_{i},& \text{ else}\end{array}\right.$$
(32)
here, \(S{M}_{i}\) is the chosen modification reference, \(S{M}_{i,j}\) its \(j\)th parameter, \(r\) a stochastic quantity ranging from zero to one, and \(I\) a stochastic integer chosen from \(\{1, 2\}\). \({X}_{i}^{P1}\) is the structure updated in the first phase of BOA, \({x}_{i,j}^{P1}\) its \(j\)th design parameter, and \({F}_{i}^{P1}\) its cost function value.
Phase 2: Detailed refinements for aesthetic and structural optimization (exploitation phase)
After a building’s basic structure has been established, the next stage focuses on improving the structure by adding details, aesthetics, and functional optimizations. This involves subtle but important adjustments, such as aligning architectural elements, refining finishing materials, and ensuring structural balance. This stage of BOA is known as the exploitation stage, during which small modifications are made near the current structures to optimize their designs. Eq. (33) determines the new candidate positions, and Eq. (34) accepts an update only if it improves the objective function.
$${x}_{i,j}^{P2}={x}_{i,j}+\left(1-2\cdot \text{cos}\left(\frac{\pi }{2}r\right)\right)\frac{\left(u{b}_{j}-l{b}_{j}\right)}{t}$$
(33)
$${X}_{i}=\left\{\begin{array}{cc}{X}_{i}^{P2},& {F}_{i}^{P2}<{F}_{i}\\ {X}_{i},& \text{ else}\end{array}\right.$$
(34)
here, \({X}_{i}^{P2}\) is the structure updated in the second stage, \({F}_{i}^{P2}\) its cost function value, and \({x}_{i,j}^{P2}\) its \(j\)th design parameter. The stochastic quantity \(r\) again lies between zero and one, and \(t\) is the iteration counter. Through the interaction of these two stages, BOA efficiently optimizes complicated design problems by striking a balance between local exploitation and global exploration.
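A minimal sketch of one BOA iteration, combining the exploration update of Eqs. (30)-(32) with the exploitation refinement of Eqs. (33)-(34); the sphere objective, bounds, and population size are illustrative stand-ins:

```python
import numpy as np

rng = np.random.default_rng(1)

def boa_step(X, costs, F, lb, ub, t):
    N, m = X.shape
    for i in range(N):
        # Phase 1 (Eqs. 30-32): pick a better structure as the
        # modification reference and make a large move toward it.
        better = np.flatnonzero(costs < costs[i])
        SM = X[rng.choice(better)] if better.size else X[np.argmin(costs)]
        r = rng.random(m)
        I = rng.integers(1, 3)                       # random 1 or 2
        cand = np.clip(X[i] + I * np.cos(np.pi / 2 * r) * (SM - I * X[i]), lb, ub)
        fc = F(cand)
        if fc < costs[i]:                            # greedy acceptance, Eq. (32)
            X[i], costs[i] = cand, fc
        # Phase 2 (Eqs. 33-34): small refinement shrinking with t.
        r = rng.random(m)
        cand = np.clip(X[i] + (1 - 2 * np.cos(np.pi / 2 * r)) * (ub - lb) / t, lb, ub)
        fc = F(cand)
        if fc < costs[i]:                            # greedy acceptance, Eq. (34)
            X[i], costs[i] = cand, fc
    return X, costs

sphere = lambda x: float(np.sum(x ** 2))
lb, ub = np.full(4, -5.0), np.full(4, 5.0)
X = lb + rng.random((20, 4)) * (ub - lb)
costs = np.array([sphere(x) for x in X])
start = costs.min()
for t in range(1, 51):
    X, costs = boa_step(X, costs, sphere, lb, ub, t)
```

Because both phases accept a candidate only when it lowers the cost, the best cost is non-increasing across iterations.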
Modified builder optimization algorithm (MBOA)
The Modified Builder Optimization Algorithm (MBOA) is proposed to optimize the FPN model hyperparameters with improved convergence speed and performance. MBOA is motivated by the collective building behavior of social organisms such as termites or bees: agents construct candidate solutions, receive feedback on their quality, and refine them through a degree of local social interaction.
In the original Builder Optimization Algorithm (BOA), a population of \(N\) agents (builders) explores the solution space (here, the hyperparameter space), each constructing candidate solutions (hyperparameter sets) through additive update operations. Each agent \({\mathbf{x}}_{i}=\left[{x}_{i1},{x}_{i2},\dots ,{x}_{id}\right]\) represents a \(d\)-dimensional vector of hyperparameters, such as the learning rate, number of filters, dropout rate, and regularization coefficients.
The position update in BOA is governed by:
$${x}_{i,j}^{(t+1)}={x}_{i,j}^{(t)}+\alpha \cdot {r}_{1}\cdot \left({b}_{\text{best}}-{x}_{i,j}^{(t)}\right)+\beta \cdot {r}_{2}\cdot \left({b}_{\text{local}}-{x}_{i,j}^{(t)}\right)$$
(35)
where, \({b}_{\text{best}}\) is the global best solution, \({b}_{\text{local}}\) is a neighborhood best, \({r}_{1}\) and \({r}_{2}\) are random vectors in \([0,1]\), and \(\alpha\) and \(\beta\) control exploration and exploitation.
Modifications in MBOA:

1. Adaptive Step Control: The step size \(\alpha\) is dynamically adjusted based on solution diversity:
$${\alpha }^{(t)}={\alpha }_{0}\cdot \text{exp}\left(-\gamma \cdot \frac{t}{T}\right)\cdot \left(1+\delta \cdot \frac{{\sigma }^{(t)}}{{\sigma }_{0}}\right)$$
(36)
where, \({\sigma }^{(t)}\) is the population standard deviation at iteration \(t\), \({\sigma }_{0}\) its initial value, \(T\) the total number of iterations, and \(\gamma\) and \(\delta\) decay and diversity coefficients; the schedule encourages exploration early and exploitation late.
2. Elitist Selection: The top \(10\text{\%}\) of solutions are preserved unaltered in each generation to prevent loss of high-performing configurations.
3. Hybrid Gradient Feedback (Optional): For continuous variables (e.g., learning rate), a lightweight gradient signal from a validation batch is fused into the update:
$$\Delta {x}_{i}^{grad}=-\eta {\nabla }_{x}{\mathcal{L}}_{val}$$
(37)
which is combined with the BOA update for faster convergence.
The objective function optimized by MBOA is task-dependent: validation accuracy for classification, SSIM for restoration, or the FID score for generation. MBOA runs for 100 iterations with a population size of 30, searching within predefined bounds (e.g., learning rate in \(\left[{10}^{-5},{10}^{-2}\right]\), number of filters in \([32, 512]\)).
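A toy sketch of the MBOA search loop, combining the update of Eq. (35), the adaptive step of Eq. (36), and elitist selection; the fitness function is a smooth stand-in for validation performance, and the neighborhood best is crudely approximated by a random peer:

```python
import numpy as np

rng = np.random.default_rng(2)

# Smooth stand-in for validation fitness over two hyperparameters
# (log10 learning rate, filter count); higher is better, optimum at (-3, 256).
def fitness(x):
    return -((x[0] + 3.0) ** 2 + ((x[1] - 256.0) / 100.0) ** 2)

lb = np.array([-5.0, 32.0])     # lr in [1e-5, 1e-2] (log10), filters in [32, 512]
ub = np.array([-2.0, 512.0])
N, T = 30, 100
alpha0, beta, gamma, delta = 0.9, 0.4, 3.0, 0.5

X = lb + rng.random((N, 2)) * (ub - lb)
fit = np.array([fitness(x) for x in X])
sigma0 = X.std()

for t in range(T):
    # Eq. (36): decaying step size, re-inflated while diversity is high.
    alpha = alpha0 * np.exp(-gamma * t / T) * (1 + delta * X.std() / sigma0)
    elite = np.argsort(fit)[-N // 10:]        # top 10% preserved unaltered
    b_best = X[np.argmax(fit)]
    for i in range(N):
        if i in elite:
            continue
        b_local = X[rng.integers(N)]          # crude neighborhood-best stand-in
        r1, r2 = rng.random(2), rng.random(2)
        cand = X[i] + alpha * r1 * (b_best - X[i]) + beta * r2 * (b_local - X[i])
        cand = np.clip(cand, lb, ub)
        fc = fitness(cand)
        if fc > fit[i]:                        # keep only improvements
            X[i], fit[i] = cand, fc

best = X[np.argmax(fit)]
```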
Task-specific models
The FPN, once optimized by MBOA, is adapted to three distinct tasks in cultural heritage preservation.
Classification: For artistic style classification, the multi-scale features from \({P}_{3}\) to \({P}_{5}\) are concatenated and passed through global average pooling (GAP) to create a fixed-length vector, which a fully connected layer with softmax maps to K class probabilities.
$$\mathbf{z}=\text{GAP}\left(\text{Concat}\left({P}_{3},{P}_{4},{P}_{5}\right)\right),\quad \hat{y}=\text{Softmax}\left(\mathbf{W}\mathbf{z}+\mathbf{b}\right)$$
(38)
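Eq. (38) can be sketched in NumPy as follows; the feature-map shapes, the weight matrix \(\mathbf{W}\), and the bias \(\mathbf{b}\) are illustrative placeholders, not the trained parameters:

```python
import numpy as np

rng = np.random.default_rng(3)

def gap(P):
    # Global average pooling over the spatial axes of a (C, H, W) map.
    return P.mean(axis=(1, 2))

def classify(P3, P4, P5, W, b):
    # Eq. (38): z = GAP(Concat(P3, P4, P5)); y_hat = Softmax(Wz + b).
    z = np.concatenate([gap(P3), gap(P4), gap(P5)])
    logits = W @ z + b
    e = np.exp(logits - logits.max())
    return e / e.sum()

# Illustrative shapes: 256 channels per pyramid level, K = 5 style classes.
P3, P4, P5 = (rng.random((256, s, s)) for s in (64, 32, 16))
W, b = rng.standard_normal((5, 768)) * 0.01, np.zeros(5)
y_hat = classify(P3, P4, P5, W, b)
```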
Restoration
For restoring degraded or incomplete artworks, the FPN serves as the encoder in a U-Net-like autoencoder11. The encoded features are upsampled through transposed convolutions, with skip connections from the FPN levels restoring spatial details. The loss function combines an \({L}_{1}\) reconstruction term with a perceptual loss based on VGG-19 features:
$${\mathcal{L}}_{\text{rest }}={\lambda }_{1}{\Vert {\mathbf{I}}_{\text{true }}-{\mathbf{I}}_{\text{pred }}\Vert }_{1}+{\lambda }_{2}\sum_{l} {\Vert {\phi }_{l}\left({\mathbf{I}}_{\text{true }}\right)-{\phi }_{l}\left({\mathbf{I}}_{\text{pred }}\right)\Vert }_{2}^{2}$$
(39)
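A sketch of Eq. (39); the feature extractor phi is a simple average-pooling stand-in for the VGG-19 layers \({\phi }_{l}\), and the images and weights are illustrative:

```python
import numpy as np

def phi(I):
    # Stand-in for VGG-19 feature maps: one downsampled "layer"
    # obtained by 2x2 average pooling (the real phi_l are conv features).
    H, W = I.shape
    return I.reshape(H // 2, 2, W // 2, 2).mean(axis=(1, 3))

def restoration_loss(I_true, I_pred, lam1=1.0, lam2=0.1):
    # Eq. (39): lambda1 * L1 reconstruction + lambda2 * perceptual term.
    l1 = np.abs(I_true - I_pred).sum()
    perc = ((phi(I_true) - phi(I_pred)) ** 2).sum()
    return lam1 * l1 + lam2 * perc

rng = np.random.default_rng(4)
I_true = rng.random((64, 64))
I_pred = I_true + 0.05 * rng.standard_normal((64, 64))
loss = restoration_loss(I_true, I_pred)
```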
Generation
The FPN is combined with a conditional GAN architecture to generate new artworks in a given style. The FPN extracts stylized features from a reference artwork, which condition the generator. The generator (U-Net) produces high-resolution images, while the discriminator uses FPN features to judge both whether an image is realistic and whether it is stylistically consistent with the reference art.
The GAN generator is a U-Net-style encoder-decoder with skip connections, comprising 8 downsampling and 8 upsampling blocks. Each downsampling block consists of a 4 × 4 convolution, batch normalization, and LeakyReLU activation (\(\alpha =0.2\)), whereas each upsampling block uses a transposed convolution, batch normalization, and ReLU; the final layer applies a tanh activation to map the output to the input image range.
The style embedding generated by the FPN from a reference artwork is injected into the generator through adaptive instance normalization (AdaIN) at each upsampling layer, dynamically modulating feature statistics to match the desired style. The discriminator, a multi-scale PatchGAN operating at resolutions of 512 × 512, 256 × 256, and 128 × 128, captures both fine-scale textures and overall consistency. It is trained with spectral normalization for stability and uses FPN features of both the real and generated images to enforce stylistic consistency. This design ensures that the generated outputs are not merely photorealistic but also semantically consistent with the cultural and aesthetic characteristics of the source style. The adversarial loss is defined as:
$${\mathcal{L}}_{\text{GAN}}={\mathbb{E}}\left[\text{log}D\left({\mathbf{I}}_{\text{real}}\right)\right]+{\mathbb{E}}\left[\text{log}\left(1-D\left(G\left(\mathbf{z},\mathbf{s}\right)\right)\right)\right]$$
(40)
where, \(\mathbf{s}\) is the style embedding from the FPN. Figure 9 shows the task pipeline architecture.

Task pipeline architecture.
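The adversarial objective of Eq. (40) can be evaluated for batches of discriminator scores as below; the probability arrays are illustrative stand-ins for \(D({\mathbf{I}}_{\text{real}})\) and \(D(G(\mathbf{z},\mathbf{s}))\):

```python
import numpy as np

def adversarial_loss(d_real, d_fake, eps=1e-8):
    # Eq. (40): E[log D(I_real)] + E[log(1 - D(G(z, s)))],
    # where d_real / d_fake are discriminator probabilities in (0, 1).
    return np.mean(np.log(d_real + eps)) + np.mean(np.log(1 - d_fake + eps))

d_real = np.array([0.9, 0.8, 0.95])         # confident on real images
d_fake = np.array([0.1, 0.2, 0.05])         # confident fakes are fake
strong = adversarial_loss(d_real, d_fake)   # well-separating discriminator
d_fake_fooled = np.array([0.9, 0.8, 0.95])  # generator fools the discriminator
weak = adversarial_loss(d_real, d_fake_fooled)
```

The discriminator maximizes this objective, so the well-separating case scores higher than the fooled case.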
Training and optimization workflow
The complete training and optimization workflow follows a two-phase strategy:
1. Optimization Phase: MBOA is executed to search for the optimal configuration of FPN hyperparameters. In each iteration, a candidate configuration is used to instantiate the FPN, which is trained for a few epochs (e.g., 5) on the training set. The validation performance is recorded and used as the fitness value. After 100 iterations, the best-performing configuration is selected.
2. Training Phase: The FPN is instantiated with the optimized parameters and trained end-to-end on the training set for 100 epochs using the Adam optimizer (initial learning rate set by MBOA). ReduceLROnPlateau is employed for learning rate scheduling, together with early stopping (patience = 10).
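The scheduling logic of the Training Phase can be sketched as below; simulated_epoch is a stand-in for one epoch of training plus validation, and, unlike the real ReduceLROnPlateau scheduler (which has its own patience), this simplified version cuts the learning rate on every non-improving epoch:

```python
import numpy as np

rng = np.random.default_rng(5)

def simulated_epoch(lr, state=[1.0]):
    # Toy stand-in for one train+validate epoch: the validation loss
    # shrinks in proportion to the current learning rate, plus noise.
    # (The mutable default keeps the loss state across calls.)
    state[0] *= (1.0 - 50.0 * lr)
    return state[0] + 0.001 * rng.random()

def train_with_schedule(epochs=100, patience=10, factor=0.5, lr=1e-3):
    # Phase-2 loop: halve the learning rate when the validation loss
    # plateaus, and stop early after `patience` stale epochs.
    best, stale, run = float("inf"), 0, 0
    for epoch in range(epochs):
        run = epoch + 1
        val_loss = simulated_epoch(lr)
        if val_loss < best - 1e-4:            # meaningful improvement
            best, stale = val_loss, 0
        else:
            stale += 1
            lr *= factor                      # ReduceLROnPlateau-style cut
            if stale >= patience:
                break                         # early stopping
    return best, run, lr

best, epochs_run, final_lr = train_with_schedule()
```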
This workflow ensures that the model architecture and learning parameters are globally optimized before full training, leading to faster convergence and higher performance. The hybrid optimization-deep-learning approach leverages the global search capability of MBOA and the local refinement power of gradient descent. Figure 10 shows the workflow for training and optimization.

Training and optimization workflow.
This approach outlines a strong, scalable, and culturally aware framework for AI-driven cultural heritage preservation by integrating the representational power of deep networks with the intelligent exploration of bio-inspired optimization. The combination of FPN and MBOA delivers improved performance for the classification, restoration, and generation tasks, as presented in the upcoming experimental section.




