Floating point

GitLab & Google

Andrea Morzenti
14 Jul, 2023

In January of this year, the first Sircle Energy Customer Oriented (ECO), named ECO-Pegasus, was created. One goal of this group, among the others presented in the article by my colleague Patrizia Gardis, is to let people who specialize in different technologies interact, collaborate and, so to speak, "infect" one another, with the ultimate aim of satisfying the customer in the shortest possible time and at the highest achievable quality.

On this topic, I would like to share an intervention we carried out that highlights exactly this kind of collaboration and interaction.

The customer reported a problem with their GitLab On Cloud (GCP) infrastructure: all project pipelines were failing.

Working together across several technologies (FullAM - Sircle Next - Unix - Platform Innovation), we found that the runner nodes connected to GitLab, which autoscale on GCP, were not working.

The cause was that the operating system image used to instantiate the nodes had been deprecated by Google and was therefore no longer available. With help from the colleagues most experienced with GitLab, we replaced the image being pulled with a more recent one among those Google provides for the cloud, which resolved the problem. The steps were as follows.

Analyzing the error reported when creation of the runner node was attempted, we saw that the image the node was based on could not be found:

Error creating machine: Error in driver during machine creation: googleapi: Error 404: The resource 'projects/ubuntu-os-cloud/global/images/family/ubuntu-####-lts' was not found.
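When the same 404 appears in runner logs, the missing image path can be pulled out of the message mechanically. The following is a minimal shell sketch, assuming the error text has the shape quoted above. The function name missing_resource is hypothetical, and the #### placeholder is left exactly as it appears in the log.

```shell
# Hypothetical helper: print the resource path quoted in a googleapi 404 message.
missing_resource() {
  # Keep only the text between the single quotes that follow "The resource".
  printf '%s\n' "$1" | sed -n "s/.*The resource '\([^']*\)' was not found.*/\1/p"
}

# Error text as quoted in the article (version placeholder left untouched).
err="Error creating machine: Error in driver during machine creation: googleapi: Error 404: The resource 'projects/ubuntu-os-cloud/global/images/family/ubuntu-####-lts' was not found"

missing_resource "$err"
```

The printed path names both the image project and the image family, which is the information needed for the checks that follow.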

Next, we checked in the web console that the image had indeed been deprecated. As a cross-check, we ran the following command in Cloud Shell, which returned empty output:

gcloud compute images list | grep -i 'ubuntu-####*'
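One caveat when reproducing this check: by default, gcloud compute images list leaves deprecated images out, so empty output is consistent with deprecation but does not show the image's state. The --show-deprecated flag includes them. The sketch below is illustrative and is not the command run during the incident; the helper names and the "ubuntu-" prefix are assumptions.

```shell
# Hypothetical helper: build a gcloud filter that matches image families by prefix.
family_filter() {
  printf 'family ~ "^%s"' "$1"
}

# List matching public images, deprecated ones included, with their deprecation state.
# Requires an authenticated gcloud CLI (for example, Cloud Shell).
list_images_by_family() {
  gcloud compute images list \
    --show-deprecated \
    --filter="$(family_filter "$1")" \
    --format="table(name, family, deprecated.state)"
}

# Example call (not executed here):
#   list_images_by_family ubuntu-
```

Filtering on the family field rather than piping to grep keeps the match tied to the image family instead of any text on the line.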

Back in the web console, we checked which image version came immediately after it and whether it was available.

Once we had found the new image version and confirmed that it satisfied any applicable compatibility matrices, we updated the config.toml file so that the correct image would be used.
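The article does not show the config.toml contents. For runners autoscaled through the docker+machine executor, the image is commonly set as a google-machine-image entry in MachineOptions, so the sketch below assumes that layout. The option name, the sample file, the helper name set_machine_image, and the OLD-FAMILY and NEW-FAMILY values are all illustrative assumptions.

```shell
# Hypothetical helper: replace the google-machine-image value in a runner config.
# Usage: set_machine_image <config-file> <new-image>
set_machine_image() {
  tmp="$1.tmp"
  # Rewrite only the value after "google-machine-image=", up to the closing quote.
  sed "s|\(google-machine-image=\)[^\"]*|\1$2|" "$1" > "$tmp" && mv "$tmp" "$1"
}

# Sample fragment standing in for the real config.toml (contents assumed).
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
[[runners]]
  executor = "docker+machine"
  [runners.machine]
    MachineDriver = "google"
    MachineOptions = [
      "google-machine-image=ubuntu-os-cloud/global/images/family/OLD-FAMILY",
    ]
EOF

set_machine_image "$cfg" "ubuntu-os-cloud/global/images/family/NEW-FAMILY"

# Show the updated line.
grep google-machine-image "$cfg"
```

On a real runner host it would be prudent to back up the file before editing it and to confirm that new machines are actually created from the updated image.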

This allowed the required runner nodes to be created correctly, and the pipelines started running again.

The whole problem was identified, analyzed, and resolved very quickly, considering that the nature of the incident was initially unknown.

Otherwise, the problem would have been bounced between the technical Sircles in NGMS mentioned above just to establish its nature (cloud, infrastructure, application, and so on), significantly increasing both handling time and resolution time.

All of this was possible precisely because of the collaboration and integration among the technologies mentioned above, together with the key points described in my colleague's article, which I invite you to read.

Of course, this episode is only one of many similar cases whose key point is the interaction between different technologies, but it conveys well the idea behind the ECO model. One of our main goals is to take this kind of collaboration further, so that everyone has at least basic competence in the technologies outside their specialization, shortening resolution times even more across a range of cases.
