Tuesday, August 5, 2025

CST438 - Week 6

 


We are approaching the final weeks of the Software Engineering course. The team is now in its third iteration of system testing on the project and is making good progress; we look forward to finalizing the portal and publishing it to Amazon AWS in the coming week. This week focused on hands-on coding with the Selenium libraries to automate the testing and verification of data entries in our basic learning management system (LMS). The software engineering process and the interaction with team members during this project have been extremely helpful for gaining experience with team coding. As we progress, regular communication among team members is essential to the project's success. Using the GitHub organization model let the whole team see the project's progress as tasks advanced through each stage.
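A minimal sketch of the kind of Selenium check we wrote for the LMS this week, assuming a hypothetical `/entries` page with `title` and `submit` elements; the URL, element IDs, and table selector are all placeholders, and the pure `rows_match` helper keeps the comparison logic browser-independent:

```python
def rows_match(expected, actual):
    """Compare a submitted entry against a scraped table row, key by key."""
    return all(expected.get(k) == actual.get(k) for k in expected)

def verify_entry(base_url, entry):
    # Selenium is imported inside the function so the sketch loads even
    # on a machine without a browser driver installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    try:
        driver.get(base_url + "/entries")      # hypothetical LMS route
        driver.find_element(By.ID, "title").send_keys(entry["title"])
        driver.find_element(By.ID, "submit").click()
        # Scrape the rendered table and compare against what we submitted.
        cell = driver.find_element(By.CSS_SELECTOR, "table#entries td.title")
        return rows_match({"title": entry["title"]}, {"title": cell.text})
    finally:
        driver.quit()
```

Separating the comparison from the browser driving also let us unit-test the verification logic without spinning up a browser.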

The reading for this week focuses on compute as a service and how Google scaled its infrastructure management by automating compute and embracing containers as resources. Google runs two types of jobs in its environment: batch jobs and serving jobs. Batch jobs are short-lived, easy to start, and primarily throughput-oriented. Serving jobs are mainly stateful, longer-running jobs, such as IIS services, load balancers, and web services; they focus on handling full load, redundancy, and latency. One challenge Google faced with containers is the locality of critical state: losing a container means losing its local state, such as a database or storage. The solution is to move all critical state to external resources, while non-critical state can simply be recreated locally. In other words, containers become recyclable for anything non-critical. What about batch jobs that depend on local state? Google checkpoints batch jobs more frequently to reduce the amount of work lost. Caches, likewise, can be rebuilt, so losing them is not a problem.
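The checkpointing idea for batch jobs can be illustrated with a small sketch, assuming a hypothetical JSON checkpoint file and an arbitrary interval; a rescheduled container then loses at most one interval's worth of work:

```python
import json
import os
import tempfile

# Hypothetical checkpoint location; a real job would write to durable
# external storage, since a container's local disk vanishes with it.
CHECKPOINT = os.path.join(tempfile.gettempdir(), "batch_job.ckpt")

def load_checkpoint():
    """Return the index to resume from (0 if no checkpoint exists)."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)["next_index"]
    return 0

def run_batch(items, process, interval=100):
    """Process items, persisting progress every `interval` items."""
    for i in range(load_checkpoint(), len(items)):
        process(items[i])
        if (i + 1) % interval == 0:
            with open(CHECKPOINT, "w") as f:
                json.dump({"next_index": i + 1}, f)
    if os.path.exists(CHECKPOINT):
        os.remove(CHECKPOINT)  # job finished: clear the saved state
```

If the container is lost between checkpoints, the next run resumes from the last saved index instead of starting over from the beginning.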

Abstraction in design is necessary not only for scalability; it also provides a layer of protection against indirect internal changes. For instance, Hyrum's Law shows how a leaky abstraction invites unexpected dependencies on the PID range: when the pid_max setting changed, log systems broke and naming collisions appeared. Over time, we will need a CaaS that provides dynamic abstraction for specific behaviors and withstands rapid change, and containers can help us get there.
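A contrived Python illustration of that PID example (the limits below are real historical Linux defaults, but the logging functions are invented): code that silently assumes PIDs fit in five digits keeps working right up until pid_max is raised.

```python
OLD_PID_MAX = 32768      # classic Linux default: PIDs have at most 5 digits
NEW_PID_MAX = 4194304    # common modern 64-bit setting: up to 7 digits

def log_line_fragile(pid, msg):
    # Hidden Hyrum's-law dependency: assumes every PID fits in 5 columns,
    # so downstream parsers come to rely on a fixed-width prefix.
    return f"[{pid:5d}] {msg}"

def log_line_robust(pid, msg):
    # Makes no width assumption, so raising pid_max cannot shift columns.
    return f"[pid={pid}] {msg}"
```

Under the old limit every fragile prefix is the same width; the first seven-digit PID widens the line and breaks anything that parses the log by column.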

Google's Borg changed the handling of batch jobs and serving jobs, which previously ran on isolated machines, by combining everything under a shared pool. As a result, the unified compute system consolidated tools and administration, lowering the CapEx and OpEx of operations. Borg also handled dynamic workloads by reclaiming resources from batch jobs, lending unused overprovisioned capacity to serving jobs, or allocating idle compute to batch jobs. Although the unified Borg system maintained a robust balance of resources and efficiency, it came at the cost of higher complexity. Moreover, Google made a significant investment and drew on its own workforce to establish Borg as a robust container infrastructure, an option that is not readily available to most companies. Alternatively, the public cloud offers easy scaling and offloads infrastructure management, and a hybrid cloud is another flexible option, extending on-premises resources into the public cloud.
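The reclamation idea can be shown with a toy model (the machine names and numbers here are invented): serving jobs reserve capacity for peak load but usually use less, and a shared-pool scheduler can lend that slack to batch work, evicting it if serving usage rises.

```python
def reclaimable(machines):
    """Per-machine slack: cores reserved by serving jobs minus cores used."""
    return {name: m["reserved"] - m["used"] for name, m in machines.items()}

# Toy fleet: reservations track peak load, so actual usage sits lower.
fleet = {
    "m1": {"reserved": 32, "used": 20},  # 12 cores of slack for batch jobs
    "m2": {"reserved": 32, "used": 30},  # nearly full: only 2 cores of slack
}
```

Summing that slack across a fleet is what lets a shared pool run batch work on capacity that would otherwise sit idle.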


