How to Approach Software System Architecture Design - Weekly Sharing

Summary: This article frames software architecture design as a decision-making art centered on quantified trade-offs rather than seeking single correct answers. It examines how architects navigate constraints and conflicting demands through six classic trade-off dimensions: performance versus cost, flexibility versus complexity, availability versus consistency, security versus usability, standardization versus optimization, and foresight versus timeliness. The piece outlines core design activities from input analysis to documentation, emphasizing Architecture Decision Records (ADRs) and systematic testing methodologies including single-scenario, mixed-scenario, and endurance testing to validate decisions and ensure architectural quality under real-world constraints.

Software system architecture design is not a mathematical problem with a single correct answer, but rather an art of decision-making that seeks dynamic balance among complex factors. Its core lies in "trade-offs" and "choices," and in today's socially managed environment, these trade-offs must be "quantifiable."

1. The Core Nature of Software Architecture

Software systems always operate in real-world environments with limited resources and conflicting demands. Limited resources appear as constraints on time, budget, staffing, computing power, storage space, and network bandwidth, which together form a multidimensional boundary on what can be built. Every architectural decision is therefore a compromise that seeks the best solution under multiple constraints. Conflicting demands come from business stakeholders' desire to "have it all": rapid feature delivery together with system stability, support for millions of users together with controlled hardware costs, and flexibility for new features together with the simplicity that keeps maintenance cheap. The architect's core mission is to chart the best path through this space of constraints and contradictions, and the output of architectural design is essentially a complete set of decisions based on trade-offs.

2. The Value of an Architect

An architect's value is determined not by how many trendy technologies they master, but by their ability to make the most reasonable trade-off decisions in complex and contradictory real-world environments. An excellent architecture is never the theoretically perfect or most technologically advanced one, but rather the most balanced and appropriate series of choices made under the specific constraints of time, resources, team capabilities, and business objectives. Trade-offs form the core content of architectural design, represent the architect's daily work, embody the soul of architecture, and serve as the key differentiator between ordinary technical experts and senior architects.

3. The Challenges and Classic Dimensions of Trade-offs

Architects must make trade-offs throughout the entire architecture design process, primarily across several classic dimensions, with each dimension presenting significant challenges.

Performance vs. Cost

This is the most intuitive trade-off: higher performance typically implies higher costs. Key decisions include determining whether further optimizing response time justifies investing in more expensive hardware, more complex caching architectures, or additional development time; assessing whether asynchronous processing can replace costly real-time computation; and defining what constitutes a "good enough" performance standard.
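One way to make the "good enough" question quantifiable is to fix a latency target first and then pick the cheapest option that meets it. The sketch below is only an illustration: the `Option` type, the option names, and every latency and cost figure are hypothetical and would come from measurements or estimates in a real project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    name: str
    p95_latency_ms: float  # measured or estimated 95th-percentile response time
    monthly_cost: float    # hardware plus amortized development cost


def cheapest_good_enough(options, target_p95_ms):
    """Return the lowest-cost option that meets the latency target, or None."""
    candidates = [o for o in options if o.p95_latency_ms <= target_p95_ms]
    return min(candidates, key=lambda o: o.monthly_cost, default=None)


# Hypothetical figures, for illustration only.
options = [
    Option("baseline", 450.0, 1000.0),
    Option("add cache layer", 180.0, 1600.0),
    Option("async processing", 220.0, 1300.0),
    Option("bigger hardware", 150.0, 2500.0),
]

choice = cheapest_good_enough(options, target_p95_ms=250.0)
print(choice.name)
```

With a 250 ms target, three options qualify and the cheapest of them is selected; the faster but more expensive options are rejected because the extra speed is not required.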

Flexibility vs. Complexity

Highly flexible and easily extensible systems often have more complex internal structures. The critical decision is whether to choose a simple monolithic architecture for a rapid launch or a more flexible microservices architecture that accommodates future growth. Microservices introduce complexities such as service discovery, distributed transactions, and network latency, so the team's readiness must be assessed carefully. The goal is to avoid both premature optimization, the proverbial "root of all evil," and a lack of foresight that would later force a complete redesign.

Availability vs. Consistency

This classic trade-off from the CAP theorem is fundamental in distributed systems. When network partitions (P) occur, a choice must be made between consistency (C) and availability (A). For example, can an e-commerce shopping cart tolerate temporary data inconsistency to maintain availability (opting for AP), or must a banking transfer system sacrifice momentary availability to ensure strong data consistency (opting for CP)?
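The difference between the two choices can be shown with a toy model. The sketch below is not a real replication protocol: the `Replica` class, its `partitioned` flag, and the `pending_sync` list are hypothetical names used only to show how a CP-style node refuses writes during a partition while an AP-style node accepts them and reconciles later.

```python
class Replica:
    """Toy single-node model of the CP/AP choice during a network partition."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.partitioned = False  # True when this replica cannot reach its peers
        self.data = {}
        self.pending_sync = []    # writes accepted locally while partitioned

    def write(self, key, value):
        if self.partitioned:
            if self.mode == "CP":
                # Consistency first: refuse the write rather than risk divergence.
                raise RuntimeError("unavailable: cannot reach peers")
            # Availability first: accept locally and reconcile after the partition.
            self.pending_sync.append((key, value))
        self.data[key] = value
        return "ok"


cart = Replica("AP")
cart.partitioned = True
print(cart.write("cart:42", ["book"]))  # accepted despite the partition

bank = Replica("CP")
bank.partitioned = True
try:
    bank.write("account:7", 100)
except RuntimeError as err:
    print(err)  # the transfer is rejected until the partition heals
```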

Security vs. Usability

Perfectly secure systems are often difficult to use. Decisions involve determining how frequently users should change passwords and whether multi-factor authentication (MFA) imposes an excessive burden on users, thereby striking a balance between security protocols and user experience.

Standardization vs. Optimization

In large organizations or complex systems, a choice must be made between enforcing a unified technology stack and patterns across all modules or allowing teams to select technologies best suited to their specific contexts. A unified stack reduces maintenance and collaboration costs but may stifle innovation or lead to suboptimal solutions in certain scenarios. Thus, rules must be established to balance "centralized governance" and "autonomy."

Foresight vs. Timeliness

Over-engineering and under-engineering are common pitfalls. Decisions require evaluating how much effort to invest in designing for unknown future needs, deciding whether to accept some technical debt and launch a simplified version (a minimum viable product, MVP) to capture a market opportunity, and balancing "what is needed now" with "what may be needed later."

4. Core Activities in Architecture Design

The core activities of architecture design revolve around the architect and begin with various input materials, including project background, system objectives, business requirement documents, and user stories or use cases. Based on these inputs, the architect conducts architecture design and decision-making, applying professional expertise to perform trade-off analysis. This process encompasses overall design as well as detailed architectural designs such as logical architecture, data architecture, technical architecture, and physical architecture. It also involves considering key functional designs, exception handling design, and system peripheral relationships. Throughout the design process, continuous evaluation and optimization are essential. Evaluations reference findings from performance testing and other sources, ultimately producing comprehensive architecture design documentation that clarifies architectural goals, principles, and detailed designs at all levels, providing clear guidance for subsequent development phases.

5. Two Key Activities in Architecture Design

Design (Decision-Making)

Design (decision-making) forms the core phase of architecture design. Architects must leverage their professional knowledge and experience to balance various factors such as performance, cost, and scalability, ensuring the designed architecture meets business needs and system objectives. A key output of this phase is the "Architecture Decision Record (ADR)," which preserves important decision knowledge for future reference.
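An ADR is usually a short text with a handful of fixed fields. The sketch below models one as a small Python record and renders it as plain text. The field set (status, context, decision, consequences, alternatives) follows a widely used ADR layout, and the example content is hypothetical; a team would adapt both to its own conventions.

```python
from dataclasses import dataclass, field


@dataclass
class ADR:
    number: int
    title: str
    status: str        # e.g. "proposed", "accepted", "superseded"
    context: str       # constraints and forces behind the decision
    decision: str      # what was chosen
    consequences: str  # what is gained and what is given up
    alternatives: list = field(default_factory=list)

    def render(self):
        """Render the record as plain text suitable for version control."""
        lines = [
            f"ADR-{self.number:04d}: {self.title}",
            f"Status: {self.status}",
            f"Context: {self.context}",
            f"Decision: {self.decision}",
            f"Consequences: {self.consequences}",
        ]
        if self.alternatives:
            lines.append("Alternatives considered: " + "; ".join(self.alternatives))
        return "\n".join(lines)


# Hypothetical example content.
adr = ADR(
    number=1,
    title="Shopping cart favors availability over consistency",
    status="accepted",
    context="Cart writes must succeed during network partitions.",
    decision="Accept cart writes locally and reconcile after the partition.",
    consequences="Carts may briefly diverge; checkout re-validates contents.",
    alternatives=["Reject cart writes during partitions"],
)
print(adr.render())
```

Recording the rejected alternatives and the consequences, not only the decision, is what lets a later reader reconstruct the trade-off that was made.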

Evaluation (Testing)

Evaluation (testing) runs throughout the entire architecture design process, serving as a critical activity to verify the effectiveness and rationality of design decisions. Through evaluation (testing), issues and potential risks in the design can be identified promptly, enabling optimization and adjustment of the design to ensure the final architecture delivers high quality and feasibility. Performance testing stands as the most crucial component, requiring scenario-based performance testing for key business functions. Common testing types include single-scenario testing, mixed-scenario testing, and endurance testing.

6. Evaluating Architecture Design

The key to evaluation lies in testing, and single-scenario testing, mixed-scenario testing, and endurance testing are three critical types of performance testing. They do not form a complete set of testing requirements, but they are generally sufficient for implementation projects.

Single-Scenario Testing (Benchmarking)

This measures the best achievable performance and the baseline data of a single business process without interference, establishing the maximum processing capacity of that process and providing a reference for mixed-scenario testing. For example, stress test the "user login" interface on its own, gradually increasing concurrency and observing how its performance behaves.
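The ramp-up idea can be sketched with the standard library alone. In the sketch below, `fake_login` is a hypothetical stand-in that only sleeps; a real test would call the actual login interface, usually through a dedicated load-testing tool, and would run far more requests per step.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_login():
    """Hypothetical stand-in for one call to the login interface."""
    start = time.perf_counter()
    time.sleep(0.005)  # simulate server-side work
    return time.perf_counter() - start


def run_step(concurrency, requests_per_step):
    """Run one load step; return (throughput in req/s, mean latency in s)."""
    started = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(lambda _: fake_login(), range(requests_per_step)))
    elapsed = time.perf_counter() - started
    return requests_per_step / elapsed, sum(latencies) / len(latencies)


def ramp(levels, requests_per_step=20):
    """Increase concurrency step by step and record the result of each step."""
    return {c: run_step(c, requests_per_step) for c in levels}


results = ramp([1, 2, 4])
for concurrency, (rps, latency) in results.items():
    print(f"concurrency={concurrency} throughput={rps:.0f} req/s "
          f"mean latency={latency * 1000:.1f} ms")
```

Against a real interface, the point of interest is the concurrency level at which throughput stops rising while latency keeps climbing; that level approximates the maximum processing capacity of the scenario.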

Mixed-Scenario Testing (Load Testing)

This simulates real production environments by executing multiple business processes simultaneously according to specific proportions, thereby evaluating the system's overall performance and stability under comprehensive, complex business pressures. This represents the testing method closest to real-world conditions. For instance, simulating 1,000 online users with 20% performing login operations, 50% browsing products, 25% adding items to shopping carts, and 5% completing orders and payments.
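The scenario mix from the example can be written down as a small table and used in two ways: to split a fixed number of virtual users across scenarios, or to draw the next action for a user at random with the same weights. The scenario names and helper functions below are hypothetical; only the percentages and the 1,000-user figure come from the example above.

```python
import random

# Share of virtual users per scenario, in percent (from the example above).
MIX = {
    "login": 20,
    "browse_products": 50,
    "add_to_cart": 25,
    "checkout_and_pay": 5,
}


def allocate_users(total_users, mix):
    """Split a fixed number of virtual users across scenarios by percentage."""
    if sum(mix.values()) != 100:
        raise ValueError("scenario percentages must sum to 100")
    return {name: total_users * pct // 100 for name, pct in mix.items()}


def next_action(rng, mix):
    """Pick the next scenario for one virtual user, weighted by the mix."""
    names = list(mix)
    return rng.choices(names, weights=[mix[n] for n in names], k=1)[0]


allocation = allocate_users(1000, MIX)
print(allocation)
print(next_action(random.Random(), MIX))
```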

Endurance Testing (Durability Testing)

This applies steady (or high) load to the system and keeps it running for an extended period (such as 8 hours, 24 hours, or longer), aiming to expose latent defects such as memory leaks, unreleased resources, or database connection pool exhaustion. For example, run the system at 80% of its maximum load capacity continuously for 12 hours while monitoring whether memory usage grows continuously.
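"Continuous growth" can be checked numerically by fitting a straight line to the memory samples collected during the run and looking at its slope. The sketch below uses an ordinary least-squares fit; the sample data and the 5 MB per hour threshold are hypothetical, and a real threshold depends on the system under test.

```python
def slope_per_hour(samples):
    """Least-squares slope of (hours, memory_mb) samples, in MB per hour."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    numerator = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    denominator = sum((t - mean_t) ** 2 for t, _ in samples)
    return numerator / denominator


def looks_like_leak(samples, max_growth_mb_per_hour=5.0):
    """Flag a run whose memory trend rises faster than the allowed rate."""
    return slope_per_hour(samples) > max_growth_mb_per_hour


# Hypothetical hourly samples over a 12-hour run.
leaking = [(h, 500 + 20 * h) for h in range(13)]                 # steady climb
stable = [(h, 500 + (3 if h % 2 else -3)) for h in range(13)]    # small oscillation

print(looks_like_leak(leaking), looks_like_leak(stable))
```

A trend line is more robust than comparing only the first and last sample, because garbage collection and caching make individual memory readings noisy.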

In addition, chaining several sequentially dependent scenarios together gives the currently popular "end-to-end testing." More fundamentally, the scenarios tested must reflect actual business conditions, with test targets that include some buffer (for example, 1.3 or 1.5 times the peak value as the stress-testing target), to reduce the risk of problems at delivery and launch. A commonly adopted principle is: design for 10 times the expected load, test for 5 times, and launch with capacity for 3 times. Setting appropriate targets matters: if stress-testing targets are set too high, the resources provisioned to meet them will sit idle after launch and be wasted.
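The arithmetic behind these targets is simple enough to keep in a small helper so that the numbers used in test plans are reproducible. The function names and the 2,000 requests-per-second figure below are hypothetical; the multipliers are the ones quoted in the text.

```python
def stress_target(peak_rps, buffer=1.3):
    """Stress-test target: observed peak load times a safety buffer."""
    return peak_rps * buffer


def capacity_plan(expected_rps):
    """Apply the 10x design / 5x test / 3x launch rule of thumb."""
    return {
        "design_for": expected_rps * 10,
        "test_for": expected_rps * 5,
        "launch_with": expected_rps * 3,
    }


# Hypothetical expected peak of 2,000 requests per second.
print(stress_target(2000, buffer=1.3))
print(stress_target(2000, buffer=1.5))
print(capacity_plan(2000))
```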