Performance Test Results¶
The performance of MWARE ESB was measured for GraphQL APIs. While the schema, the queries, and the responses were mimicked using the Starwars API, the backend used for the test is a Netty backend. The Netty service is configured so that the response is aligned with the GraphQL request in order to only provide the data requested by a particular query.
Tests were done using 50, 100, 200, 500, and 1000 concurrent users. Concurrent users mean that it consists of multiple users accessing the API Gateway at the same time. Three GraphQL queries were used for the tests. These queries were constructed with varying operation counts. First query with 1 operation, second with 4 operations, and third with 9 operations. It is important to note that when this operation count increases, the higher the workload for ESB. Details regarding the query variations are outlined below. The three queries and their responses can be found here.
Query | No. of Operations | Query Depth | Query Size | Response Size |
---|---|---|---|---|
Query 1 | 1 | 2 | 157B | 790B |
Query 2 | 4 | 2 | 1KB | 5KB |
Query 3 | 9 | 2 | 2KB | 27KB |
Two key performance metrics were used to measure the performance of each test.
- Throughput: This measures the number of API invocations that the ESB Gateway server processed during a specific time interval (e.g., per second).
- Response Time: This measures end-to-end processing time for an operation (of invoking an API using HTTPS protocol). The complete distribution of response times was recorded.
Deployment used for the test¶
Name | EC2 Instance Type | vCPU | Mem(GiB) |
---|---|---|---|
Apache JMeter Client | c5.large | 2 | 2 |
Apache JMeter Server 01 | c5.xlarge | 4 | 4 |
Apache JMeter Server 02 | c5.xlarge | 4 | 4 |
Netty HTTP Backend | c5.xlarge | 4 | 4 |
MWARE ESB | c5.large | 2 | 2 |
- Token type: JWT
- The operating system is Ubuntu 18.04
- MySQL version in RDS instance is 5.7
- Java version is JDK 11
Performance test scripts¶
All scripts used to run the performance tests and analyze results are in the following repositories:
Results¶
The complete results can be found here.
The following graph depicts the throughput changes based on the number of concurrent users for different GraphQL queries.
Key observations:
-
More concurrent users mean more requests to the ESB Gateway. Therefore, the throughput of the ESB Gateway increases as the number of concurrent users accessing the APIs increases. The maximum throughput is observed around 50 and 100 concurrent users, and the throughput degrades slightly after 100 concurrent users due to resource contentions in the system. The degradation point mainly depends on hardware resources. Note that a similar pattern can be seen across all three queries.
-
The highest throughput values were observed for Query 1, while the lowest values were observed for Query 3. For a given API invocation, the request payload, which includes the GraphQL query, is processed by the ESB Gateway before the request is sent to the backend. Operation count within this query has an effect on performance as we provide QoS (quality of service) based on operations. Similarly, the query size has an effect on the performance as the number of fields requested by the query is increased eventhough the query depth is kept constant. The ESB observes a considerable overhead when the query size and operation count keeps increasing due to the payload processing time. As the query sizes used for the tests are 157B, 1KB and 2KB, and the operation counts used for the tests are 1, 4 and 9, the growth is exponential and that justifies the throughput numbers depicted in the above plot.
The following graph depicts the average response time changes based on the number of concurrent users for different GraphQL queries.
Key observations:
-
The average response time increases with the number of concurrent users. As the number of requests to serve increases with more users, there are more resource contentions. Therefore, the number of concurrent users served by the API Gateway needs to be decided on the required response time limits. Again, the average response time also varies as the hardware resources change. For example, a user can achieve a lower response time if the number of CPUs allocated is increased.
-
Query 3 shows the highest response time due to the performance overhead caused by the payload processing time at the ESB Gateway.
Let’s look at the 90th, 95th, and 99th response time percentiles. It is useful to measure the percentage of requests that exceeded the response time value for a given percentile. A percentile can be used to determine the percentage of requests completed below the particular response time value.
For example, when there are 100 concurrent users, the 90th response time percentile for Query 1 is 102ms. This means that 10% of the requests have taken more than 102ms to respond. Similarly, the 99th response time percentile for Query 1 is 201ms, which means that 99% of the requests have been completed within 201ms.
The figure below represents the percentile values for Query 1.
Key observations:
-
Query 2 and Query 3 scenarios are relatively slower than the preceding results due to the performance overhead caused by the payload processing at the ESB Gateway.
-
Response Times percentiles are less than 750ms up to 300 concurrent users for all three queries.
Note
1000 concurrent users mean a lot, and it is not very common. It is recommended to scale horizontally or vertically to support more concurrent users with acceptable response times. When scaling horizontally, two or more Gateway nodes need to be used with a load balancer. Another load test must be carried out to measure the performance after scaling.
In order to see the memory usage, the Garbage Collection (GC) logs in the ESB were enabled using the following flags.
-XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:"$CARBON_HOME/repository/logs/gc.log
In addition, the GC log for each performance test was analyzed using the GCViewer. The GC Throughput was calculated for each test to check whether GC operations are not impacting the performance of the server. The GC Throughput is the time percentage of the application, which was not busy with GC operations. For example, if the application ran for 10 minutes and 30 seconds were taken for GC operations, the GC Throughput is (1 - 30/(10*60)) * 100 = 95%. A GC Throughput over 90% is good, and that means the allocated heap was enough to handle all concurrent requests, which allocate objects in the memory. On the contrary, if the GC throughput is a lower value, it indicates that the allocated memory is not enough for ESB to function efficiently.
The following chart depicts the GC Throughput (%) based on the number of concurrent users for different GraphQL queries.
Top