Phyllo is a data gateway to access creator data from hundreds of source platforms like YouTube, Twitter, TikTok, Substack, Discord, Twitch, OpenSea, Shopify, etc. Phyllo creates the underlying infrastructure that links to every creator platform, maintains a live data feed to the systems that these platforms use to handle creator data, and delivers a normalised data set so that organisations can make simple but impactful use of creator data.
The creators connect their platform accounts using Phyllo’s Connect SDK. They provide consent to fetch the data from these platforms, and then Phyllo streams the data for each creator from these platforms.
Developers need the latest creator data to make critical business decisions, so we notify them of events such as:
- when the creator’s account gets connected & synced for the first time
- when any data updates are available for the account
- when there is any change in account connection status.
Phyllo has several hundred thousand connected creator accounts whose data is updated frequently, and each data change needs to be propagated back to developers in near real-time. Developers could poll our APIs directly for these changes, but that creates high network load and uses resources inefficiently.
To solve this problem, Phyllo implemented webhooks, which deliver messages efficiently and reliably whenever a data change is available. This protects our infrastructure by reducing the number of API calls. It also simplifies client integration, since developers don't have to implement periodic polling logic.
Problem Statement
Since the product's launch, the number of connected accounts has been increasing exponentially. Due to this, we have to sync many connected accounts daily. As a result, the number of webhook requests has grown exponentially.
Key challenges we encountered along the way:
- Failures: A developer's webhook URL may not respond correctly or may return errors.
- Reliability: Sending millions of webhooks in real-time is complex and error-prone.
- Security: Developers want security measures built in to trust the incoming webhook notifications.
In this blog post, we’ll walk you through some of the learnings and design choices that helped solve these challenges at Phyllo.
High-Level System Architecture
How We Made It Reliable and Fault-Tolerant
We aim to build the system with 100% reliability. It should be able to support all kinds of businesses and developers irrespective of their size and expertise in building the systems.
The design challenges that are important to solve:
- The developer's webhook endpoint returns a 5xx error. Their system might be down or encountering issues for reasons beyond our control. We consider this a delivery failure.
- The developer's endpoint takes longer than the predefined timeout to process the request.
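The two failure conditions above can be sketched as a small delivery helper. This is a minimal illustration, not Phyllo's implementation: the function name `deliver`, the `opener` parameter, and the 10-second timeout are assumptions, and the text does not say how non-5xx error responses are handled, so this sketch treats them as delivered.

```python
import urllib.error
import urllib.request

TIMEOUT_SECONDS = 10  # hypothetical value; the actual limit is not stated


def deliver(url, body, opener=urllib.request.urlopen):
    """POST a webhook body; return True if delivered, False on delivery failure."""
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with opener(request, timeout=TIMEOUT_SECONDS) as response:
            return response.status < 500
    except urllib.error.HTTPError as err:
        # urlopen raises HTTPError for 4xx/5xx; only 5xx counts as a failure here.
        return err.code < 500
    except OSError:
        # Covers timeouts and connection errors (URLError, TimeoutError).
        return False
```

The `opener` argument exists only so the helper can be exercised without a network connection.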
In this section, we describe the architectural design decisions that helped us solve these challenges.
Selecting a message broker
We rely heavily on AWS for our infrastructure, so we wanted a managed service that is highly available and supports message timeout (TTL) functionality. We evaluated ActiveMQ, and it met these requirements.
Handling Errors
Webhooks must be delivered reliably. To achieve this, we designed a retry mechanism that handles the failure scenarios. It uses an exponential backoff strategy and a retry policy intended to cover 99% of our developers' failures.
RETRY POLICY:
We have created five queues.
- MAIN QUEUE: The core message queue that receives the signal to send webhooks. Each message carries metadata in its header that tracks the retry count (x-retry), which is incremented whenever the webhook fails.
- RETRY QUEUE 1: Messages that fail for the first time are published here. The queue is configured with a TTL of 5 minutes per message. When the TTL expires, the messages return to MAIN QUEUE, which is configured as the Dead Letter Queue, and are retried.
- RETRY QUEUE 2: Messages that fail for the second time are published here. The queue is configured with a TTL of 60 minutes per message. When the TTL expires, the messages return to MAIN QUEUE, which is configured as the Dead Letter Queue.
- RETRY QUEUE 3: Messages that fail for the third time are published here. The queue is configured with a TTL of 360 minutes per message. When the TTL expires, the messages return to MAIN QUEUE, which is configured as the Dead Letter Queue.
- ALERT QUEUE: Messages that fail for the fourth time are published here. We send an automated email alert to notify the developers about the fault in their system, and we notify our customer success team so they can help the developers resolve it.
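The routing described by this retry policy can be sketched as a lookup from the failure count to a queue and TTL. This is an illustrative sketch only: the queue identifiers and the function name `route_failed_message` are hypothetical, while the `x-retry` header and the 5, 60, and 360 minute TTLs come from the policy above.

```python
from datetime import timedelta

# Retry queues in order of use, with the per-message TTL from the policy.
RETRY_QUEUES = [
    ("RETRY_QUEUE_1", timedelta(minutes=5)),
    ("RETRY_QUEUE_2", timedelta(minutes=60)),
    ("RETRY_QUEUE_3", timedelta(minutes=360)),
]
ALERT_QUEUE = "ALERT_QUEUE"


def route_failed_message(headers):
    """Pick the destination for a message whose webhook delivery just failed.

    Returns (queue_name, ttl, updated_headers); ttl is None for the alert queue.
    """
    failures = int(headers.get("x-retry", 0)) + 1
    updated_headers = {**headers, "x-retry": failures}
    if failures <= len(RETRY_QUEUES):
        queue_name, ttl = RETRY_QUEUES[failures - 1]
        return queue_name, ttl, updated_headers
    # Fourth failure: no more retries, raise an alert instead.
    return ALERT_QUEUE, None, updated_headers
```

Under this sketch, a message that keeps failing is retried after 5 minutes, then 60 minutes, then 360 minutes, and lands in the alert queue on its fourth failure.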
How We Secured it
--> Supporting the latest TLS security protocols
Older versions of TLS have known security concerns. Phyllo's production system does not support TLS 1.0 or 1.1, so integrations must use a newer version of the TLS protocol.
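On the receiving side, this means a developer's endpoint has to accept TLS 1.2 or newer. A minimal sketch, assuming a Python server built on the standard `ssl` module; the function name is hypothetical and certificate loading is omitted:

```python
import ssl


def make_server_context() -> ssl.SSLContext:
    """Build a server-side TLS context that refuses TLS 1.0 and 1.1."""
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context
```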
--> Providing a set of webhook IPs
Developers need to trust the Phyllo system before accepting messages. We provide a list of all the IP addresses we use to deliver webhooks, so developers can whitelist them.
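A receiving service could check the source address of each request against that list. The sketch below is an assumption about how a developer might do this: the addresses are placeholders from the reserved documentation ranges, not Phyllo's real IPs, and `is_allowed_source` is a hypothetical helper.

```python
import ipaddress

# Placeholder networks (documentation ranges); substitute the published IP list.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("203.0.113.10/32"),
    ipaddress.ip_network("198.51.100.0/28"),
]


def is_allowed_source(remote_ip: str) -> bool:
    """Return True if remote_ip falls inside one of the allowed networks."""
    try:
        address = ipaddress.ip_address(remote_ip)
    except ValueError:
        # Not a valid IP address at all.
        return False
    return any(address in network for network in ALLOWED_NETWORKS)
```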
--> Generating Hashed Signature
We generate a hash signature with each payload.
**X-Phyllo-Signature**: The hash signature is calculated using HMAC with the SHA256 algorithm, with your webhook secret as the key and the webhook request body as the message.
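A receiver can recompute the signature and compare it with the value in the header. This is a minimal sketch under stated assumptions: the text does not specify how the digest is encoded, so hex encoding is assumed here, and the function names are hypothetical.

```python
import hashlib
import hmac


def compute_signature(secret: str, body: bytes) -> str:
    """HMAC-SHA256 of the raw request body, keyed with the webhook secret."""
    # Assumption: the digest is hex-encoded; adjust if the header uses another encoding.
    return hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()


def is_valid_signature(secret: str, body: bytes, received_signature: str) -> bool:
    """Compare the received X-Phyllo-Signature value with the expected one."""
    expected = compute_signature(secret, body)
    # compare_digest runs in constant time, which avoids leaking timing information.
    return hmac.compare_digest(
        expected.encode("utf-8"), received_signature.encode("utf-8")
    )
```

The signature should be computed over the raw request bytes exactly as received, since re-serialising the JSON can change the bytes and therefore the hash.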
Conclusion
We launched this system into production in December last year. We are optimizing it regularly and can see much better results now.
As of now, Phyllo is serving 3 million webhook requests daily with 100% reliability.
This article was originally published here.