Solving open source software issues can be difficult, whether you’re working with a support vendor or troubleshooting on your own. In this video, Doug Whitfield, an Enterprise Architect from OpenLogic by Perforce, shares tips on how to efficiently conduct root cause analysis, document incidents, and avoid bottlenecks so you can quickly address problems in your open source stack – and keep them from happening again.
Need Support for Your OSS Stack?
OpenLogic supports more than 400 open source technologies, including the top Enterprise Linux distributions, middleware, databases, frameworks, and cloud-native tools. Talk to an expert today to get SLA-backed technical support up to 24/7/365.
Additional Resources
- Datasheet – OpenLogic Technical Support and Services
- Blog – 10 Reasons Why Companies Choose OpenLogic for OSS Support
- Blog – Exploring the Differences Between Community FOSS, Open Core, and Commercial OSS
- White Paper – State of Open Source Report
Video Transcript
Hi, I’m Doug Whitfield, and I’m an Enterprise Architect with OpenLogic by Perforce.
No matter how much you plan and test your software and infrastructure, at some point, you are going to have a problem. It might be big. It might be small. But sooner or later, something will stop working the way it’s supposed to, and if you’re watching this video, you’re probably the person who has to figure out why – and how to keep it from happening again.
Today I’m going to talk about what you can do to keep your team calm and get to a resolution as quickly and painlessly as possible. The process might involve engaging with an open source community, a managed services provider, or a commercial support vendor like OpenLogic. However, no matter who is involved, there are some basic things you can do on your end to keep things on track.
Tip #1: Treat Your Issue Like a Crime Scene
Unlike a stolen crown jewel, your digital infrastructure can be easily reprovisioned, but when you reprovision, you need to make sure you don’t destroy the data that holds the clues. Logrotate may destroy logs after a certain period, and Kafka has a default retention policy of 7 days. It is important to get that data out of the reach of automatic destruction.
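For example, a minimal sketch of getting logs out of harm’s way before they age out might look like the following. The log locations and archive directory are only placeholders; substitute the paths your services actually write to (and for Kafka, you would instead raise retention on the affected topic or copy the segment files).

```python
#!/usr/bin/env python3
"""Copy logs into an incident archive before logrotate or retention
policies delete them. All paths below are examples, not fixed locations."""
import glob
import shutil
import time
from pathlib import Path

# Assumed locations -- adjust to the directories your services actually use.
LOG_GLOBS = [
    "/var/log/postgresql/*.log",
    "/var/log/kafka/server.log*",
]
ARCHIVE_DIR = Path("/srv/incident-archive") / time.strftime("%Y%m%d-%H%M%S")

ARCHIVE_DIR.mkdir(parents=True, exist_ok=True)
for pattern in LOG_GLOBS:
    for src in glob.glob(pattern):
        # copy2 preserves timestamps, which matter when reconstructing a timeline
        shutil.copy2(src, ARCHIVE_DIR)
print(f"Archived logs to {ARCHIVE_DIR}")
```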
This is particularly tricky with containers; one way to implement this policy is to keep your logs on persistent volumes, so you don’t lose them when a pod restarts.
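As a rough sketch of that idea, here is one way it might look using the official Kubernetes Python client; the pod name, image, claim name, and mount path are placeholders for illustration.

```python
"""Sketch: mount a PersistentVolumeClaim at the container's log directory so
logs survive pod restarts. Names, image, and paths are placeholders."""
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running in-cluster

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="app-with-durable-logs"),
    spec=client.V1PodSpec(
        containers=[
            client.V1Container(
                name="app",
                image="example/app:latest",  # placeholder image
                volume_mounts=[
                    # whichever directory the app logs to is backed by the PVC
                    client.V1VolumeMount(name="logs", mount_path="/var/log/app"),
                ],
            )
        ],
        volumes=[
            client.V1Volume(
                name="logs",
                persistent_volume_claim=client.V1PersistentVolumeClaimVolumeSource(
                    claim_name="app-logs-pvc"  # an existing claim
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```

The same thing can be expressed declaratively in a plain manifest; the point is simply that the log directory sits on storage that outlives the pod.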
Make sure your team knows what information to collect before an issue happens. If you’re partnering with a support vendor like OpenLogic, the faster you send us what we need, the faster we can help you. For example, for a PostgreSQL server, you will need to send us logs, configs, and a sosreport. Make sure the sosreport contains SAR data! To collect SAR data, you need to have sysstat installed, enabled, and running.
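As an illustration, a small pre-incident script along these lines can confirm sysstat is running and stage the data a vendor will ask for; the service name, commands, and output locations are assumptions to adapt to your distribution.

```python
"""Sketch of a pre-incident checklist for a PostgreSQL support case:
verify sysstat is running, capture SAR data, and remind the operator what
else to gather. Service names and paths are assumptions; adjust them."""
import subprocess

def run(cmd):
    print(f"$ {' '.join(cmd)}")
    return subprocess.run(cmd, capture_output=True, text=True)

# 1. sysstat must be installed, enabled, and running, or there is no SAR data.
status = run(["systemctl", "is-active", "sysstat"])
if status.stdout.strip() != "active":
    print("WARNING: sysstat is not active; the sosreport will lack SAR data.")

# 2. Capture today's SAR data (CPU, memory, disk, network) to a file.
with open("sar-today.txt", "w") as out:
    subprocess.run(["sar", "-A"], stdout=out)

# 3. Remind the operator what else the support vendor will ask for.
print("Also collect: PostgreSQL logs, postgresql.conf / pg_hba.conf,")
print("and a sosreport (run 'sosreport --batch'; newer distros use 'sos report').")
```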
Another important piece of this is an architectural diagram. Do you have a connection pooler? How many clients talk to the database? These are important things to document in advance, so time is not wasted locating this information during an incident.
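For instance, the number of clients talking to the database can be recorded ahead of time with a quick query against pg_stat_activity; the sketch below is illustrative, and the connection string is a placeholder.

```python
"""Sketch: record how many clients are connected to PostgreSQL, grouped by
application, so the figure is already documented before an incident."""
import psycopg2

conn = psycopg2.connect("dbname=appdb user=postgres host=localhost")  # placeholder DSN
with conn, conn.cursor() as cur:
    cur.execute(
        "SELECT application_name, count(*) "
        "FROM pg_stat_activity GROUP BY application_name"
    )
    for app, n in cur.fetchall():
        print(f"{app or '<unnamed>'}: {n} connection(s)")
conn.close()
```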
Tip #2: Avoid Bottlenecks and Knowledge Silos
A system, whether distributed or monolithic, is made up of a variety of parts: the CPU, the disk, the RAM, the network, and the code itself. Each of these can be broken down further, based on caching and the seven OSI layers. Even if you’ve moved to the cloud, these things still exist, and when something goes wrong, knowing who on your team understands them at a basic level will be important. Your team is always going to know more about your environment than anyone else, so get your colleagues in other departments involved if you need to. Talk to the folks in networking or storage, and don’t be afraid to bring them to a call with your support vendor or cloud provider.
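If it helps, a first-pass look at those basic resources can be scripted before deciding which specialist to pull in; this sketch assumes the third-party psutil package is installed and is only a starting point.

```python
"""Sketch: a quick glance at CPU, RAM, disk, and network before escalating.
Assumes the third-party psutil package is available."""
import psutil

print(f"CPU:     {psutil.cpu_percent(interval=1)}% used")
mem = psutil.virtual_memory()
print(f"RAM:     {mem.percent}% used of {mem.total // (1024**2)} MiB")
disk = psutil.disk_usage("/")
print(f"Disk /:  {disk.percent}% used")
net = psutil.net_io_counters()
print(f"Network: {net.bytes_sent} bytes sent, {net.bytes_recv} bytes received")
```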
Tip #3: Use Lower Environments for Root Cause Analysis (RCA)
With the volume of data companies have, complete replicas of production are not always possible. Still, you need a place to test upgrades and changes. Unfortunately, many issues never get an RCA because they cannot be reproduced, and running production in DEBUG or TRACE for long periods is simply not an option.
That is why we recommend having a lower environment as close to production as possible. That means if your production is in AWS, one lower environment should also be in AWS. Multi-cloud solutions have their place, but using the same environment as both your failover solution and your testing environment is not a good idea. Your lower environment needs to be something you can break.
It is also useful to have a lower environment that is not in the cloud. That way you can more easily tweak settings. You may be able to point a cloud provider to a specific setting they should tweak, or you might find that certain workloads are not a good fit for the cloud.
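One practical payoff of a breakable lower environment is that you can turn verbosity up without touching production. A minimal sketch, assuming a Python service that reads its log level from an environment variable (the LOG_LEVEL name is just a convention used here):

```python
"""Sketch: drive log verbosity from an environment variable so a lower
environment can run at DEBUG while production stays at INFO or WARNING."""
import logging
import os

level_name = os.environ.get("LOG_LEVEL", "INFO").upper()
logging.basicConfig(
    level=getattr(logging, level_name, logging.INFO),
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)

log = logging.getLogger("app")
log.debug("verbose diagnostics, visible only when LOG_LEVEL=DEBUG")
log.info("normal operational message")
```

Run the lower environment with LOG_LEVEL=DEBUG while production keeps its normal setting.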
Tip #4: You Are Only as Strong as Your Weakest Link
Remember that you are only as strong as your weakest link. This means that you need to stay on top of your monitoring tool and OS upgrades. It also means it is exceedingly important to get off end-of-life software.
OpenLogic offers long-term support for AngularJS, Bootstrap, and CentOS, which buys you more time and protects you from medium- and high-severity CVEs. But ultimately, it’s our goal (and it should be yours) to get off EOL software. Our Professional Services team can provide guidance or help with the migration. And afterwards, we can provide technical support for the new technology.
Thanks for watching! Visit OpenLogic.com to learn about the open source technologies we support and our enterprise solutions.