
16 posts tagged with "log"


· 8 min read
Kevin Skrei

Lessons Learned From Logging

Intro

Writing software can be extremely complex. This complexity can sometimes make finding and resolving issues incredibly challenging. At Arccos Golf, we often run into these kinds of problems. One of the unique things about golf is that a round of golf is like a snowflake or a fingerprint: no two are alike. A golfer could play the same course every day without ever replicating a round identically.

Thus, trying to write software rules about the golf domain inevitably leads to bugs. And since no two rounds of golf are the same, trying to reproduce an issue a user encounters on the golf course is nearly impossible. So, what have we done to attempt to track down some of these issues? You guessed it: logging.

Logging has proven to be an indispensable tool in my workflow, especially when developing new features. This article walks through the key questions that shape a successful logging strategy, along with some considerations around performance. It concludes with a few case studies from Arccos Golf that demonstrate how logging has been instrumental in resolving real-world bugs.


Figure 1: The Arccos app showing several shot detection modes available (Phone & Link)

Should you log?

When trying to track down a bug or build a new feature, and you're considering logging, the first question to ask yourself is, “Is logging the right choice?”. There are many things to consider when deciding whether to add logging to a particular feature. Some of those considerations are:

  1. Do I have any other tools at my disposal that could be useful if this feature fails? This could be an analytics platform that shows screen views and perhaps captures metadata in another way besides traditional logging.
  2. Will adding logging harm the user in any way? This includes privacy, security, and performance.
  3. How will I actually get or view these logs if something goes wrong?

What to log?

Logging should be strategic, focusing on areas that yield the most significant insights.

  1. Identifying Critical Workflows: Determine which parts of your app are crucial for both users and your business. For instance, in a finance app, logging transaction processes is key.
  2. Focusing on Error-Prone Areas: Analyze past incidents and pinpoint sections of your app that are more susceptible to errors. For example, areas with complex database interactions or integrations with 3rd party SDKs might require more intensive logging.
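As a rough illustration of both points, here is a minimal Swift sketch of what logging around a critical workflow might look like. The `PaymentService` type, its subsystem and category strings, and the log messages are hypothetical examples, not taken from any real app:

    import OSLog

    // Hypothetical sketch: logging a critical workflow (a payment) and an
    // error-prone area (the backend call) with enough context to debug later.
    struct PaymentService {
        private let logger = Logger(subsystem: "com.example.finance", category: "Payments")

        func submit(amountCents: Int, transactionID: String) async {
            logger.info("Submitting payment \(transactionID), amount: \(amountCents) cents")
            do {
                try await sendToBackend(amountCents: amountCents, transactionID: transactionID)
                logger.info("Payment \(transactionID) accepted")
            } catch {
                // Error-prone area: network / 3rd-party integration.
                logger.error("Payment \(transactionID) failed: \(error.localizedDescription)")
            }
        }

        private func sendToBackend(amountCents: Int, transactionID: String) async throws {
            // Placeholder for the real network or SDK call.
        }
    }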

What About Performance?

One of the primary challenges with logging is its impact on performance, a concern that becomes more pronounced when dealing with extensive string creation. To mitigate this, consider the following tips:

  1. Method Calls in Logs: Be wary of incorporating method calls within your log statements. These methods can be opaque, masking the complexity or time-consuming operations they perform internally.
  2. Log Sparingly: Practice judicious logging. Over-logging, particularly in loops, can severely degrade performance. Log only what is essential for debugging or monitoring.
  3. Asynchronous Logging: If your logging involves file operations or third-party libraries, always ensure that these tasks are executed on a background thread, thus preserving the main thread's responsiveness and application performance.

Implementing these strategies will help you strike a balance between obtaining valuable insights from logs and maintaining optimal application performance. I have found that you develop an intuition about what to log the more you practice and learn about the intricacies of your system.
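As a small, hedged Swift sketch of tips 1 and 3 (the `expensiveDiagnostics()` helper and the queue label are hypothetical), you can guard costly log construction explicitly and keep file or SDK logging off the main thread:

    import Foundation
    import OSLog

    let perfLogger = Logger(subsystem: "com.example.app", category: "Performance")

    // Hypothetical helper that is costly to compute.
    func expensiveDiagnostics() -> String {
        // Imagine a large dump of internal state here.
        return "state dump"
    }

    // Tip 1: OSLog defers string formatting, but interpolated values are still
    // captured at the call site, so guard genuinely expensive calls explicitly.
    #if DEBUG
    perfLogger.debug("State dump: \(expensiveDiagnostics())")
    #endif

    // Tip 3: if you also write logs to a file or a third-party SDK, push that
    // work onto a background queue to keep the main thread responsive.
    let loggingQueue = DispatchQueue(label: "com.example.app.logging", qos: .utility)

    func logToFile(_ message: String) {
        loggingQueue.async {
            // Hypothetical file-appending code would go here.
            print(message)
        }
    }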

How Do I Access The Logs?

The most straightforward way to access your application's logs is to use a third-party tool like Shipbook, which offers the convenience of remote, real-time access to your logs.

Finally, I wanted to showcase a few stories illustrating how logging has helped us solve real-world production issues, along with some lessons learned about logging performance.

The 15-Minute Mystery

Our Android mobile app faced an intriguing issue. We noticed conflicting user feedback reports: one showed normal satisfaction, while another indicated a significant drop. The key difference? The latter report included golf rounds shorter than 15 minutes.

Upon investigating these brief rounds, we found that their feedback was much lower than usual. But why? There were no clear patterns related to device type or OS.

The trail of breadcrumbs started when we examined user comments on their rounds, many of which mentioned, "No shots were detected." Diving into the logs of these short rounds, a pattern quickly emerged. We repeatedly saw this line in the logs:

    [2023-12-01 14:20:09.322] DEBUG: Shot detected with ID: XXX but no user location was found at given shot time

This meant we had detected that a user took a golf shot, but we didn't know where on earth they were, so we couldn't place the shot at a particular location. This was unusual, because we had seen log lines like this in our location provider, which requests the phone's GPS location:

    [2023-12-01 14:20:08.983] VERBOSE: Received GPS location from system with valid coordinates

So, we were clearly receiving location updates at regular intervals, but we couldn't associate them with a shot when the user took one. After some further analysis, we discovered this line:

    [2023-12-01 14:20:09.321] VERBOSE: Attempting to locate location for timestamp XXX but requested source: “link” history is empty. Current counts: [“phone”:60, “link”:0]

We have a layer above our location providers that handles serving locations depending on which method the user selected for shot detection mode (either their Phone or their external hardware device “Link”). It was attempting to find a location for “Link” even though all of these rounds should have been in phone shot detection mode. Finally, we located this log line:

    [2023-12-01 14:14:33.455] DEBUG: Starting new round with ID: XXX and shot detection mode: Link … { metadata: { “linkConnected”: false, linkFirmwareVersion: null }... }

Once we analyzed this log line, it became immediately obvious: the app was starting the round with the incorrect shot detection mode. Some rounds were started with a shot detection mode of Link even when Phone was selected in the UI (Figure 2).


Figure 2: The Arccos app showing a round of golf being played and tracked

We eventually identified the issue: it was caused by changes in our upgrade-path code that affected users with certain firmware versions and prior generations of our Link product. Thankfully, this build was early in its incremental rollout and we were able to patch it quickly.

This experience highlighted the crucial role of widespread effective logging in mobile app development. It allowed us to quickly identify and fix an issue, reinforcing the importance of comprehensive testing and attentive log analysis.

When Too Much Detail Backfires

Dealing with hardware is especially difficult, given how rarely you can easily get information off of the device itself. We often rely on verbose logging during the development phase to diagnose communication issues between hardware and software. This approach seemed foolproof as we added a new feature to our app, implementing detailed logging to capture every byte of data exchanged with the hardware of our new Link Pro product. In the controlled environment of our office, everything functioned seamlessly in our iOS app.

While testing on the course, our app faced an unforeseen adversary: it began to get killed by the operating system. The culprit? Excessive CPU usage. Our iOS engineer, armed with profiling tools, discovered a significant CPU spike during data sync with the external device. Our initial assumption was straightforward: perhaps we were syncing too much data too quickly.

To test this theory, we modified the app to sync data less aggressively. This change did reduce app terminations, but it was a compromise, not a solution. We wanted to offer our users a real-time experience without interruptions. Digging deeper into the profiling data, we uncovered the true source of our problem. It wasn't the Bluetooth communication overloading the CPU; it was our own verbose logging.

The moment we disabled this extensive logging, the CPU usage dropped dramatically, bringing it back to acceptable levels. This incident was a stark reminder of how even well-intentioned features, like detailed logging, can have unintended consequences on app performance. We decided to use a remote feature flag paired with a developer setting to be able to toggle detailed verbose logging of the complete data transfer only when necessary.
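A minimal Swift sketch of that pattern might look like the following; the `VerboseLoggingGate` type and the flag sources are hypothetical stand-ins for however your feature-flag service and developer settings are exposed, not the actual Arccos implementation:

    import Foundation
    import OSLog

    let bleLogger = Logger(subsystem: "com.example.app", category: "BLE")

    // Hypothetical gate combining a remotely delivered feature flag with a
    // local developer setting hidden in a debug screen.
    struct VerboseLoggingGate {
        var remoteFlagEnabled: Bool       // value fetched from your feature-flag service
        var developerSettingEnabled: Bool // toggle in a hidden developer settings screen

        var isEnabled: Bool { remoteFlagEnabled && developerSettingEnabled }
    }

    func handleReceivedPacket(_ bytes: [UInt8], gate: VerboseLoggingGate) {
        // Always log the cheap, high-level event.
        bleLogger.debug("Received packet, \(bytes.count) bytes")

        // Only pay for the expensive byte-by-byte dump when explicitly enabled.
        if gate.isEnabled {
            let hex = bytes.map { String(format: "%02X", $0) }.joined(separator: " ")
            bleLogger.debug("Packet contents: \(hex)")
        }
    }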

Through this experience, we learned a valuable lesson: the importance of balancing the need for detailed information with the impact on app performance. In the world of mobile app development, sometimes less is more. This insight not only helped us optimize our Link Pro product but also shaped our approach to future feature development, ensuring that we maintain the delicate balance between functionality and efficiency.

Afterword

In conclusion, our experiences at Arccos Golf have demonstrated the invaluable role of logging in software development. Through it, we’ve successfully navigated the complexities of writing golf software, turning unpredictable challenges into opportunities for improvement. Tools like Shipbook have been instrumental in this journey, offering the ease and flexibility for effective log management. I hope I’ve illustrated that logging is more than just a troubleshooting tool; it's a crucial aspect of understanding and enhancing application performance and user experience.

· 9 min read
Nikita Lazarev-Zubov

iOS Log

Logging is the process of collecting data about events occurring in a system. It’s an indispensable tool for identifying, investigating, and debugging incidents. Every software development platform has to offer a means of logging, and iOS is no exception.

Being a UNIX-like system, iOS has supported the syslog standard for as long as it has been around (since 2007). In addition, since 2005 the Apple System Log (ASL) has been supported by all Apple operating systems. However, ASL isn't perfect: It has multiple APIs performing similar functions, it stores logs in plain text, and it requires going deep into the file system to read them. It also doesn't perform very well, because string processing happens mostly in real time. While ASL is still available, Apple deprecated it a few years ago in favor of an improved system.

Apple’s Unified Logging

In 2016, Apple presented its replacement for ASL, the OSLog framework, also known as unified logging. And “unified” is right: It has a simple and clean API for creating entries, and an efficient way of reading logs across all Apple platforms. It logs both kernel and user events, and lets you read them all in a single place.

Beyond its impressive efficiency and visual presentation, unified logging offers performance benefits. It stores data in a binary format, thus saving a lot of space. The notorious observer effect is mitigated by deferring all string processing work until log entries are actually displayed.

Let’s dive in and put unified logging to work.

Unified Logging in Depth

Writing and Reading Logs

The easiest way to create an entry using OSLog is to initialize a Logger instance and call its log method:

    import OSLog

    let logger = Logger()
    logger.log("Hello Shipbook!")

After running this small Swift program, the message "Hello Shipbook!" will appear in the Xcode debug output:


Figure 1: Xcode debug output

Since unified logging stores its data in binary format, reading that data requires special tools. This is why Apple introduced the brand new Console application alongside the framework. This is how the log message appears in Console:


Figure 2: The Console application for reading unified logging messages

As you can see, unified logging takes care of all relevant metadata for you: Human-readable timestamps, the corresponding process name, etc.

Another often underestimated way of reading logs is by means of the OSLog framework itself. The process is straightforward: You only need to have a specific instance of the OSLogStore class and a particular point in time that you’re interested in. For example, the code snippet below will print all log entries since the app launch:

    do {
        let store = try OSLogStore(scope: .currentProcessIdentifier)
        let position = store.position(timeIntervalSinceLatestBoot: 0)

        let entries = try store.getEntries(at: position)
        // Do something with retrieved log entries.
    } catch {
        // Handle possible disk reading errors.
    }

This might be useful in testing, or for sending logs to your servers.
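For example, here is a minimal sketch of turning the retrieved entries into plain strings that could be attached to a bug report or uploaded to a server; the filtering and formatting choices are my own, not something the framework prescribes:

    do {
        let store = try OSLogStore(scope: .currentProcessIdentifier)
        let position = store.position(timeIntervalSinceLatestBoot: 0)

        let messages = try store.getEntries(at: position)
            .compactMap { $0 as? OSLogEntryLog }
            .map { "[\($0.date)] \($0.composedMessage)" }

        // Write to a file, attach to a bug report, or send to your backend.
        print(messages.joined(separator: "\n"))
    } catch {
        // Handle possible disk reading errors.
    }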

Log Levels

For grouping and filtering purposes, logs are usually separated into levels. The levels signify the severity of each entry. Unified logging supports five levels, with 1 being the least problematic and 5 being the most severe. Here’s the full list of supported levels and Apple’s recommendations for using them:

  1. The debug level is typically used for information that is useful while debugging. Log entries of this level are not stored to disk, and are displayed in Console only if enabled.
  2. The info level is used for non-essential information that might come in handy for debugging problems. By default, the log messages at this level are not persisted.
  3. The default level (also called notice level) is for logging information essential for troubleshooting potential errors. Starting from this level, messages are always persisted on disk.
  4. The error level is for logging process-level errors in your code.
  5. The fault level is intended for messages about unrecoverable errors, faults, and major bugs in your code.

Beyond their use in classifying error severity, log levels have an important impact on log processing: The higher the level, the more information the system gathers, and the higher the overhead. Debug messages produce negligible overhead, compared to the most critical (and supposedly rare) errors and faults.

Here’s how different levels can be used in code:

    logger.log(level: .debug, "I am a debug message")
    logger.log(level: .info, "I am info")
    logger.log(level: .default, "I am a notice")
    logger.log(level: .error, "I am an error")
    logger.log(level: .fault, "I am a fault, you're doomed")

And this is how those entries look in Console:


Figure 3: Logs of different levels in Console

The debug and info messages are only visible here because the corresponding option is enabled. Otherwise, messages would be shown exclusively in the IDE’s debug output.

Subsystems and Categories

Logs generated by all applications are stored and processed together, along with kernel logs. This means that it’s crucial to have a way to organize log messages. Conveniently, Logger can be initialized using strings denoting the corresponding subsystem and the category of the message.

The most common way (and the method recommended by Apple) to denote the subsystem is to use the identifier of your app or its extension in reverse domain notation. The other parameter is used to categorize emitted log messages, for instance, “Network” or “I/O”. Here’s an example of a logger for categorized messages in Console:

    let logger = Logger(subsystem: "com.shipbook.Playground",
                        category: "I/O")


Figure 4: Log categorization in Console

Formatting Entries

Static strings are not the only type of data we want to use in logs. We often want to log some dynamic data together with the string, which can be achieved with string interpolation:

    logger.log(
        level: .debug,
        "User \(userID) has reached the limit of \(availableSpace)"
    )

Strictly speaking, the string literal passed as a parameter to the log method is not a String, it’s an OSLogMessage object. As I mentioned before, the logging system postpones processing the string literal until the corresponding log entry is accessed by a reading tool. The unified logging system saves all data in binary format for further use (or until it’s removed, once the storage limit is exceeded).

All common data types that can be used in an interpolated String can also be used inside an OSLogMessage: other strings, integers, arrays, etc.
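For instance, here is a short sketch mixing a few interpolated types (the variable names and values are hypothetical); numeric interpolations can additionally take formatting options:

    let userName = "appleseed"
    let retryCount = 3
    let progress = 0.6666

    logger.log(level: .info, "User \(userName) retried \(retryCount) times")
    logger.log(level: .info, "Sync progress: \(progress, format: .fixed(precision: 2))")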

Redacting Private Data

By default, almost all dynamic data—i.e., variables used inside a log message—is considered private and is hidden from the output (unless you’re running the code in Simulator or with the debugger attached). In Figure 5, below, the string value is substituted by “<private>”, but the integer is printed publicly.


Figure 5: Redacted private entry

Only scalar primitives are printed unredacted. If you need to log a dynamic value—like string or dictionary—without redacting, you can mark the interpolated variable as public:

    logger.log(
        level: .debug,
        "User \(userID, privacy: .public) has reached the limit of \(availableSpace)"
    )

Apart from public, there are also private and sensitive levels of privacy, which currently work identically to the default level. Apple recommends specifying them anyway, presumably to ensure that your code is future-proof.

In many cases, you will want to keep data private while still being able to identify it in your logs. This option could come in handy for filtering out all messages concerning the same user ID, for example, in which case the variable can be hidden behind a mask:

    logger.log(
        level: .debug,
        "User \(userID, privacy: .private(mask: .hash)) has reached the limit of \(availableSpace)"
    )

The value in the output will be meaningless, but identifiable:


Figure 6: Private data hidden under a mask

Performance Measuring

A special use case of unified logging is performance measurement, a function that was introduced two years after the system was first released. The principle is simple: You create an instance of OSSignposter and call its methods at the beginning and end of the piece of code that you want to measure. Optionally, in the middle of the measured code you can add events, which will be visible on the timeline when analyzing measured data. Here's how it looks put together:

    let signposter = OSSignposter(logger: logger)
    let signpostID = signposter.makeSignpostID()

    // Start measuring.
    let state = signposter.beginInterval("heavyActivity",
                                         id: signpostID)

    // The piece of code of interest.
    runHeavyActivity()
    signposter.emitEvent("Heavy activity finished running",
                         id: signpostID)
    finalizeHeavyActivity()

    // Stop measuring.
    signposter.endInterval("heavyActivity", state)

You can analyze this data using the os_signpost tool in Instruments:


Figure 7: Performance measurement using OSSignposter

Conclusion

Apple’s unified logging is both powerful and simple to use. As its name suggests, the system can be used with all Apple platforms: iOS, iPadOS, macOS, and watchOS, using either Swift or Objective-C. Unified logging is also efficient thanks to its deferred log processing and compressed binary format storage. It mitigates the observer effect and reduces disk usage.

Gathering logs using OSLog is a great option when you’re debugging or have access to the physical device. However, when it comes to accumulating logs remotely, you need a different solution. Shipbook can take care of your needs by allowing you to gather logs remotely. Shipbook offers a simple API similar to OSLog’s, and a user-friendly interface that helps you to observe and analyze collected data.

Importantly, Shipbook integrates with Apple’s Unified Logging System, enhancing your workflow by letting you utilize the familiar tools provided by Apple when your device is connected. This integration ensures a seamless transition between remote logging and local diagnostics, making Shipbook a versatile tool for developers who require both remote data collection and the capabilities of Apple’s native logging system.

· 13 min read
Yossi Elkrief
Elisha Sterngold

Yossi Elkrief

Interview with Mobile Lead at Nike, Author, Google Developers Group Beer Sheva Cofounder, Yossi Elkrief

Thank you for being with us today Yossi, would you like to begin with sharing a little bit about your position at Nike, and what you do?

I joined Nike a bit more than two years ago. I am head of mobile development in the Digital Studio of innovation. It is a bit different from regular app development, but we still work closely with all the teams at WHQ, Nike headquarters in the US, as well as in Europe, China, and Japan. We really work across the globe, and we do some pretty cool things in the realm of innovation. We develop new technologies and try to find ways to harness new technologies or algorithms to help Nike provide the best possible service to our consumers.

I have been working in mobile for the past 13, almost 14 years now. I've been involved in Android development since its very first Beta release, even a bit before that. I have also worked on iOS throughout the years, and I've been involved in a couple of pretty large consumer-based companies and startups.

At Nike we have a few apps, such as: Nike Training Club (NTC), Nike Running Club (NRC), and the Nike app made for consumers, where you can purchase Nike’s products.

We work with all of those teams and other teams within Nike, on various apps as well as in-house apps that are specific creations of our studio, where we work on creating new innovative features for Nike.

One major project that is currently completing its rollout is Nike Fit, recently launched in China and Japan. Nike Fit is aimed at helping people shop for Nike shoes online, and hopefully for other apparel in the near future.

How is it working for Nike, as a clothing company, with a background of working mainly for tech companies?

Nike is a company with so much more technology than people realize. We are not just a shoe company or a fashion company.

Our mission is to bring inspiration and innovation to every athlete¹ in the world.

We use a tremendous amount of technology to transform a piece of fabric into a piece in the collection of the Nike brand. Nike may be more faceforward than companies that I’ve worked for in the past, but there is a vast array of technologies that we work with in Nike, or work on building upon, to make Nike the choice brand for our customers, now and in the future.

One of the highest priorities at Nike is the athlete consumer, because Nike is a brand that is specifically designed and geared toward athletes. We therefore try to keep all of Nike's athletes at the forefront in terms of their importance to the company. Most of Nike's consumer-facing products are not the apps, so all of my previous experience in app companies or technical companies that provide a service is pretty different from what I focus on now at Nike. Everything we do at Nike, all the services we provide, is there to help serve athletes in their day-to-day activities, whether in sports for professional athletes, or for people with hobbies like running, football, or cycling, and so on.

Everything I focus on has to do with providing athletes with better service while choosing their shoes, pants, or all the equipment they need, and that Nike provides so they can best utilize their skills.

Can you tell us a bit about what went into writing your book “Android 6 Essentials”? Do you feel that writing the book improved your own skills as a developer?

I write quite a lot. I don't get to write as many technical manuals as I'd like, but I do write quite a lot of documentation, technical documents, and blog posts. Writing the book was a different process, but I really wanted to engage a technical audience, as this audience is very different from that of a poem or a story, which is less for use and more for enjoyment.

Writing the book made me a better person in general. I was working full time in my position at the company I was with at the time, and then, on top of all of my regular responsibilities, I had to keep to schedule and hit all of the milestones and points that I wanted to cover in my book. I had to be very organised and devoted to the project. I had to juggle work, family, and all of my other responsibilities as well, so I divided my time to make sure I could meet all of my goals. The process was really quite fun, because in the end I had something that I built and created from scratch.

I would recommend it, because it gives you an added value that no one else will have, and in the end you have a final product that you can show someone, and say that it was your creation. I think the whole process makes you a better developer, and it helps you understand technology better, because you need to understand technology at a level and to a degree of depth in order to then explain it in writing to someone else.

You also took part in co-founding Google Developers Group Beer Sheva, which is also about sharing knowledge and bettering yourselves as developers, can you tell us a little about that process?

The main aim of Google Developers Group is sharing knowledge. When we share knowledge we can learn from everyone. Even if I built each of the pieces of a machine myself, when I share it with someone else, they can always bring to light something that I was unaware of; some new and interesting way of using it. Sharing with people helps more than just the basic value of assistance. Finding a group of peers that share the same desire or passion for technology, knowledge, and information, this is a key concept in growth, for everyone in general.

On that note, we are seeing an interesting trend in development: even though mobile apps are becoming increasingly more complex, the industry has succeeded in reducing the occurrence of crashes. Is that your experience as well and if so, what are, in your eyes, the main reasons for this shift?

It’s really a two part answer.

Firstly, both Google and Apple are providing a lot more information, and are focusing a lot more on user experience in terms of crashes, app not responding, bad reviews etc. Users are more likely to write a good review if you provide more information, or create a better service with more value for them. Consumers in general are more interested in using the same app, the same experience, if they love it. So they will happily provide you with more information so that you can solve its issues, and keep using your app rather than trying something new. We call them Power Users or Advanced Users. With their help, we can keep the app updated and solve issues faster.

The second part of the answer is that all of the tools, IDE integrations, shared knowledge, and documentation have been vastly improved. People now understand that they need to provide a service that runs smoothly, with as little interference as possible for the user, and they do their part to make sure that such issues remain as rare as possible in their apps. We want a crash rate lower than 0.1%. So we spend 90% of our time building an infrastructure that will remain robust and maintain top quality, with a negligible amount of crashes, exceptions, and app issues in general that would harm and affect the user experience.

Do you believe that all bugs should always be fixed? If not, do you have ways of defining which ones do not need to be fixed?

As a perfectionist, yes, we want to solve all of the app's issues. But in real life, we work with a simple process. We look at the impact of the bug. How many users are being impacted? What is the extent of the impact? What does the user have to do in order to use the service? Is there a simple workaround, or does the bug prevent the user from using an important part of the app?

Do you close insignificant issues, or are they kept open in a back office somewhere?

No, we are very careful and organized about all of the issues that we have in the system. We document every issue with as much information as possible. Sometimes you fix an issue by updating a dependency and providing a new version of it, and then, because of all the interactions between code versions, other issues get solved even though you didn't do anything directly. It doesn't happen much, but sometimes we have issues in the backlog that remain unsolved for more than a month.

What is your view on QA teams? Some companies have come out saying that they don’t use QA teams and instead move that responsibility to the developer team. Do you believe that there should be a QA Team?

I believe that companies should have a Quality Assurance team, which is sometimes also called QE, Quality Engineering. As a developer working on various platforms, when you implement a new feature or service, depending on the architecture of the technology, the actual issue can be quite difficult to find. This requires a different point of view than the developer's. When you develop or write the code, you have a different point of view in mind than users often have when it comes to using the app. 90% of the time, users will behave differently than developers anticipated when writing the code. So when we design a feature, sometimes we need to understand a bit better how users will interact. We have a product team that we involve and engage on an hourly basis. The same goes for QA. We use QA in our Innovation Studio as well, and the same goes for our apps. We are constantly engaging QA to see how to both resolve issues and better understand how the user will interact with the app.

What is your position on Android Unit testing: How are the benefits compared to the efforts?

With testing in general, some will say it's not necessary at all and will just rely on QA. I don't side with either extreme; I think it is a mix. You don't need to unit test every line of code. I think that is excessive. Understanding the architecture is more important than unit testing. It's more important to understand how and why the pieces of the puzzle interact, and why to choose one flow over another, than to just unit test every function. Sometimes pieces of the puzzle are better understood with unit testing, but it is not necessary to unit test everything. That said, the majority of our code does undergo UI and UX testing.

What do you think about the fact that with Kotlin, you don’t state the exception in the function, this is unlike Java or Swift, which both require it. Which approach do you prefer?

I think for each platform there are different methods of working. Both approaches are fine with me. I think the Kotlin approach for Android, or for Kotlin in general, gives the developer more responsibility as to what can go wrong. You need to better understand the code and the reasons behind what can go wrong with exceptions when working with Kotlin. You can solve it using annotations and documentation, but in general people need to understand that if something can go wrong it will. They need to understand then how to solve it within the runtime code that they are writing, or building. If you are using an API, then API documentation will provide you with a bit more knowledge as to what is happening under the surface, and in terms of architecture, yes you need to know that when using an API function call or whichever function you are using within your own classes, you still need to interact with them properly, so it drives you to write better code handling for all exceptions.

Do you feel the fragmentation of devices or versions in Android is a real difficulty?

Yes, we see different behaviors across devices and different versions, and making sure that the app runs smoothly across all of them can be a bit rough. But even so, it is a lot better than what we had in the past. I hope that as time goes on, more and more devices will be upgraded to an API level that is safer to use, which will mitigate fragmentation. Right now, some of the features that we are building for API 24 and above, for example, show major progress in comparison to API 21 and above.

As a final question, which feature would you dream that Android or Kotlin would have?

I never thought of that. A week ago I would have said camera issues on Android, but a month ago I would have said running computer vision and AI on Android across different devices. Camera issues are due to different hardware. Google is doing a relatively good job of trying to enforce a certain level of compliance and testing on all devices. You have quite a few tests a device has to pass, both in hardware and in API. But we still see many devices attempt to bypass the tests, or give false results.

I would say giving us support for actual devices as far back as five to seven years, instead of three, and giving an all around better camera experience over all devices.

Thank you very much to Yossi Elkrief for your time and expertise!


Shipbook gives you the power to remotely gather, search and analyze your user logs and exceptions in the cloud, on a per-user & session basis.

Footnotes

  1. If you have a body, you are an athlete.

· 13 min read
Dotan Horovits
Elisha Sterngold

Dotan Horovits

Interview with Technology Evangelist | Podcaster | Speaker | Blogger at Logz.io

Dotan started his career as a developer, then moved on to become a System Architect and Director of Product Management. This gives him a broad understanding of various aspects of app development, which is extremely relevant for this blog, where we focus on app quality in a broad sense. Today, Dotan is a Technology Evangelist, Podcaster, Speaker, and Blogger.

On LinkedIn you present yourself as Principal Developer Advocate and Technology Evangelist. Can you describe what you mean by these titles?

Essentially what this means is that I help the developer community use the products at my company, Logz.io, as well as the broader open source projects in the domain, in my case the DevOps and observability domains, by sharing knowledge about tools, best practices, design patterns, etc. I also keep watch on the community pulse to make sure our own products meet the needs and desires of the community. I make sure it's a bidirectional channel, where I can bring the actual feedback and voice of our users into our company practice.

How do you see the place of logs in monitoring and solving issues?

Logs are an essential part of monitoring, obviously; they are maybe the oldest and most established method of understanding why things happen in our system, especially why things go wrong. This is due to the fact that the developer who wrote the specific piece of code essentially creates the output for what is going on in that piece of code. Logs are therefore a very rich and verbose source of information, and the basic foundation for monitoring. It is important to say that logs alone are not enough today. You need fuller observability. Logs are one essential pillar, but they are usually accompanied by metrics and traces as well. These are the classic pillars of observability from which, together with potentially additional signals, one can get the broader picture of what goes on in the system.

Where do you see logging platforms in 5 years?

Logging has traditionally been very focused on the human aspect of the users. As I said, the developer outputs what goes on in the business logic that he wrote, and it was usually output as plain text for humans to read. Plain text logs created for humans to read were unstructured and simply not scalable. These days, when you have such a massive amount of logs in distributed systems and microservices, we must use automated systems, and I foresee that these automated solutions will continue to grow and develop. People will increasingly use log management systems and automation to aggregate, analyze, and visualize log data. For that purpose we will see a growing shift from unstructured to structured formats, from plain text to machine-readable formats such as JSON. We will see more work toward log enrichment with metadata in order to attain better system insights, and more advanced centralized log collection, ingestion, indexing, and management. Instead of developers having to traverse hundreds of servers, there will be one centralized place where one can visualize, analyze, register, and create automatic alerts based on the logs. This shift is already underway and will accelerate.

Going beyond that, logging will become much more integrated with the other signals in the observability domain: correlation between logs and metrics, logs and traces, as well as traces back to logs, and by traces I am referring to distributed tracing. This is another way in which I foresee logs being integrated with other signals and events.

The final direction in which I see logging developing is through AI and machine learning. I believe they will take an increasingly larger role in analyzing and processing logs efficiently and proactively, because, as opposed to the constraints people face when attempting similar processes, AI and machine learning systems have the power to develop and advance incomparably faster.

I know that you are a big champion of Open Source. On your LinkedIn profile you write that your passion is: Tech, Innovation, and the power of communities & open source. I would like to hear your reasoning behind this, but maybe you can start by talking about what limitations you experience with Open Source.

I am a great advocate of Open Source. It's hugely popular. Many developers already have the skill sets, so you have lots of users who are very familiar with the platforms, and a massive community pushing it forward. There are, however, several issues with it. The first thing that comes to mind is a common misconception: people often think Open Source is equivalent to free. It is not zero cost. You may not need to pay for a software or service license, but you still need to invest in developers and DevOps to install, operate, manage, and scale it. It is important to understand this. Many times new startups will begin with Open Source, as it is relatively manageable and cheap; however, as they grow they realize that the do-it-yourself model of Open Source is not so trivial once the organization scales out.

Another thing that I often encounter with individuals who are just starting to use Open Source is that, as they adapt it for use in production, they realize that Open Source projects are really focused on one core capability. They put less focus on peripheral or administrative capabilities such as user and role management, team collaboration, single sign-on (SSO), easy deployment, and so on. I call these features commercial grade or enterprise grade; they are not usually an inherent part of Open Source software, but they are needed to run things in production environments. Some vendors develop these additional capabilities on top of the open source project: Grafana Labs, Elastic, or my company Logz.io develop an additional wrap-around for Open Source to provide another layer of enterprise-grade capabilities that is applicable for businesses in production as well.

Recently there has been a security breach with log4j, which is also based on Open Source. Is Open Source less safe than Closed Source?

The Log4j CVE was definitely painful for many of us, including for us at my company. Many Java products are based on log4j, but we are happy to say that Logz.io is based on Logback, so we are not actually users of log4j; for us specifically it was pretty smooth, all things considered. Then again, the claim that Open Source is less safe is one that I would have to disagree with, in principle. I believe the contrary is true: Open Source is safer than closed source. At least the popular Open Source projects are less prone to bugs and security vulnerabilities than closed source. That is thanks to the transparency and involvement of the vast user base and developer community that are constantly detecting, advancing, and fixing issues together. I understand that log4j had this bug for quite some time and no one detected it. However, in commercial products the bugs are no less severe; perhaps they are much lower profile only because they are being detected and managed behind closed doors, whereas log4j was Open Source. For the known CVEs out there, I think over 90% already have patches and releases, so the turnaround of the community is often much faster than that of commercial products. I definitely feel very comfortable with the security of open source. As a developer community and a community of software owners, we must remember that at the end of the day it is our responsibility to manage dependencies and control them, just as we need to do for licensing, because it is ultimately our architecture and thus our responsibility to mitigate all this.

Elasticsearch has changed their licensing model. What does it mean for the user community?

Elasticsearch has become widely popular for log management and analytics, open source under the Apache 2.0 license. Less than a year ago, Elastic B.V., the company that owns and backs the elasticsearch project, decided to relicense elasticsearch and Kibana under a non-open-source license. It is actually a dual license, SSPL and the Elastic License; the bottom line is that it is a non-OSI license (OSI being the organization that essentially defines and validates open source licensing). For users, this essentially means that those who were counting on the open source being something they would be free to modify and adapt to their needs no longer have that option. Some users are fine using it as is, closed source. For others, it is important to have the flexibility and freedom to make these sorts of changes. For those who care about continuing to develop and better the software through the community, this is no longer possible for elasticsearch and Kibana.

The other pieces of the ELK Stack, the client libraries and the Beats, for example filebeat, metricbeat, logstash and so on, are still Apache 2.0. However, it is important to note that even there they inserted all sorts of version checks to verify the version of the elasticsearch cluster to which they ship logs. This is a breaking change: if you don't work with an approved distro, or even with an older open source elasticsearch cluster (Elasticsearch 7.10 or earlier), it may break, so you won't be able to actually ship your logs to your elasticsearch distro anymore.

So, it is a problem, and the community has had adverse reactions. After seeing this, the community gathered around a fork of elasticsearch and Kibana called OpenSearch, which is backed by significant vendors, the most prominent being AWS, or Amazon Web Services. My company Logz.io is part of this endeavor, whose goal is to keep elasticsearch and Kibana open source. OpenSearch is really an independent project that started off as a fork, and it is also a very interesting path forward for those who like elasticsearch but wish to continue using open source.

You work at Logz.io, can you tell us what Logz.io offers?

Logz.io offers a cloud native observability platform that is based on popular open source tools. These include elasticsearch, as we mentioned, now OpenSearch, Jaeger, Prometheus, OpenTelemetry, and others. Essentially, we provide you with the ability to send your logs, metrics, and traces into one managed central location. There you can search, visualize, and correlate across your different telemetry data. So you can correlate your logs with your metrics, your traces with your logs, and so on in one central place. As I said it’s all based on open source so if you are already familiar with the Kibana UI or with Jaeger UI or with Prometheus and Grafana, you won't need to learn some proprietary UI or APIs or formats or query languages, it’s all based on open source formats that you already know and are familiar working with.

Where do you see product quality going in the future?

I think the most interesting paradigm I see is the shift of responsibilities towards engineers. Engineers are owning more of their respective features, all the way to production deployment and beyond. This means we need to make it easy and accessible for engineers to understand the piece of software they wrote and how it will behave in production, and then, once it's in production, to be able to monitor it and make sure the quality of the product is maintained. This is why I am very passionate about observability. I believe that observability will play the role of the hero in enabling engineers to have at their fingertips the power to understand the telemetry and state of the system at any given time. They will be able to understand trends over time, even historically. By the way, at Logz.io we understand that you cannot separate the issue of security from dev and DevOps practices; these are currently being called DevSecOps, and they are becoming very integrative. So it goes even down to understanding how my system is vulnerable, and including all of this under one single pane of glass. This is how I see enabling product quality, starting from engineering and going all the way to production and beyond.

Do you find that there is a difference in user sensitivity towards poor product quality between different age groups or across different countries? Has it developed over time?

I think we as consumers have grown used to better user experience and new services. If you look at SaaS services, on-demand, PLG (product-led growth) offerings such as Netflix, etc., this user experience as consumers has educated us to expect a certain level of quality from the products that we use, even in our business interactions. This is why I believe that younger generations, who are more exposed to newer SaaS offerings, are more informed, educated, and accustomed to these high levels of service. However, in general we as a population are becoming much keener in our expectation to receive a high level of quality in our user experience, and this is something that will become very prominent. This is a nice segue to mention my podcast, OpenObservability Talks, because I am actually going to have an episode this week devoted to SaaS observability, which will discuss how SaaS observability is key to enabling this level of quality of experience with SaaS products.

Do you have a dream tool that hasn’t been developed yet?

Well, I have many but I guess the one that is most relevant for these cold winter days that we are having here is a tool that will be able to turn on the boiler on time so that I can have a hot shower when I get home. Certainly for my children, this would be the one that would have the most impact these days. Perhaps an app that would be able to help you detect and avoid exposure to the areas with the highest Covid levels. These are the two tools that would be most useful these days.

It was great speaking to you, you gave us an eye-opening perspective on logs as well as on the subject of open source software. As a developer I have always appreciated open source, and as such Shipbook is open source on all levels, including our SDK, so I found your input very interesting. Thank you for your time.


Shipbook gives you the power to remotely gather, search and analyze your user logs and exceptions in the cloud, on a per-user & session basis.

· 11 min read
Yair Bar-On
Elisha Sterngold

Yair Bar-On

Interview with Yair Bar-On | Serial entrepreneur whose company, TestFairy, was acquired earlier this year by Sauce Labs

Yair Bar-On is a serial entrepreneur whose company, TestFairy, was acquired earlier this year by Sauce Labs. TestFairy is a mobile beta testing platform, handling app distribution, session recordings, and user feedback, in enterprise and secure environments.

Beta Testing is an important part of overall app quality and therefore we are very pleased that you agreed to join us for this interview for our blog, which focuses primarily on different aspects of app quality.

Please tell us about the background and history of TestFairy.

TestFairy was founded by myself and my co-founder, Gil Megidish, back in 2013. We started the company after developing mobile apps in our previous company, where we had some quality problems that we needed to solve. We had customers in the field with problems and we tried to solve those problems remotely, but we had no idea what was happening over there; that is how TestFairy began, by solving a real problem. Today TestFairy helps large companies manage their mobile development lifecycle. We do quite a few things across the development lifecycle. It starts with app distribution, which is the process of helping developers put their apps in the hands of users and helping them with their beta testing. We can then record videos of exactly what your users are doing with your app. We have in-app feedback capabilities, allowing users to report bugs from the field, and in production we help customers with Remote Support and Remote Logging. So, in other words, with TestFairy you can have testers or real customers use your app, and you'll be able to solve problems and have bugs reported through a very user-friendly, easy interface. All this is just the beginning. We have recently joined Sauce Labs, which is the leading provider of quality software and digital confidence. Sauce has a very powerful service providing developers with real mobile devices in the cloud that you can use for your testing, as well as emulators and simulators that you can use to test your apps automatically. TestFairy joined Sauce Labs to enrich the offering for mobile developers.

After getting acquired by Sauce Labs, you went from being a small independent company to being part of a much larger organization. What influenced your decision?

Right at the beginning of our talks, Sauce Labs inspired us with their vision. Sauce has acquired five companies over the last year, and you can only guess that more awesomeness is coming. The idea is to expand and become a one-stop shop for all your digital quality needs. The world relies on code, and you need to make sure your apps, websites, and digital services are all properly functioning. Businesses need digital confidence. Your product needs to work 100 percent of the time. You can't have your app at a B rating; anything less than excellent is just not acceptable. What Sauce does is help you test your digital assets, end to end, in a very efficient way, allowing you to test your mobile apps on real devices, on emulators, and on simulators, as I previously mentioned.

We have recently acquired API Fortress, which is now a very important part of the Sauce platform. It allows you to test the server APIs, the ones that are used to communicate with your apps, so that your testing is no longer limited to the front end alone.

Another company that joined Sauce is Backtrace, which is the leading crash reporting and error handling solution for mobile, web, and gaming consoles. Many of the biggest gaming companies in the world use Backtrace. Crashes are not just about getting a stack trace and fixing a bug. You need to see trends and real-time information that can help you understand your users and where they are encountering difficulties. Taking that information and connecting it to your pre-release testing is super powerful, and this is what companies are looking for. Imagine you are able to look at a crash and a minute later fix your tests to make sure that this crash will not happen again. All that can be done in real time, not to mention AI and machine learning capabilities.

Another service recently acquired by Sauce is AutonomIQ, which is a low-code environment that allows people who are not developers to build test automation for services that are used by their company, like Salesforce. The people who write and build those tests don't have to be engineers; they can be analysts, they can be product people. They don't need to learn how to code, they just need to know their systems.

The last one I'll mention, which was chronologically the first acquisition, much earlier on, is Sauce Visual. The Sauce visual testing platform allows you to test how your product looks across versions, environments, platforms, and devices, so that you can look for visual problems during your regular testing.

Looking at Sauce’s impressive portfolio, you can understand why I mentioned that this was inspiring.

Are there any specific issues when it comes to mobile app quality or do you see mobile apps the same way as all other kinds of software?

Absolutely. I'll start by saying that, by definition, mobile is remote. If there's an issue with your app, it is always happening somewhere else, not at your desk or on your screen, so you can't open a new window and inspect an element to see what is happening, the way you would in Chrome. Issues happen on a mobile device somewhere, in the hand of a person that in many cases you don't know and cannot communicate with, and under these conditions you need to start solving a problem. That is the beginning. Second, the mobile environment is extremely fragmented. You have so many Android devices and so many iOS devices, across hardware and software configurations, different OS versions and locales, and this is before we get to the mobile device itself, which needs reception, wifi, battery, and more. Mobile is just so much more complicated than web or desktop development. Whenever something goes wrong, the person reporting the bug has a lot of work. They need to take screenshots, pull logs from their devices, and explain what they did. These are some of the things we are able to solve at TestFairy that help mobile testing be done properly. Together with what Sauce Labs has built, we have a powerful platform that makes developers' lives easier.

How do you see the relationship between Beta Testing and QA? Can good Beta Testing make companies save on their QA?

There are lots of stages in quality. First, we have to split the subject of QA into two parts: there is automation, and there is testing done by real people. The world of automation is very important for any company we speak to. Companies test their apps with many technologies, including Espresso, XCUITest, or Appium, or on the web with Selenium and other tools. I don't think you'll find companies that don't do any automation. On the other hand, you have manual testing done by real people, which is not going anywhere, and this incorporates a number of stages. Some companies look at alpha, beta, delta testing, and so on. Others will look at QA and dogfood. These are all different stages that are done by different stakeholders in the organization. The typical QA will be done by QA professionals, people who work inside the company and whose profession is quality. They will test your app or your product manually and automatically. Their job is to find problems.

Then there is beta testing, or as we call it "dogfood"; some companies call it "catfood", or even "monkeyfood", there are lots of names for "dogfooding". "Drink your own champagne" is another term that is very popular. These are terms for companies with a process by which company employees use the company app in order to look for problems before the app is released to production. We work with many of the largest companies in the world and help them with their dogfood process. One example that I always like to mention is Groupon. Groupon has thousands of employees who use TestFairy to manage their worldwide internal beta testing. They actually call it "catfood". Their internal process allows any Groupon employee in the world to report bugs directly to the mobile team in Chicago, in real time. All they need to do is shake their phone, sketch on a screenshot, and hit the send button. That bug report goes directly to Jira, which is used by the R&D team, and they get bugs in real time. This allows people to report bugs in a much more convenient way than before, and it allows developers to see and fix bugs faster.

It is very important to make bug reporting easy. We spoke to lots of companies and asked users if they had a case where they found a bug during internal testing that was not reported, and then the same bug showed up in production. The answer was yes in many cases. When we asked them why they didn't report the bug, the answer was either that it was too complicated to report, or that they didn't have time and couldn't be bothered with conference calls or follow-ups, or that they didn't want to report the bug at the time since they were sure it was already known. Then we showed them TestFairy, and for them it was just a life saver, because reporting a bug takes seconds. All you need to do is shake your phone and hit send, and the bug will appear in Jira.

What is special about beta testing is that you can do it with people who are not technically savvy. You can do it with literally anyone: people in finance, the warehouse manager, the operations department, whomever. They are not technically trained, and you wouldn’t necessarily talk to them about your mobile app development, but they think like your customers, they work in the company, and they want your company to succeed. If you let them help you test your app, they will very likely be helpful in finding problems and be willing to report them in real time.

There is a lot of shifting left. Developers do more, and we also see a lot of quality work moving into production. One of the trends I think we’ll see more and more is quality signals from production feeding back, automatically, into the development process and the automated testing done pre-release. We hear that from many Sauce customers, and we see it in the questions we get from our largest clients as they build their development process for the years ahead. I think this will be a significant trend that we will see evolve.

What is the importance of logging in Beta Testing?

TestFairy can collect logs from remote devices. So if your app writes logs and you want to send them to your central logging server, TestFairy can help with that. Some of the biggest players in logging are Sumo Logic, Splunk, Coralogix, Log4J, Logz.io, and many others. If you want the logs from all your remote mobile apps to be sent to your logging server, you can do that with TestFairy.

So TestFairy is an intermediary between the mobile apps and the cross-platform logging service providers?

Correct.

Do you have a dream tool that hasn’t been developed yet?

Yes… but I can’t tell you about it. Because if I tell you, it won’t be a secret any more :)

True… true… Thank you very much for your time and for sharing your expertise with us; it was very eye-opening. You gave us a different perspective on the ways in which companies, and especially larger companies, ensure their apps maintain high quality.


Shipbook gives you the power to remotely gather, search and analyze your user logs and exceptions in the cloud, on a per-user & session basis.

· 7 min read
Elisha Sterngold

There are three main methods for getting feedback on mobile app behavior: capturing crashes and non-fatal exceptions, capturing user and system events, and generating logs. Each method on its own provides just one piece of the puzzle. It is only when they are all used together in an integrated manner that developers can effectively deal with the exceptions, errors, and issues experienced by the app users.

However, in the production environment, logs are generated on the device that is running the app and are thus not readily visible to the developer. In this blog post, we give hands-on examples of why and how to add the third, crucial pillar to complete the analysis and fixing of app issues in the production environment: logs.
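To make that concrete, here is a minimal Kotlin sketch (with hypothetical class and method names, not code taken from the post) of how a single workflow can emit both an event-style breadcrumb and a logged non-fatal exception via android.util.Log. A log-management SDK would typically ship lines like these off the device so they become visible to the developer in production.

```kotlin
import android.util.Log

// Hypothetical checkout flow, used only to illustrate the feedback pillars:
// breadcrumb events, non-fatal exceptions, and plain log lines.
class CheckoutProcessor {

    fun submitOrder(orderId: String, amount: Double) {
        // Breadcrumb: records that the user entered this workflow.
        Log.i(TAG, "submitOrder started: orderId=$orderId, amount=$amount")

        try {
            chargeCard(orderId, amount)
            Log.i(TAG, "submitOrder succeeded: orderId=$orderId")
        } catch (e: Exception) {
            // Non-fatal exception: the app keeps running, but the log preserves
            // the context that a crash report alone would never capture.
            Log.e(TAG, "submitOrder failed: orderId=$orderId", e)
        }
    }

    private fun chargeCard(orderId: String, amount: Double) {
        // Payment logic would live here; omitted for brevity.
    }

    companion object {
        private const val TAG = "Checkout"
    }
}
```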

· 4 min read

Logs Help Solve Mysterious Car Accidents

tesla automatic car image (Image credits: https://www.tesla.com)

Let’s take a look at a speeding car, minutes before the accident. The vehicle is traveling at high speed, and the driver starts to depress the brake pedal. At what point is the automatic emergency braking function activated? If not immediately, when does it take effect? And how much does it actually slow the halting car?
Tesla’s automated driving system was first introduced in 2014, and as of May 2021 the advanced autopilot system had been released to a few thousand Tesla owners for testing. Since then, there have been a number of collisions and accidents that have put tremendous pressure on the company, from the public as well as from family members of the victims. Luckily, or rather, thanks to responsible company practice, Tesla’s cars log every drive, so each action and reaction of the car’s systems is uploaded to the cloud. This makes finding the cause of an accident as simple as clicking a button.

· 7 min read

What Did GDPR Change?

gdpr protection cloud image

Prior to the GDPR coming into effect, a directive governed what companies could and could not collect about their customers and users. The issue with a directive is that it allows each EU member state to adopt and adapt it to fit its needs. The GDPR, in comparison, must be applied in its entirety by all EU member states. It also applies to companies located outside the EU that have activity within the EU. In short, the ratification of the GDPR has made data protection more expansive, up-to-date, non-negotiable, and compulsory.

· 4 min read

captain work from home corona image

Challenging Times Call for Extraordinary Measures

Times are challenging, and in many parts of the world even frightening, both in terms of health and of the economy.

Almost everything seems to have stopped. International flights certainly have. Many stores, building projects, hairdressers, and so on are unable to continue their business for the moment. As with so many other things, the standstill is spreading like ripples in water.

We are all hoping and waiting for this to pass so we can start working as usual again. One big question, of course, is what we do in the meantime. Only a few companies dare to develop new products and services during such times. The risks are high, and many find it better to sit tight for the time being.

· 6 min read
Avishag Sterngold

How to Protect the Customer’s Privacy Whilst Fixing Problems in a Mobile App?

This blog is not intended as legal advice

GDPR and logs image

Are you the type of person who likes everyone to know everything about them? Or are you the type who is careful that all information about them is deleted?

The truth is that it doesn’t matter much which group you belong to. What matters more is that you are someone who understands that mobile logs contain important information gathered from the user – and it’s not just ‘any’ information.

· 4 min read
Elisha Sterngold

Is Crash Reporting Enough?

crash reporting is not enough gif

In the mobile app development world it is very common to use Crash Reporting Tools. But it is much less common to use Log Management Tools.

Developers view a crash in the app as a top priority that has to be fixed.

There is no doubt that a crash is very disruptive to the user experience of the app. But it isn’t the only thing that disrupts the user experience.

· 5 min read
Elisha Sterngold

History

logs in a pile debugging vs logging image

In the early days of computer programming there were no good debuggers that let you step through a program, so the easiest way to see what was happening was to print information to the console.

Even when better debuggers became available, they were not well integrated into an IDE, so it was often easier to just write a printf and get the information through the console. And once the developer has already written many printfs/logs, why not save the console output so that this information is always available when needed?

· 7 min read
Elisha Sterngold
Eran Kinsbruner

mobile error log

Intro

In the software development world, when developers want to understand why something doesn’t work in a program they developed, the first place to look is the logs. The logs give the developer a “behind the scenes” view of what happened in the code while the program ran.

The question is why, in mobile app development, developers aren’t using app logs to analyze app issues. Some even remove them before production; on Android, for example, ProGuard is commonly used to strip the logs.
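The post refers to ProGuard; as a hedged illustration of the opposite idea in plain Kotlin (my own sketch, not the post’s approach), a small wrapper can silence debug chatter in release builds while keeping warnings and errors for production analysis. `AppLog` and the package name are made up; `BuildConfig.DEBUG` is the debug flag the Android Gradle plugin generates per module.

```kotlin
import android.util.Log
import com.example.myapp.BuildConfig // hypothetical application package

// Illustrative logging wrapper: debug output only in debug builds,
// warnings and errors always kept for production analysis.
object AppLog {

    fun d(tag: String, message: String) {
        if (BuildConfig.DEBUG) Log.d(tag, message)
    }

    fun w(tag: String, message: String) {
        Log.w(tag, message)
    }

    fun e(tag: String, message: String, throwable: Throwable? = null) {
        if (throwable != null) Log.e(tag, message, throwable) else Log.e(tag, message)
    }
}
```

With this pattern, a call such as `AppLog.d("Sync", "payload received")` simply does nothing in release builds, without any build-time stripping.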

· 3 min read
Elisha Sterngold

Log Severity Intro

log severity ruler gif

This subject may sound boring. Don’t all programmers know which log severity should be used? The answer is that many people are not logging their apps in a systematic way. There are several guides that explain the levels, but they usually just define them. I’ll try to help you decide which log level should be used, and I’ll give examples you can copy into your app. I’m going to go through the log severities used in Android.
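As a rough preview of those levels (a minimal sketch of my own, not the examples from the post), this is how the standard android.util.Log priorities line up, from least to most severe; the messages are invented purely for illustration.

```kotlin
import android.util.Log

// The standard android.util.Log priorities, from lowest to highest severity.
fun logSeverityExamples(tag: String) {
    Log.v(tag, "Entering refreshFeed(), 12 items cached")          // VERBOSE: fine-grained tracing
    Log.d(tag, "Network request completed in 240 ms")              // DEBUG: diagnostic detail
    Log.i(tag, "User profile sync finished")                       // INFO: normal milestones
    Log.w(tag, "Response was slow, falling back to cached data")   // WARN: unexpected but recoverable
    Log.e(tag, "Failed to persist profile to local database")      // ERROR: a real failure
    Log.wtf(tag, "Profile list is null after successful parse")    // ASSERT: should never happen
}
```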