Lessons in Digital Entropy from Half a Million Image URLs
| Speaker | Pierre Hugo |
|---|---|
| Track | Data Science and Engineering |
| Type | Regular talk (45 minutes) |
Abstract
What if we could measure the internet's decay rate? Using production data from 494,000 real-world image URLs, we've quantified how fast web content rots.
Our analysis reveals that web images follow predictable decay patterns with measurable half-lives: fresh URLs can be fetched successfully 70% of the time, but that success rate drops to 30% after just two years.
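As a rough illustration of what a half-life means for these figures, the sketch below assumes a single pure exponential passing through the two quoted points (70% at age zero, 30% at two years). The models actually fitted in the talk may differ, and the function names are hypothetical. Under that assumption, the implied half-life is roughly 1.6 years.

```python
import math


def half_life(rate_start: float, rate_end: float, elapsed_years: float) -> float:
    """Half-life of a pure exponential that falls from rate_start to rate_end.

    Assumption: success(t) = rate_start * exp(-k * t), with no floor or plateau.
    """
    decay_constant = math.log(rate_start / rate_end) / elapsed_years
    return math.log(2) / decay_constant


def success_rate(age_years: float, rate_start: float, half_life_years: float) -> float:
    """Predicted success rate at a given age under the same exponential model."""
    return rate_start * 0.5 ** (age_years / half_life_years)


# The two figures quoted in the abstract: 70% when fresh, 30% at two years.
t_half = half_life(0.70, 0.30, 2.0)
print(f"implied half-life: {t_half:.2f} years")
print(f"predicted success at 1 year: {success_rate(1.0, 0.70, t_half):.2f}")
```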
We quantify how content accessibility changes over time, revealing exponential degradation curves that follow consistent mathematical models. More surprisingly, through systematic testing of recovery strategies, from smart retries to header spoofing, we found that many "dead" images can be resurrected with the right techniques.
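One way to picture a recovery strategy that escalates from a plain retry to header spoofing is sketched below. It is a minimal illustration, not the pipeline described in the talk: the header profiles, the `fetch_with_escalation` name, and the injected `fetch` callable are all assumptions. The fetcher is passed in so the example runs without network access.

```python
import time
from typing import Callable, Mapping, Optional, Sequence, Tuple

# Hypothetical header profiles, ordered from plain to most browser-like.
HEADER_PROFILES: Sequence[Mapping[str, str]] = (
    {},
    {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"},
    {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64)",
        "Referer": "https://example.com/",
    },
)

# A fetcher takes (url, headers, timeout) and returns an HTTP status code.
Fetcher = Callable[[str, Mapping[str, str], float], int]


def fetch_with_escalation(
    url: str,
    fetch: Fetcher,
    profiles: Sequence[Mapping[str, str]] = HEADER_PROFILES,
    timeout: float = 5.0,
    backoff: float = 0.0,
) -> Tuple[Optional[int], int]:
    """Try each header profile in turn.

    Returns (200, attempts) on the first success, or (None, attempts)
    if every profile fails.
    """
    attempts = 0
    for index, headers in enumerate(profiles):
        attempts += 1
        try:
            status = fetch(url, headers, timeout)
        except OSError:
            # Network-level failure (timeouts, refused connections, ...).
            status = None
        if status == 200:
            return status, attempts
        if index < len(profiles) - 1:
            time.sleep(backoff * 2 ** index)  # exponential backoff between tries
    return None, attempts


def fake_fetch(url: str, headers: Mapping[str, str], timeout: float) -> int:
    """Stand-in server that rejects requests lacking a Referer header."""
    return 200 if "Referer" in headers else 403


result = fetch_with_escalation("https://example.com/a.jpg", fake_fetch)
print(result)
```

With the requests library, a real fetcher could be as small as `lambda url, headers, timeout: requests.get(url, headers=headers, timeout=timeout).status_code`. Because requests' exceptions derive from `OSError`, the same `except` clause would catch them. Counting attempts alongside the outcome is what makes a cost-benefit comparison between strategies possible.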
We'll demonstrate how Python's requests library combined with statistical analysis revealed the optimal timeout rules, quantified the cost-benefit tradeoffs of different retry strategies, and even uncovered historical bugs through "digital archaeology" of failure patterns.
This talk combines practical engineering insights with data science methodology, showing how production systems can be designed around these newly-quantified decay patterns.
Attendees will learn not just why their image URLs from "the wild" break, but when, how often, and what can be done about it. These investigations have directly improved Morphic's image scraping performance, and the methodologies we developed offer actionable insights for anyone dealing with web content reliability.
