Nearly 12,000 API keys and passwords were discovered in the Common Crawl dataset, which is used to train AI models, raising concerns about insecure coding practices. Researchers found 11,908 valid secrets after examining 400 terabytes of data from billions of web pages. Among these were AWS and MailChimp keys, often hardcoded into HTML and JavaScript. Exposed keys of this kind could be misused for phishing and data exfiltration. The study highlights how difficult it is to remove sensitive information from large datasets, even with pre-processing.
Nearly 12,000 API Keys and Passwords Found in AI Training Dataset
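
To illustrate how keys hardcoded into HTML and JavaScript can be spotted, here is a minimal Python sketch of a pattern-based scan. It is not the researchers' tooling. The two regular expressions rest on assumptions about commonly documented key formats (AWS access key IDs starting with `AKIA` or `ASIA` followed by 16 uppercase alphanumerics, and MailChimp keys made of 32 hex characters plus a `-usNN` data-center suffix), and the function name `find_candidate_secrets` is hypothetical. A match is only a candidate: confirming that a secret is valid, as the study did, requires a separate check that this sketch does not attempt.

```python
import re

# Assumed formats, for illustration only; real scanners use many more rules.
PATTERNS = {
    # AWS access key ID: "AKIA" or "ASIA" followed by 16 uppercase alphanumerics.
    "aws_access_key_id": re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b"),
    # MailChimp API key: 32 lowercase hex characters, then "-us" and 1-2 digits.
    "mailchimp_api_key": re.compile(r"\b[0-9a-f]{32}-us[0-9]{1,2}\b"),
}


def find_candidate_secrets(text):
    """Return (label, matched_string) pairs for every pattern hit in text."""
    findings = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            findings.append((label, match.group(0)))
    return findings


if __name__ == "__main__":
    # Placeholder values, not real credentials.
    page = """
    <script>
      const awsKey = "AKIAIOSFODNN7EXAMPLE";
      const mcKey = "0123456789abcdef0123456789abcdef-us12";
    </script>
    """
    for label, value in find_candidate_secrets(page):
        # Print only a prefix so the scan itself does not leak the secret.
        print(f"{label}: {value[:8]}...")
```

The same scan works in reverse for developers: running it over front-end code before publishing can catch a hardcoded key before it ever reaches a public page.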
