Machine Learning models, in particular Deep Neural Networks (DNNs), have expensive design and development processes, requiring substantial resources (e.g., data collection, long training times, and significant computing capacity) to produce a high-quality model. This makes DNNs valuable Intellectual Property (IP) and an attractive target for actors with an incentive to steal a model in order to benefit from the work of its owners.
In recent years, various methods have been proposed to detect models obtained illegitimately from their owners; however, these methods have been shown to lack satisfactory robustness against model extraction attacks.
In this talk, we present an adaptive watermarking framework for machine learning models that leverages the unique behavior a protected model exhibits as a result of the random seed used during its training. This watermark is used to detect extracted models, which inherit the same unique behavior, indicating unauthorized use of the protected model's IP.
First, we show how the seed used for random number generation during model training produces distinct characteristics in the model's decision boundaries; these characteristics are inherited by extracted models and appear in their decision boundaries, but are absent from non-extracted models trained on the same dataset with a different seed.
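The following is a minimal sketch of this effect, not the implementation from the talk: two models trained on the same data with different seeds disagree in low-confidence boundary regions, and a surrogate trained only on one model's outputs tends to side with that model in those regions. The dataset, architectures, seeds, and confidence threshold below are illustrative assumptions.

```python
# Illustrative sketch: the training seed shapes decision boundaries, and a
# surrogate trained on a model's outputs inherits those seed-specific traits.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=2000, noise=0.25, random_state=0)

# Two models: same data, same architecture, different training seeds.
victim_a = MLPClassifier(hidden_layer_sizes=(32, 32), random_state=1, max_iter=1000).fit(X, y)
victim_b = MLPClassifier(hidden_layer_sizes=(32, 32), random_state=2, max_iter=1000).fit(X, y)

# Model extraction: train a surrogate on victim_a's predicted labels (query access only).
X_query = np.random.default_rng(3).uniform(-2, 3, size=(5000, 2))
surrogate = MLPClassifier(hidden_layer_sizes=(32, 32), random_state=4, max_iter=1000)
surrogate.fit(X_query, victim_a.predict(X_query))

# Probe points near victim_a's decision boundary (low-confidence region),
# where seed-specific behavior is most visible.
X_probe = np.random.default_rng(5).uniform(-2, 3, size=(20000, 2))
near_boundary = X_probe[victim_a.predict_proba(X_probe).max(axis=1) < 0.6]

ref = victim_a.predict(near_boundary)
print("extracted-model agreement:   ", (surrogate.predict(near_boundary) == ref).mean())
print("independent-seed agreement:  ", (victim_b.predict(near_boundary) == ref).mean())
```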
Based on our findings, we propose the Robust Adaptive Watermarking (RAW) Framework, which uses the unique behavior present in the protected and extracted models to generate a watermark key-set and a verification model.
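As a rough illustration only (the actual key-set construction and verification model of RAW are described in the talk), a key-set could be drawn from queries where the protected model is least confident, and a simple classifier over suspect models' responses could act as the verification model. The helper names and the logistic-regression verifier below are hypothetical stand-ins.

```python
# Conceptual sketch of key-set generation and verification; not the RAW design.
import numpy as np
from sklearn.linear_model import LogisticRegression

def response_vector(model, key_set):
    """A suspect model's predicted labels on the watermark key-set."""
    return model.predict(key_set)

def build_key_set(protected, candidate_pool, size=64):
    """Pick candidate queries where the protected model is least confident,
    i.e. where its seed-specific boundary behavior is most pronounced."""
    conf = protected.predict_proba(candidate_pool).max(axis=1)
    return candidate_pool[np.argsort(conf)[:size]]

def train_verification_model(key_set, extracted_models, independent_models):
    """Fit a classifier separating response vectors of extracted models
    (label 1) from those of independently trained models (label 0)."""
    X = [response_vector(m, key_set) for m in extracted_models + independent_models]
    y = [1] * len(extracted_models) + [0] * len(independent_models)
    return LogisticRegression().fit(X, y)

# Verification (hypothetical usage): a high score suggests the suspect model
# was extracted from the protected model.
# score = verifier.predict_proba([response_vector(suspect, key_set)])[0, 1]
```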
We show that the framework is robust to (1) unseen model extraction attacks and (2) extracted models that undergo a blurring method (e.g., weight pruning).
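For concreteness, a blurring method such as magnitude-based weight pruning can be sketched with PyTorch's pruning utilities as below; the architecture and pruning ratio are arbitrary placeholders, not those used in the evaluation.

```python
# Sketch of a "blurring" attack: prune the smallest-magnitude weights of an
# extracted model in an attempt to erase watermarked behavior.
import torch.nn as nn
import torch.nn.utils.prune as prune

extracted = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 10),
)

# Zero out the 30% smallest-magnitude weights in each Linear layer.
for module in extracted.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")  # make the pruning permanent

# Watermark verification is then run on the pruned model's key-set responses;
# the claim is that the seed-specific behavior survives this blurring.
```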
We evaluate the framework's robustness against a naive attacker (unaware that the model is watermarked) and an informed attacker (who employs blurring strategies to remove the watermarked behavior from an extracted model), achieving AUC values exceeding 0.9.
Finally, we show that the framework is robust to model extraction attacks in which the extracted model has a different structure and/or architecture than the protected model.