A greatly misunderstood area of data privacy is the “data” portion of the term. Many people assume that I am personally against data use simply because I work on privacy issues. But remember: data is not just a point of risk in privacy; it is also a crucial basis for decisions grounded in demonstrable, provable fact. The task, then, is to use data while preventing improper collection, handling, and use of that data from harming groups, institutions, and individuals.
To this point, the discussion in the Intel draft regarding AI is essential. The draft, in a nutshell, has a provision that prohibits “…the use or application of outputs from machine learning, algorithms, predictive analytics or similar analysis that would violate any state or federal law or regulation to wrongly discriminate against individuals or facilitate such discrimination, or deny any individual the exercise of any Constitutionally-protected right or privilege.”
This is extremely helpful, and I support this language strongly. There is room here for one addition that would assist data management, or better, knowledge governance, over the long term: language about white box analytics, a topic privacy experts almost never discuss. White box analytics means using de-identified data to accomplish data goals. Data should, when practicable, be robustly de-identified at the source, so that the data underlying important statistical research is still gathered but genuinely cannot harm data subjects. With white box approaches, raw data is not needed for analysis. Much, though not all, of the privacy harm associated with data use stems from analysis of raw data, so reducing the spread and use of raw data for analytics is an important element of a more evolved data use policy.
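To make de-identification at the source more concrete, here is a minimal Python sketch. It assumes a keyed hash (HMAC-SHA256) as the pseudonymization step; the field names, key handling, and coarsening choices are illustrative assumptions, not a prescription of any particular standard.

```python
import hmac
import hashlib

# Illustrative only: a secret key held by the data source, so that
# pseudonyms cannot be reversed by downstream analysts.
SECRET_KEY = b"source-held-secret"

def pseudonymize(value: str) -> str:
    """Replace a direct identifier with a keyed hash (HMAC-SHA256)."""
    return hmac.new(SECRET_KEY, value.strip().lower().encode(), hashlib.sha256).hexdigest()

def deidentify(record: dict) -> dict:
    """Strip direct identifiers, keeping only pseudonyms and analytic fields."""
    return {
        "subject_id": pseudonymize(record["email"]),  # stable pseudonym, no raw email
        "age_band": record["age"] // 10 * 10,          # coarsen a quasi-identifier
        "region": record["region"],
    }

raw = {"email": "Jane.Doe@example.com", "age": 37, "region": "US-Northeast"}
safe = deidentify(raw)
# `safe` carries no raw email, yet the same email always yields the same
# pseudonym, so aggregation and record linkage remain possible.
```

The point of the keyed hash is that linkage survives de-identification: statistics can still be computed per subject, while the identity itself never leaves the source.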
White box analytical techniques are already used widely in the financial sector to determine “KYC,” or know-your-customer, information using only de-identified data. This model will be key to apply across knowledge domains. As a concrete example, companies such as ThreatMetrix conduct financial sector KYC analysis using only hashed personal data in a white box machine learning model; this is a best practice. The company never works with the raw data and therefore does not know each individual's private details, yet the analysis still reports the probability that the individual is a “known” individual. KYC duties are fulfilled, and privacy and safety are preserved. There is still a need to prevent improper use of the analytic results, which the proposed bill language addresses.
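The hashed-matching idea can be sketched as follows. This is a generic illustration of matching hashed identifiers against a reference set, not ThreatMetrix's actual model; the function names, the reference set, and the scoring rule are all invented for the example.

```python
import hashlib

def h(value: str) -> str:
    """Normalize an attribute and hash it; raw values are never stored."""
    return hashlib.sha256(value.strip().lower().encode()).hexdigest()

# Hypothetical reference set of hashes for "known" individuals,
# e.g. shared by a vetting service that also never sees raw values.
known_hashes = {h("jane.doe@example.com"), h("+1-555-0100")}

def match_score(attributes: list[str]) -> float:
    """Fraction of submitted attributes whose hashes appear in the set."""
    if not attributes:
        return 0.0
    hits = sum(1 for a in attributes if h(a) in known_hashes)
    return hits / len(attributes)

score = match_score(["Jane.Doe@example.com", "+1-555-0199"])
# score is 0.5: one of the two hashed attributes matched, and neither party
# ever exchanged a raw email address or phone number.
```

A real system would add salting, keyed hashes, and a trained model over many such signals, but the privacy property is the same: the match probability is computed without either side handling the raw identifiers.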
Data use that is safe for data subjects, both individually and in groups, throughout the data lifecycle is the pathway forward. The proposed analytics language is already very good here; with additions regarding data de-identification and standards for analytics using hashed data, the result will be better than merely acceptable. It will be beneficial, and all parties will win.