White box analytics

A greatly misunderstood area of data privacy is the “data” portion of the term. Many people assume that I am personally against data use, simply because I work on privacy issues. But remember: data is not just a point of risk in privacy, it is also a crucial basis for decisions grounded in demonstrable, provable fact. So, it is necessary to use data without allowing improper collection, handling, and use of that data to harm groups, institutions, and individuals.

To this point, the discussion in the Intel draft regarding AI is essential. The draft, in a nutshell, has a provision that prevents “…the use or application of outputs from machine learning, algorithms, predictive analytics or similar analysis that would violate any state or federal law or regulation to wrongly discriminate against individuals or facilitate such discrimination, or deny any individual the exercise of any Constitutionally-protected right or privilege.”

This is extremely helpful, and I support this language strongly. There is room here for one addition that would assist data management, or better, knowledge governance, over the long term: language about white box analytics, a topic privacy experts almost never discuss. White box analytics uses de-identified data to accomplish analytic goals. Data should, when practicable, be robustly de-identified at the source, so that the data underpinning important statistical research is still gathered but genuinely cannot harm data subjects. With white box approaches, using raw data for analysis is not necessary. Much, but not all, of the privacy harm associated with data use stems from analysis of raw data, so reducing the spread and use of raw data for analytics is an important aspect of a more evolved data use policy. A brief sketch of what source-level de-identification might look like follows.
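As a rough illustration only, and not anything prescribed by the draft, the following Python sketch shows one way “de-identified at the source” could work in practice: direct identifiers are replaced with keyed hashes before a record ever reaches an analytics pipeline, while non-identifying fields pass through for statistical use. The field names and the secret key here are hypothetical.

```python
import hmac
import hashlib

# Hypothetical secret key held only by the collecting organization; without it,
# third parties cannot reproduce or reverse the hashes.
PEPPER = b"replace-with-a-securely-stored-secret"

IDENTIFIER_FIELDS = {"name", "email", "ssn"}  # assumed direct identifiers

def deidentify(record: dict) -> dict:
    """Return a copy of the record with direct identifiers replaced by keyed hashes.

    Non-identifying fields (e.g., amounts, timestamps) pass through unchanged,
    so statistical analysis remains possible without exposing raw identities.
    """
    safe = {}
    for field, value in record.items():
        if field in IDENTIFIER_FIELDS:
            safe[field] = hmac.new(PEPPER, str(value).encode("utf-8"), hashlib.sha256).hexdigest()
        else:
            safe[field] = value
    return safe

# The analytics pipeline only ever receives the de-identified form.
raw = {"name": "Jane Doe", "email": "jane@example.com", "amount": 120.50}
print(deidentify(raw))
```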

White box analytical techniques are already widely used in the financial sector to determine “KYC,” or know your customer, information using only de-identified data. Applying this model across knowledge domains will be key. As a precise example, companies such as ThreatMetrix conduct financial sector KYC analysis using only hashed personal data in a white box machine learning model, which is a best practice. The company does not work with the raw data, so it does not know the private details of each individual, but the analysis will tell it the probability that the individual is a “known” individual. KYC duties are fulfilled, and privacy and safety are preserved. There is still a need to prevent improper use of the analytic results, which the proposed bill language addresses.
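ThreatMetrix has not published its internals here, so the following is a simplified, hypothetical Python sketch of the general idea only: both the watchlist of “known” individuals and the incoming customer attributes exist solely as keyed hashes, and the analyst sees a match score rather than any raw identifier. The attribute values, key, and scoring rule are invented for illustration.

```python
import hmac
import hashlib

PEPPER = b"hypothetical-shared-keying-material"

def keyed_hash(value: str) -> str:
    """Hash an identifier so it can be compared without revealing the raw value."""
    return hmac.new(PEPPER, value.encode("utf-8"), hashlib.sha256).hexdigest()

# Watchlist of "known" individuals, stored only as hashes.
KNOWN_HASHES = {keyed_hash("jane@example.com"), keyed_hash("+1-555-0100")}

def match_score(hashed_attributes: list) -> float:
    """Fraction of a customer's hashed attributes that match the known set.

    The analyst never sees the underlying email addresses or phone numbers,
    only whether their hashes coincide with hashes already on file.
    """
    if not hashed_attributes:
        return 0.0
    hits = sum(1 for h in hashed_attributes if h in KNOWN_HASHES)
    return hits / len(hashed_attributes)

# Attributes are hashed at the source before the analyst ever receives them.
customer = [keyed_hash("jane@example.com"), keyed_hash("+1-555-0199")]
print(f"match score: {match_score(customer):.2f}")  # prints 0.50 for this example
```

A simple scoring step like this is only a stand-in for whatever model a real KYC provider uses; the point is that the comparison happens entirely over de-identified values.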

Data use that is safe for data subjects, both individually and in groups, throughout the data lifecycle is the pathway forward. The proposed analytics language is already very good here, but with additions regarding data de-identification and standards for analytics using hashed data, progress will be better than just acceptable: it will be beneficial, and all parties will get a win.

4 comments

  1. Annie Anton
    I also like the fact that the proposed bill prevents “the use or application of outputs from machine learning, algorithms, predictive analytics or similar analysis that would violate any state or federal law or regulation to wrongly discriminate against individuals or facilitate such discrimination, or deny any individual the exercise of any Constitutionally-protected right or privilege.” Having said that, for years I’ve advocated for technology-neutral laws because technology evolves so quickly. When specific technologies are codified in law, the law can be rendered obsolete in a few years and/or lead to entrenchment in outdated technologies. White-box analytics are thought of as ways to improve credibility and trustworthiness. They accomplish these two objectives via transparency. Transparent algorithms allow us to answer “why?” questions so that we can refine and improve our analytical approaches. Thus, I would prefer to see the existing language concerning machine learning focused on the objectives that we want to see achieved, specifically “transparency.”

  2. Pam Dixon
    Well-said, Annie. I agree completely. Transparency is the key, and it is essential to convey the ideas driving white-box analytics while retaining technology-neutral language. Otherwise, as we have all painfully learned, aging language regarding technologies (as is readily found in ECPA) invariably creates meaningful gaps in intended protections.

  3. Omer Tene
    Specific wording comments: In 4(d)(4), I think the obligation “shall only be done after the covered entity conducts an assessment….” should be made ongoing. That is, organizations should be required to reassess this periodically, not just at the point of deployment. This is particularly important given the “black box” characteristic – and potentially covert discrimination – of machine learning. To paraphrase Justice Stewart, you only know it when you see it.
    Per 4(d)(4)(C), an organization may proceed with automated decision making only if it “Concludes that, after all reasonable steps are taken to mitigate privacy risk, the automated processing does not cause, or is not likely to cause, substantial privacy risk.” I’m not sure this is the right standard. In some cases, we want the organization to proceed even in the presence of substantial privacy risk; it depends on what other interests are at stake. Obvious examples are TSA body scanners and DHS cyber defense, but also commercial uses, KYC by banks, etc. I think there needs to be an additional balancing provision/step.

    • David Hoffman

      An excellent recommendation. I completely agree and it is on the list of necessary changes for the next draft. Please keep them coming!