Jason Atchley
Law Technology News
Read more: http://www.lawtechnologynews.com/id=1202643337112/Predictive-Coding-Is-So-Yesterday
Predictive Coding Is So Yesterday
Computational linguistics and data mining are among the tools that will drive e-discovery in coming years.
Editor's Note: This article was chosen in a blind competition by the Arizona State University-Arkfeld E-Discovery and Digital Evidence Conference. The three winners have been invited to present their papers during the conference, which will be held March 12-14 at ASU's Sandra Day O'Connor College of Law, in Tempe, Ariz. See also, "Vendor Voice: Yes, Counselor, There Will Be Math."
Today, much of the electronic data discovery industry is racing to improve predictive coding, which is only one approach to technology-assisted review, and one with inherent limitations. Instead of refining predictive coding, tomorrow's innovative EDD game changers will employ computational linguistics, data mining, language translation, corpus-based content analysis, and case-specific information supplied in the form of natural language inquiries.
This is not to say that others haven't attempted to apply these technologies to EDD. However, the significant advances in quality and the cost-reducing innovations to come will be driven by integrating techniques from these disciplines.
Predictive coding depends critically on "training sets" built by one or more human reviewers through manual review. The quality of these training sets largely determines the recall and precision the technology achieves because, to date, predictive coding tools apply information from training sets but do not correct reviewer errors within them. While current offerings use a variety of methods for selecting electronically stored information to be reviewed, none of these methods actually helps the user make correct markings. The lack of analysis on the front end of today's predictive coding offerings places an upper limit on its effectiveness.
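To make the recall and precision terms concrete, here is a minimal Python sketch, purely illustrative and not drawn from any particular review tool, that scores a set of predicted responsiveness markings against a set of reference markings; the document IDs and the recall_precision helper are hypothetical.

```python
# Illustrative only: compute recall and precision for predicted
# "responsive" markings against a reviewer's reference markings.

def recall_precision(reference, predicted):
    """reference, predicted: dicts mapping document ID -> True (responsive) / False."""
    true_pos = sum(1 for doc, label in predicted.items() if label and reference.get(doc))
    pred_pos = sum(1 for label in predicted.values() if label)
    actual_pos = sum(1 for label in reference.values() if label)
    recall = true_pos / actual_pos if actual_pos else 0.0
    precision = true_pos / pred_pos if pred_pos else 0.0
    return recall, precision

# Toy example: if the reference markings themselves contain reviewer
# errors, these measured scores silently cap the tool's apparent quality.
reference = {"DOC1": True, "DOC2": False, "DOC3": False, "DOC4": True}
predicted = {"DOC1": True, "DOC2": False, "DOC3": True, "DOC4": True}
print(recall_precision(reference, predicted))  # (1.0, 0.666...)
```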
Once predictive coding applies a training set to a population of unreviewed ESI, human reviewers must once again review a selected set of ESI marked by the technology. Once again, the reviewers' markings critically determine the accuracy and completeness of the next round of predictive coding markings of the unreviewed ESI. In reality, a reviewer can mark ESI inconsistently while creating a training set or while reviewing predictive coding markings, and receive no feedback or error checking either way.
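As a hedged illustration of the kind of consistency check the article says is missing today, the sketch below flags documents whose normalized text is identical but whose markings disagree across review rounds; the find_inconsistent_markings helper and the sample markings are invented for this example.

```python
# Illustrative only: flag documents with identical normalized text
# but conflicting responsiveness markings.
from collections import defaultdict

def find_inconsistent_markings(markings):
    """markings: list of (doc_text, is_responsive) pairs from one or more rounds."""
    by_content = defaultdict(set)
    for text, label in markings:
        by_content[" ".join(text.lower().split())].add(label)
    return [text for text, labels in by_content.items() if len(labels) > 1]

rounds = [
    ("Quarterly forecast attached.", True),
    ("quarterly  forecast attached.", False),  # same content, opposite marking
    ("Lunch on Friday?", False),
]
print(find_inconsistent_markings(rounds))  # flags the conflicting pair
```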
Perhaps the most powerful claims made for predictive coding are also the most damning. The claim that it is merely as accurate and complete as human review is best evaluated the way we evaluate all technology: technology makes our lives better because we travel faster, hear better, see farther, lift more weight, and drill smaller holes than we possibly could without it. It is certainly true that predictive coding lets us review more ESI than human reviewers could alone, yet that advantage stems from large storage capacity and fast processors running the tools, not from the predictive coding technology itself.
Predictive coding brings limitations along with its advantages. Its dependence on the accuracy of human review, review that takes place without feedback or error checking, will limit its recall and precision until some form of pre-processing relates ESI content and thereby enables error checking. Without content analysis and relationship identification, human review errors propagate into the technology, especially when those errors occur consistently.
The Future of Predictive Coding
Technology-assisted review tools of the future will analyze ESI content using computational linguistics, without any user input and far beyond keywords, key phrases, or training sets subject to human error. Content analysis will allow powerful categorization of ESI based on data mining and language translation techniques, as sketched below.
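The article does not name a specific algorithm, but one plausible reading of content-driven categorization is unsupervised clustering over text features. The sketch below, assuming scikit-learn is available, groups a toy set of documents using TF-IDF features and k-means; it is only one of many possible approaches and the sample documents are invented.

```python
# Illustrative only: unsupervised categorization of ESI text with
# TF-IDF features and k-means clustering -- no training set required.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

documents = [
    "Board approved the merger terms on Friday.",
    "Merger agreement draft circulated to counsel.",
    "Team lunch moved to the new cafe downtown.",
    "Cafe reservation confirmed for twelve people.",
]

vectors = TfidfVectorizer(stop_words="english").fit_transform(documents)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
for label, doc in zip(labels, documents):
    print(label, doc)  # reviewers would then work category by category
```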
Once the ESI is categorized, users can review categories rather than some set number of individual items, as predictive coding requires. Human review will take place with error checking and marking-consistency feedback, made possible by information theory measures computed over categories of semantic meaning. That meaning will be determined by the content of the ESI population itself, not by the error-prone training sets of predictive coding.
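The article does not define its "information theory measures," but one plausible candidate is the entropy of reviewer markings within a semantic category: zero entropy means the category has been marked consistently, while high entropy flags it for immediate attention. The sketch below is a hypothetical illustration of that idea, not a description of any existing tool.

```python
# Illustrative only: entropy of responsiveness markings within one
# semantic category; high entropy signals inconsistent review.
from math import log2

def marking_entropy(markings):
    """markings: list of booleans (responsive / not) within one category."""
    if not markings:
        return 0.0
    p = sum(markings) / len(markings)
    if p in (0.0, 1.0):
        return 0.0
    return -(p * log2(p) + (1 - p) * log2(1 - p))

print(marking_entropy([True, True, True, True]))    # 0.0 -> consistent
print(marking_entropy([True, False, True, False]))  # 1.0 -> flag for review
```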
Because these methods analyze each data set based on its own semantic content, they will adapt to a wide range of ESI and will, in fact, be content driven. In other words, there will be no training sets, and no user-defined categories, for the algorithms to learn from or compare against.
The categories derived from the semantic meaning of ESI within a data set will support fast corpus-based analysis by human reviewers. As reviewers mark semantic categories rather than individual items, the systems of tomorrow will concurrently compare each action and marking against existing categories and markings, providing feedback that improves the review process in real time rather than overnight through a learning cycle.
That feedback will help the reviewer take correct actions and make consistent markings. Such on-the-fly consistency checking will elevate human reviewers into more powerful reviewers, increasing the accuracy, speed, and consistency of review. Rather than a predictive coding tool straining to understand user actions and make the best of user errors and inconsistencies, the technology will create a super-user able to produce better results faster and more cheaply. And if new issues arise, the super-user need only return to the already analyzed and categorized ESI to locate pertinent material; no new training set must be built, and there is no further cycle of "review-train-revise-train…" to suffer through, wait for, and pay for.
The quality of future technologies will rest on the ESI itself, the analysis and categorization algorithms, and the human reviewer, transformed into a super-user who benefits from error checking and consistency measures. The integration of these fields will bring new tools to TAR, just as earlier technologies expanded human power, speed, and other abilities. In fact, innovators not bound by years of investment in predictive coding are bringing such technologies to market today. These tools will make predictive coding the analog technology of yesterday.
Joel Henry is an attorney, professor of computer science, and IT legal advisor at the University of Montana, based in Missoula.