STAC Study - NLP topic modeling

This study was designed to illustrate how STAC Benchmarks for machine learning (ML) can be constructed and used. It is also intended to help data scientists and data engineers understand what to expect when using the data science tools and cloud products in this project and how to avoid common pitfalls.

The workload is topic modeling of SEC Form 10-K filings using Latent Dirichlet Allocation (LDA), a form of natural language processing (NLP). We used this workload to explore the question of scale-up versus scale-out in a cloud environment on three SUTs:

  1. A single Google Cloud Platform (GCP) n1-standard-16 instance with Skylake CPUs and RHEL 7.6.
  2. A single GCP n1-standard-96 instance with Skylake CPUs and RHEL 7.6.
  3. A Google Cloud Dataproc (Spark as a service) cluster of 13 n1-standard-16 Skylake instances (1 master and 12 workers) running Debian Linux 8.

A document with excerpts from the study is available for download below. The full STAC Study is available to subscribers of the Analytics STAC Track. The full study contains additional test results and detailed configuration information, extensive performance analysis, and "war stories" about working with the key tools and products in the project as a regular customer. The implementation code in the STAC source code repository and the dataset are also available. Please contact us for access.

The test design is a proposal to elicit feedback from the STAC AI Group on use cases, benchmark design, and research priorities around ML techniques and technologies.


STAC uses the term "artificial intelligence" (AI) as an umbrella term for machine learning, deep learning (and other neural approaches), and any other techniques for getting computers to do what only humans could do a few years ago. (We're not trying to get into philosophical debates, but we need a vocabulary.)