I-STUDIO is an automatic assessment dataset for both NLP and statistics education research. It consists of student responses to an instrument that assesses the statistical reasoning skills of college students: The Introductory Statistics Transfer of Understanding and Discernment Outcomes (I-STUDIO). The instrument comprises three scenarios, each followed by two short-answer question prompts. A detailed rubric was used for assessment, placing each answer on a three-way scale of {incorrect, partially correct, correct}. The rubric provides an intensional specification of the criteria for each correctness value, along with an extensional specification consisting of example student answers: for each question, a reference answer is given, together with two student answers, one correct (2) and one partially correct (1).

The data collection used a sample of 1,935 students (from colleges across the USA and elsewhere) who completed the I-STUDIO instrument, which included six open-ended questions with one or more parts each, while describing their thinking out loud. For the reliability study, the I-STUDIO investigator provided two sets of labels: his original labels from 2015 and a new labeling he applied in 2021. The other two raters were graduate students in statistics, trained by the investigator. The data is fully de-identified; the original study had IRB approval, and that IRB confirmed that the data can now be made public for research purposes in this de-identified form.

To assemble the final Col-STAT dataset for NLP research, we performed data cleaning to eliminate non-answers and responses that were unusually long (more than 125 word tokens). The cleaned data was then partitioned into training, validation, and test sets in the proportion 8:1:1.
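The cleaning step can be sketched as a simple length filter (a minimal sketch: whitespace tokenization and the treatment of empty strings as non-answers are assumptions here, not necessarily the authors' exact procedure):

```python
def clean_responses(responses):
    """Drop non-answers (empty after tokenization) and unusually
    long responses (more than 125 word tokens)."""
    cleaned = []
    for text in responses:
        tokens = text.split()  # assumed whitespace tokenization
        if not tokens:         # non-answer
            continue
        if len(tokens) > 125:  # unusually long
            continue
        cleaned.append(text)
    return cleaned
```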
The test set was initialized to contain all the responses that the human raters had graded, so that at test time, inter-rater agreement between model predictions and the reliable human labels can be computed. The remaining data was then randomly partitioned, resulting in the training (N=5,018), validation (N=627), and test (N=627) sets.
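The seeded split can be sketched as follows (a minimal sketch: the function name, the seed handling, and the assumption that the multiply-graded responses fit within the 10% test share are all hypothetical, not taken from the source):

```python
import random

def split_dataset(graded, ungraded, seed=0):
    """Seed the test set with all human-graded responses, then
    randomly partition the rest toward an overall 8:1:1 split.
    Assumes len(graded) does not exceed the 10% test share."""
    rng = random.Random(seed)
    rest = list(ungraded)
    rng.shuffle(rest)
    n_total = len(graded) + len(rest)
    n_test = n_total // 10
    n_val = n_total // 10
    # Top up the test set from the shuffled pool, then carve out
    # validation; everything left is training data.
    n_top_up = n_test - len(graded)
    test = list(graded) + rest[:n_top_up]
    val = rest[n_top_up:n_top_up + n_val]
    train = rest[n_top_up + n_val:]
    return train, val, test
```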