“All Graphs Are Wrong, but Some are Useful” with Xan Gregg

Data visualization is our most efficient tool for understanding information, but it’s far from perfect. Collected data is an imperfect representation of the underlying information. A graph is an imperfect representation of the data. Our understanding is an imperfect representation of the graph. But don’t despair. Xan Gregg, creator of the Graph Builder, will talk about how understanding visual perception can help us make more effective data visualizations.

Xan Gregg leads data visualization development at JMP, a business unit of SAS that specializes in data visualization software. He created the Graph Builder feature introduced in JMP 8 and continues to be its principal developer. Gregg is a frequent contributor to the JMP Blog and is known for his series of graph makeover posts. He is founder of the One Less Pie social media campaign, which seeks to replace inappropriate pie charts with better alternatives. Gregg participates in the online JMP User Community and often speaks at customer events like the JMP Discovery Summit. At the inaugural Discovery Summit Europe, Gregg won the award for Best Invited Paper.

Gregg is an active participant on visualization question-and-answer sites Cross Validated and HelpMeViz. In 2006, he won first place in Business Intelligence Network’s Data Visualization Competition. Gregg has participated in volunteer hackathons, including one that produced a highly acclaimed graphic for the 2015 Hunger Report. His primary fields of interest are exploratory data analysis and information visualization. He is also a regular at RTA Meetups!

RTA is proud to support Data4Decisions this year!

 Presentation: Spark vs. Hadoop for Big Data

As the landscape of big data processing engines rapidly evolves, many are left wondering which tool they should use: Hadoop, Spark, or something else? In this session, you’ll learn which engines are the best fit for specific use cases, skill sets, code deployments, and resource constraints. We’ll provide a framework for deciding on the right tool for your job and help you determine when you shouldn’t choose one over the other, but rather use both in conjunction. You can also expect to learn how cloud technologies are simplifying infrastructure to make it easier to leverage multiple data engines and avoid lock-in.

Presenters: Technology experts (Kalen Zhang and Phil D’Agostino) from Qubole


Analytics_Forward 2016 unconference hosted by and for analytics professionals.

“Opening the Black Box: A Users Guide to Optimization” with Melinda Thielbar

Statistical models, recommendation engines, data visualization, and artificial intelligence all have one thing in common: They use algorithms to find the best answer (or to rank potential answers so the user can pick one). Numerical optimization is at the heart of this process. Understanding how the computer finds its answers can help you get the most out of your data–no matter what analysis software you use.

Intended as a user’s guide to numerical algorithms, this talk shows the most common types of computer optimization, how and why they’re used in different kinds of data analysis, and their underlying assumptions. Examples with visualizations are shown in JMP software, but the concepts apply to any statistical or data visualization package.

Whether you’re new to data analysis or writing your own optimization routines, this presentation demonstrates a useful way to think about computer optimization that will open up the black box between “data in” and “answers out”.

Melinda Thielbar, PhD (JMP Senior Research Statistician Developer, SAS) currently specializes in statistical methods for consumer research and categorical data, though she has experience as far ranging as naval power system analysis, fraud detection and Hollywood script consulting. Melinda is a Co-Founder of Research Triangle Analysts, a crazy cat lady, and an enthusiastic amateur artist.

Special Interest Group: JMP – Kickoff Meeting

Special Interest Group: Sport Analytics – Kickoff Meeting

Do you get excited using math with sports data? Did you learn analytics through sports? Want to share opportunities to use sports analytics to teach, for a hobby, for research, for a career?

This is our kickoff meeting for this new subgroup. There will be a brief overview of sports analytics and then the floor will be open for discussion on future topics/speakers. Feel free to bring use cases for discussion.

“Topological Data Analysis” with Hamza Ghadyali

“Data has shape– and shape has meaning.” [1] Topology is the mathematical study of shape and in the past decade TDA tools have been applied to large, noisy, complex datasets to understand problems in many science and engineering disciplines including oncology, astronomy, and neuroscience.  In this talk, I’ll explain what topology is, briefly go into the mathematics of persistent homology and Morse-filtrations, and discuss some applications in signal processing, clustering, and pattern recognition.  To ensure that everyone gets something out of the talk, pictures will be emphasized over formulas.
Hamza Ghadyali is a Ph.D. candidate in mathematics at Duke, developing new TDA tools, in particular for the analysis of EEG data from people with epilepsy.

November is planning month at Research Triangle Analysts. Join us for a beverage while we all talk about what brought you to our past events (the survey is now closed) and what will bring you to our future events.

 NC Data4Good: Data Crunch

We have partnered with United Way of the Greater Triangle, Data Crunch Lab, and MaxPoint to address childhood hunger and food insecurity here in the Triangle (results from the 2015 Data Crunch for Social Good).


Board Meeting

Melissa Nysewander presented: “Applied Data Science: A Case Study in Workforce Analytics”

Data science is more than a single algorithm or technology; it is a methodology tying together scientific reasoning, hypothesis testing, machine learning, and statistics. It is about knowing enough programming to grab and manipulate data at the finest grain, and enough statistics to extract real (not spurious) insights. But it is also about being able to ask the right questions, design meaningful tests, and in the end, communicate results to the people making decisions. This talk is a practical explanation of what it takes to successfully execute an enterprise-level data science project, from beginning to end, emphasizing both the soft and hard skills necessary to do so. To illustrate, our speaker will present a recent case study in workforce analytics in which she performed text analysis using Python & R on scraped web data.

Danny Siegle presented: “Machine Learning and the Life Sciences”

This presentation covers machine learning from different biological domains, together with working code examples, including:
1. QSAR prediction for drug discovery
2. A Next-Gen Sequencing application
3. Code example from the recent Kaggle diabetic retinopathy competition (diagnostic image analysis)

The goal of this talk is not to cover technical details of every method but to help biologists see the value of machine learning and statisticians understand opportunities in the biological sciences.
The presentation notebook is on GitHub, so that anyone interested in taking a deeper dive can run the code and see the results.


Lucia Gjeltema presented: “SparkR – distributed computing in R using Spark clusters”

Description (code for the demo):
Data processing and machine learning tasks in R are usually limited to data sets that fit in the memory of one single machine. The new R frontend to Apache Spark, called SparkR, harnesses Spark’s distributed computing powers to run large-scale data analysis directly in R. Originally an R package, SparkR is now officially merged into Apache Spark (since release 1.4 in June 2015).
This talk introduces SparkR and one of its core components – the SparkR DataFrame, a way of bringing distributed computing capabilities to the world of data frames.


Chris Calloway presented: “Python Data Science with Pandas”

Description (video of the talk):
Pandas is a software package providing R-like “data frame” wrangling in Python. We interactively explore
• Data input and output
• Data transformation
• Data analysis
• Data visualization
with Pandas using some interesting data to answer contemporary social questions.


Ian Cook presented: “Working with Geospatial Data”

Ian Cook talks about working with Geospatial data for analysis, including:
• What spatial data is.
• Where you can find spatial data to work with.
• The challenges of working with spatial data.
• Key facilities R provides for loading, manipulating, and analyzing spatial data.
Demonstrations are in R and Spotfire, with an open discussion for others to talk about how they work with spatial data using their preferred tools.

Cyber Security Mini-Hackathon

This is a joint meetup with the Big Data and Cyber Security Meetup. Bring your laptops and be ready to work with security experts to understand network data and develop ways to analyze it!

We will be working with one of the data sets from this site: (suggestions for which data set from this list are welcome).
If you don’t have it already, you will probably want the free program WireShark on your laptop:
Also, have your analytics program of choice loaded up and ready to go!

Analysis and results can be posted on


Rajesh Seluklar presented: “Functional Modeling of Longitudinal Data”

Description [link to Rajesh's paper]:
In many studies, a continuous response variable is repeatedly measured over time on one or more subjects. The subjects might be grouped into different categories, such as cases and controls. The study of resulting observation profiles as functions of time is called functional data analysis. This paper shows how you can use the SSM procedure in SAS/ETS® software to model these functional data by using structural state space models (SSMs). A structural SSM decomposes a subject profile into latent components such as the group mean curve, the subject-specific deviation curve, and the covariate effects. The SSM procedure enables you to fit a rich class of structural SSMs, which permit latent components that have a wide variety of patterns. For example, the latent components can be different types of smoothing splines, including polynomial smoothing splines of any order and all L-splines up to order 2. The SSM procedure efficiently computes the restricted maximum likelihood (REML) estimates of the model parameters and the best linear unbiased predictors (BLUPs) of the latent components (and their derivatives). The paper presents several real-life examples that show how you can fit, diagnose, and select structural SSMs; test hypotheses about the latent components in the model; and interpolate and extrapolate these latent components.


 “Have an idea, need an idea” – An Unmeeting

This is ‘have an idea/need an idea’. Come with a question about analytics, something cool you’ve done, or a problem that has you stumped. Get some feedback from your fellow analysts about where you can look for more resources or go next!
We’ll be sitting at tables of 6 or so, so use this discussion board here to get the conversation started about what we should discuss!

Analytics Forward – An Unconference

Analytics Forward is a free unconference by and for analytics professionals. Thanks to our amazing sponsors Blue Cross and Blue Shield of North Carolina, JMP, MaxPoint, and NCDS, we spent a Saturday at the Blue Cross and Blue Shield of North Carolina campus learning about the latest techniques, trends, and tools in analytics.


Tim Hopper presented: “Pyspark”

Description (slides and code):
Apache Spark is a next generation cluster computing framework and data processing engine. By combining Spark’s primitive operations in a functional style, the user can perform complex computations on large datasets. Though similar to Hadoop, Spark relies much more heavily on RAM (instead of HDFS) and has been demonstrated as running up to 100x faster than Hadoop for some applications. This talk will introduce Spark in general and then show PySpark, the Python wrapper around core Spark, as a tool for rapid, interactive analytics as well as robust, production data pipelines. Finally, we will look at MLlib, Spark’s distributed machine learning library.

Tim Hopper is a software engineer at a web analytics startup. He has a master’s in operations research from North Carolina State University.


 Grant Ingersoll presented: “Solr 5: scalable search and analytics in one place”

Search engine technology is rapidly evolving from keyword-based lookup to a highly sophisticated ranking engine capable of incorporating many different features across complex data types. With the pending release of Apache Solr 5, it is now possible to ask more interesting questions of multi-structured content than ever before. In this talk, we’ll explore how Solr 5 provides a number of new and interesting features for analysts, ranging from incredibly easy data ingest to advanced faceting and statistical capabilities, and why Solr should be in every analyst’s toolbox.

Grant is the CTO and co-founder of LucidWorks, co-author of “Taming Text” from Manning Publications, co-founder of Apache Mahout and a long-standing committer on the Apache Lucene and Solr open source projects. Grant’s experience includes engineering a variety of search, question answering and natural language processing applications for a variety of domains and languages. He earned his B.S. from Amherst College in Math and Computer Science and his M.S. in Computer Science from Syracuse University.



Plan next year! November is planning month at Research Triangle Analysts. Come have a beer and talk about what you want to do next year. We will be sending out a survey and go through the findings during our meetup.

Steve Geringer presented: “How to Build Effective Machine Learning Applications”

Machines are getting smarter every day. How do they do it? What will be left for the humans once the machines completely take over? Learn how you can contribute to the subjugation of mankind by building your own machine learning applications. While we won’t cover sci-fi or philosophical aspects, we will cover many important technical considerations for building effective machine learning (ML) applications.
Steve Geringer is a Triangle-area software consultant and ML enthusiast.

Elizabeth Claassen presented: “Improved Inference in Generalized Linear Mixed Models”

In small samples it is well known that the standard methods for estimating variance components in a generalized linear mixed model (GLMM), pseudo-likelihood and maximum likelihood, yield estimates that are biased downward. An important consequence of this is that inferences on fixed effects will have inflated Type I error rates because their precision is overstated. We introduce a new method for estimating parameters in GLMMs that applies a Firth bias adjustment to the maximum likelihood-based GLMM estimating algorithm. We apply this technique to one- and two-treatment logistic regression models with a single random effect. We show simulation results that demonstrate that the Firth-adjusted variance component estimates are substantially less biased than maximum likelihood estimates and that inferences using the Firth estimates maintain their Type I error rates more closely than the standard methods.

Laurel Trantham presented: “Utilization and Substitution of Urgent Care, Emergency Departments, and Primary Care Physicians”

Blue Cross Blue Shield is always looking to reduce healthcare costs. One driver of high costs is that many individuals receive medical care at emergency rooms when urgent care centers and primary care offices may be more appropriate sites of care.  Laurel Trantham reviewed some of the analysis in this area, including why this is important to explore, and discussed several modeling approaches being considered.

Brian Fannin presented: “Statistics Without Borders”

Brian Fannin shared his experience as part of ‘Statistics Without Borders’. The team spent a week in Africa teaching R and statistical modeling to members of the Rwandan Biomedical Center.

Brian is not a proper statistician (he’s an actuary), but he loves R, loves to travel and loves to try and make the world a better place through data. He especially loves doing all three at once.

Mason DeCamillis presented: “Introduction to Julia”

Julia is a relatively new programming language that aims to blend the good parts of Matlab, C, R, and Python (with fewer of the bad parts). Its growth in popularity makes it an increasingly promising option for programmers doing technical, computationally intensive work. This presentation explored the advantages of Julia in a data analysis context, with examples from both the base library and several user-written packages.

Mason DeCamillis is a statistical programmer and data analyst with a Master’s degree in Applied Statistics and a knack for crashing his computer by testing out experimental software. He is cautiously enthusiastic about Julia and is excited to share with Research Triangle Analysts.

Joseph Morgan presented: “Covering Arrays”

Software (and analytical model) testing may require considering hundreds or thousands of parameters. Usual “test all” or “full factorial” methods can require too many runs to be practical. Covering arrays make it possible to consider “full coverage” of a software suite with a smaller number of runs.
Joseph Morgan, Senior Software Developer for JMP at SAS Institute, presented his research on this important field.
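As a rough illustration of the covering-array idea described above, the sketch below checks pairwise (strength-2) coverage for a toy case with three two-level parameters. The function name `covers_all_pairs` and the example array are assumptions made for illustration; this is not the method presented in the talk or the JMP implementation.

```python
from itertools import combinations, product


def covers_all_pairs(runs, levels):
    """Return True if, for every pair of parameters, every combination
    of their values appears together in at least one run."""
    for i, j in combinations(range(len(levels)), 2):
        needed = set(product(range(levels[i]), range(levels[j])))
        seen = {(run[i], run[j]) for run in runs}
        if not needed <= seen:
            return False
    return True


# Three hypothetical parameters with two levels each.
levels = [2, 2, 2]

# "Test all": every combination of every parameter.
full_factorial = list(product(*(range(k) for k in levels)))

# A hand-built strength-2 covering array with half as many runs.
covering_array = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]

print(len(full_factorial), "full-factorial runs")
print(len(covering_array), "covering-array runs")
print("all pairs covered:", covers_all_pairs(covering_array, levels))
```

The saving is small here, but it grows quickly with the number of parameters, because pairwise coverage only requires each pair of values to appear together once rather than every full combination.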

Have an Idea, Need an Idea!

1) Come in with an idea you’d like to discuss – either a problem that you’re stuck on, or a great idea where you’d like some feedback.
2) Be ready to present to a small group of about 4-6 people while you enjoy the great food and craft beer at Mattie B’s. This is a sit-down presentation. You can bring your laptop and show some code if you want, but this is mostly a chance to “think out loud” with some interested folks.
One of the biggest interest areas from the Feedback survey was “Want to connect with peers,” but the social events got the most votes for “least favorite” meetings. This is a chance to find people who are interested in some of your favorite topics!

Dan Kelly presented: “Random Forests and Boosted Trees”

One of the most-used predictive modeling techniques, the decision tree is easy to interpret and has good predictive properties. But single decision trees can overfit your data and give misleading results. How do you decide when the tree has enough “branches”? Enter the random forest.

We had a discussion on our new mission statement at this meeting (attendees shared with us how they would like us to serve them and what they envision the Research Triangle Analysts to become in the future). After party at MEZ Contemporary Mexican Restaurant.

We brainstormed on starting a nonprofit organization.

Lucia Gjeltema presented: “Community Detection”

Network graph analysis is a hot topic in social media, fraud detection, and academia. In many applications, networked individuals end up in one large “clump”, making further analysis nearly impossible. Community detection is one way to break a huge graph into small meaningful groups for real-world analyses.
Various structural definitions of graph communities were introduced and an overview of algorithms that capture them was given. The presentation was concluded with a review of performance metrics that compare detected communities with ground-truth information.
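To make the idea of a structural definition of communities concrete, here is a minimal sketch that scores a partition of a small graph using modularity, one widely used measure. The abstract does not say which definitions or algorithms the talk covered, so the choice of modularity, the function name `modularity`, and the toy graph are all assumptions.

```python
def modularity(edges, communities):
    """Modularity of a partition of an undirected, unweighted graph.

    edges: list of (u, v) pairs, each edge listed once.
    communities: list of sets of nodes that together cover every node.
    """
    m = len(edges)
    community_of = {
        node: idx for idx, members in enumerate(communities) for node in members
    }
    internal = [0] * len(communities)    # edges with both ends inside a community
    degree_sum = [0] * len(communities)  # total degree of each community's nodes
    for u, v in edges:
        degree_sum[community_of[u]] += 1
        degree_sum[community_of[v]] += 1
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += 1
    return sum(
        internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
        for c in range(len(communities))
    )


# Toy graph: two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]

one_clump = [{0, 1, 2, 3, 4, 5}]
two_groups = [{0, 1, 2}, {3, 4, 5}]

print("one clump: ", modularity(edges, one_clump))
print("two groups:", modularity(edges, two_groups))
```

Splitting the graph at the bridge scores higher than leaving everything in a single clump, which matches the intuition that communities are densely connected inside and sparsely connected to each other.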

We discussed starting a nonprofit organization.

Tim Hopper presented: “Intro to Scikit-Learn”

Scikit-learn is an actively developed Python package providing an implementation of many machine learning algorithms (e.g. SVM, kNN, linear models, HMM, k-Means, spectral clustering). However, the benefits of Scikit-learn go well beyond carefully implemented learning algorithms. Being built in Python, it allows easy integration with countless other Python modules for tasks such as plotting, data munging, and application development. Its consistent API across algorithms allows for rapid experimentation with multiple learning methods. Also, Scikit-learn is well documented and provides lots of examples.

Instead of discussing particular machine learning algorithms provided by the package, I will focus on Scikit-learn and Python as a toolkit for solving data problems from start to finish. I will emphasize the Pipeline tool which allows the user to chain together all the steps of a machine learning pipeline including preprocessing, dimensionality reduction, feature selection, and model fitting.

Plan next year! This has been a great year for RTA. We now have 100 members on Meetup, and we’ve had some amazing speakers and guests. Help us plan to make the group even bigger and better next year.

November is RTA planning month! Join us for a lunchtime roundtable on where the analytics field is heading and what we should do next year.

Dahl Winters presented: “Scaling the Big Data Mountain”.

In this whirlwind hour I will attempt to blaze a trail through the wilderness that is big data science.  Given a mountain of unstructured data and the jungle of options in the Hadoop ecosystem, it can be difficult to know which tools to use for which investigations.  We will take a guided tour of the most common Hadoop use cases, peer into NoSQL and graph databases, march over to machine learning, avoid sinking into deep learning, and cover some of the classification and clustering algorithms I’ve worked with in my big data explorations.  If you can survive this hour unscathed, you will be that much more prepared to tackle your own big data mountain.

“Big Data Analytics and CyberSecurity”

No food at this event. After party instead at Trali Irish Pub's NEW LOCATION.

Big data is expected to play a crucial role in the cybersecurity landscape. Learn how the security industry is using big data analytics and integrating Artificial Intelligence techniques (statistical analysis, autonomic/agent-based computing, ensemble classification, game-theoretic self-optimization) within the framework of distributed, intelligent, and forward-thinking security architecture. For example, Cisco is using these techniques to create solutions in the domain of Network Behavior Analysis (NBA), in order to fight against modern sophisticated attacks in today’s cyberspace, including Advanced Persistent Threats (APT), exploit kits, zero-day attacks, polymorphic malware and trojans inside the client’s network.

“Tool Throwdown: Kaggle competition – Titanic dataset”

RTA founders demonstrated their predictive modeling skills using their favorite statistics and programming tools. On display were SAS, R, JMP, and maybe more!
Description: Analyzing the Titanic data set from the Kaggle competition.

Oscar Boykin presented: “Sketching and Streaming: building large-scale, real-time relevance features at Twitter”. 

We will discuss approximation algorithms for fast, cheap and accurate aggregation, which are used in production at Twitter. We will also briefly cover the open source software we released to do this: scalding, algebird and storm.
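For readers unfamiliar with approximate aggregation, the sketch below shows a Count-Min sketch, one well-known example of this family. The abstract does not list which algorithms the talk covered, and this is not Twitter's production code; the class, its parameters (`width`, `depth`), and the hashing scheme are assumptions chosen for a short, self-contained illustration.

```python
import hashlib


class CountMinSketch:
    """Fixed-size frequency counter that may overestimate but never underestimates."""

    def __init__(self, width=1000, depth=5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Derive one hash per row by salting the item with the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum is the tightest bound.
        return min(
            self.table[row][self._index(item, row)] for row in range(self.depth)
        )

    def merge(self, other):
        """Combine two sketches of the same shape by adding their counters."""
        if (self.width, self.depth) != (other.width, other.depth):
            raise ValueError("sketches must have the same width and depth")
        merged = CountMinSketch(self.width, self.depth)
        merged.table = [
            [a + b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(self.table, other.table)
        ]
        return merged


# Two workers count events independently, then their sketches are combined.
worker_a = CountMinSketch()
worker_b = CountMinSketch()
for tag in ["raleigh", "durham", "raleigh"]:
    worker_a.add(tag)
for tag in ["raleigh", "cary"]:
    worker_b.add(tag)

combined = worker_a.merge(worker_b)
print("estimated count for 'raleigh':", combined.estimate("raleigh"))
```

Because merging is just element-wise addition, partial sketches can be combined in any order, which is what makes this kind of structure convenient for distributed and streaming aggregation.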

Dan Kelly presented: “Assessment and Comparison of Predictive Models with Binary Targets”, a practical guide for people who are doing predictive models.

Oscar Boykin is a native of Raleigh. He is currently on the analytics infrastructure team at Twitter, and co-creator of the Twitter open source projects: scalding, algebird, bijection, chill, and summingbird.

Ian Cook presented: “Workshop on submitting R jobs to the cloud”

Bring your laptops, enjoy the wifi and great food, and talk about data!

Social / Networking meeting.

Adam Sobsey presented: “Sabermetrics”

Adam writes for Baseball Prospectus, one of the premier publications for baseball statistics. “Our way of understanding baseball has undergone a revolution during the last generation. The field of baseball study known as “sabermetrics” (based on the acronym of the Society for American Baseball Research) has made huge advances in our approach to the complexity of the game, much of it via more thorough and sophisticated statistical analysis (aided by technological innovations, as well). Among the results of all this study is the essential sabermetric concept of the “Replacement Player.” The Replacement Player is an important but somewhat nebulous platonic ideal. The prevailing agreement is that he is basically good enough to play at the Triple-A minor-league level — the highest level below the major leagues — but does not have the skills to succeed for long stretches in the majors themselves. As it happens, the Durham Bulls are a Triple-A baseball team, all of its players striving to surpass and escape “replacement level” baseball. My talk will discuss some of the ways in which sabermetrics has changed our understanding of the game of baseball for the good, and some of the ways in which that understanding is still a work in progress–all against the very real backdrop of the men playing the game itself.”

John D. Cook: Information is Cheap, Meaning is Expensive: How to Hire and Work With an Analyst (without breaking the bank)

More and more companies are investing in information, through better databases and more robust data tools. Many are finding, however, that extracting meaning from all that information is more difficult than they thought. There are many analysts who can assist, either as freelancers or employees, but how do you know you’re hiring the right talent? Should you hire a full-time analyst or a contractor? How much should you pay? What skills should they have?

John D. Cook has over 20 years of experience applying mathematics to real-world problems. He has worked with firms large and small, using his skills and expertise to turn the data they have into the information they need. During this question and answer session, John will discuss how to connect with the right talent, how to budget for an analysis project, and what to expect from an expert analyst.

Michael Blanks presented: “Open Data & Government”.

John Sall presented: “From Big Data to Big Statistics”.

When you scale up the analysis, you have a lot of issues to address. When you have a lot of data, even a small difference is significant. When you screen a lot of hypotheses, adjusting for selection or multiple test bias is an issue. When you have a lot of bad data, making the analysis automatically robust becomes important. When you have big data, you need to make the computer work fast to get the job done. When you have thousands of results, you need to create compact summaries to show you all the results in one page, or at least produce the results sorted by significance. All these issues need to be resolved and the solutions encapsulated into a workflow for engineers and scientists that deal with more data each year.
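One of the issues listed above, adjusting for multiple tests when screening many hypotheses, can be sketched with the Benjamini-Hochberg false discovery rate adjustment. The abstract does not name a specific adjustment, so this choice, the function name, and the example p-values are assumptions made for illustration.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in the original order."""
    m = len(p_values)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values from screening four hypotheses.
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw)

# Report results sorted by significance, as a compact summary.
for p, q in sorted(zip(raw, adjusted)):
    print(f"raw p = {p:.3f}   adjusted p = {q:.3f}")
```

Each adjusted value is at least as large as the raw one, so a result that looks significant in isolation may no longer pass once the number of hypotheses screened is taken into account.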

John Sall is a co-founder and executive vice president of SAS Institute. He leads the JMP Division of SAS.

Bruce Connor led a discussion on "analytics for polling data".

This discussion focused on the methods behind Nate Silver’s election predictions. Participants were invited to discuss the methods and their experience with other applications of the same methods.

Melinda Thielbar presented: “Data Science is not a Fad. Let’s Keep it That Way”.

This presentation discusses the technical details of data science, in context with time series analysis and statistical modeling. A really good presentation for anyone interested in a hype-free primer on data science.

Linda Schumacher presented: “Running a Kaggle Competition team”

RTA will be organizing a Kaggle team this year! Anyone who is interested in joining the team or just learning more about Kaggle will benefit from this meeting.

Eric Yount presented: “Analytic Methods for Clinical Data”

This will be very informative for those who are primarily working in data mining and business analytics. The techniques Eric will discuss and the reasoning behind them present a different way of looking at data. Clinical trials experts will have an opportunity to discuss the process of collecting and analyzing clinical data.

"Educating Analysts: How Can Schools Prepare Students for a Quantitative Career?"
"Why the Future Will Convert Better" by Martin (Marty) Smith
"Applications of R and R Mini Hack-A-Thon", led by Ian Cook, TIBCO Spotfire
Presentation by MaxPoint Interactive
"Have an Idea, Need an Idea"
First Triangle Analysts Social/Networking Meeting!
