Webinar Q&A: Prep Data Faster for AI and ML - A DataRobot Case Study

In a recent webinar, Paxata's Senior Product Marketing Manager Mike White and I demonstrated Paxata's unique ability to eliminate the most common bottleneck in AI and machine learning.

Eighty percent of the time spent on AI and machine learning efforts is wasted exploring, combining, and shaping data. Three critical but easily overcome obstacles create that problem, and Paxata makes quick work of them all. (The webinar is now available on-demand.)

After the demonstration, Mike and I took some questions, some of which we hear every now and again, so we’re posting them here for the benefit of all who attended.

Q: What types of databases does Paxata support?

A: Paxata can handle nearly every type of structured data: relational databases, Snowflake, Redshift, Hadoop, and Cloudera, to name a few. You can also work with JSON or XML files, or cloud applications like Salesforce or Marketo. Local desktop CSV files can be imported as well.

Q: How does Paxata handle unstructured data?

A: Paxata presents data in a familiar, spreadsheet-like format, so semi-structured data (like the JSON and XML mentioned above) is no problem. Paxata will automatically detect the format and then flatten the data into a tabular view of rows and columns. Once the data is in tabular format, you can shape it further (aggregate or pivot) to your liking. Unstructured data, such as images and plain text, is not currently supported.
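To make the flatten-then-shape idea concrete, here is a minimal sketch in plain Python. It is not Paxata's implementation; the sample records, the dotted column naming, and the `flatten` helper are assumptions made purely for illustration.

```python
import json

def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into one level, joining keys with `sep`."""
    flat = {}
    for key, value in record.items():
        column = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects so each leaf becomes a column.
            flat.update(flatten(value, column, sep))
        else:
            flat[column] = value
    return flat

# Hypothetical semi-structured input (illustrative data only).
raw = """
[
  {"id": 1, "account": {"name": "Acme", "region": "West"}, "amount": 120.0},
  {"id": 2, "account": {"name": "Globex", "region": "East"}, "amount": 75.5},
  {"id": 3, "account": {"name": "Acme", "region": "West"}, "amount": 30.0}
]
"""

# Step 1: flatten every record into a row of a tabular view.
rows = [flatten(record) for record in json.loads(raw)]
columns = sorted({column for row in rows for column in row})

# Step 2: shape the table further, here by totaling amount per account.
totals = {}
for row in rows:
    name = row["account.name"]
    totals[name] = totals.get(name, 0.0) + row["amount"]
```

Each nested object becomes a set of columns (`account.name`, `account.region`), after which ordinary tabular operations such as the per-account total apply.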

Q: Are Paxata datasets unique to a user? Or are they governed? Can they be shared?

A: You have complete control over access to your datasets in the Paxata Library. Datasets can be shared with colleagues easily, assuming they’re granted access. Often, citizen data scientists will reuse previously created datasets as a starting point to then further refine or enrich the data for their models.

Q: Are all actions taken recorded?

A: Yes, Paxata records all data transformations automatically, regardless of how many users work on a given dataset. That’s the backbone of Paxata’s versioning, repeatability, traceability, and data audit functionalities. Plus, versioning and cataloging of raw and prepared datasets support ongoing training and monitoring of models.

Q: What are the system requirements for Paxata?

A: Paxata is a cloud-native, software-as-a-service (SaaS) application, so you can access your Paxata account using any modern web browser. Paxata is also available on Amazon Web Services (AWS) and Microsoft Azure, with other deployment options based on individual customer requirements.

Subscriptions are priced based on data volumes. Some customers need only enough capacity to process a few million rows of data at a time; others process hundreds of millions of rows. Paxata can easily accommodate those loads.

Q: What's the difference between Paxata's intelligent data prep and DataRobot's automated feature engineering?

A: Automated feature engineering is a powerful capability centered around the discovery and extraction of explanatory features from related datasets. DataRobot utilizes embedded AI techniques to perform simple joins of datasets, and also to create additional features from dates (for instance, day, day of week, month, and quarter) that can provide additional insights.
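As a rough illustration of what date-derived features look like, the sketch below expands a single date into the four parts named above. It uses only the Python standard library and is not DataRobot's API; the `date_features` function and its output keys are hypothetical names chosen for this example.

```python
from datetime import date

# Fixed English names so the result does not depend on the system locale.
DAY_NAMES = ("Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday")

def date_features(d):
    """Expand one date into day, day of week, month, and quarter."""
    return {
        "day": d.day,
        "day_of_week": DAY_NAMES[d.weekday()],  # weekday(): Monday is 0
        "month": d.month,
        "quarter": (d.month - 1) // 3 + 1,      # months 1-3 -> 1, ..., 10-12 -> 4
    }

features = date_features(date(2019, 11, 5))
```

A model can then pick up patterns such as weekday effects or end-of-quarter spikes that a raw timestamp column would hide.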

However, automated feature engineering is not designed for adding features that require business-level or domain-specific understanding. It won’t combine datasets that require complex, multi-column joins and fuzzy matching. Also, standardization of categorical variables (variations in company names, for instance) requires Paxata’s data prep engine. Note that feature engineering, while a critical task when preparing machine learning training data, is a component within the broader data prep domain, which includes the often necessary tasks of cleaning and shaping.
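To show the kind of problem that standardizing company names involves, here is a small sketch built on Python's standard `difflib` module. It does not represent Paxata's matching algorithms; the canonical list, the 0.6 similarity cutoff, and the `standardize` helper are assumptions for illustration only.

```python
from difflib import get_close_matches

# Hypothetical list of approved spellings (illustrative only).
CANONICAL = ["Acme Corporation", "Globex Inc", "Initech LLC"]

def standardize(name, canonical=CANONICAL, cutoff=0.6):
    """Map a messy company name to its closest canonical spelling.

    Returns the input unchanged when no candidate reaches the cutoff.
    """
    # Compare case-insensitively, but return the canonical casing.
    lookup = {c.lower(): c for c in canonical}
    matches = get_close_matches(name.strip().lower(), list(lookup),
                                n=1, cutoff=cutoff)
    return lookup[matches[0]] if matches else name

cleaned = [standardize(n) for n in ["ACME Corp.", "Globex, Inc.", "Umbrella Co"]]
```

Choosing the cutoff and the canonical list is where domain knowledge comes in: set the threshold too low and distinct companies get merged, too high and obvious variants are missed.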

Thanks again to all who joined us. For those eager to claim their place among the ranks of AI innovators, you can get started today with a free Paxata trial. Take Paxata’s Intelligent Ingest, automatic data lineage, and AI-powered recommendations for a spin, and you’ll never have to wait weeks for an IT report again.

 


 

About the Author:

Piet Loubser is the VP Product Marketing – Data Preparation. Piet is building out a new category for modern data management with the market-leading Self-Service Data Preparation solution.