DataSet Operations – Selecting Data
Introduction

DataSet operations are used for transforming data and are quite similar to SQL. The operations covered are:

- Selecting Data
- Filtering Data – Applying Conditions
- Ordering data
- Aggregating data – count, sum, max, min, avg
- Joins – Right Outer, Left Outer, Full Outer
- Set operations – union, minus, intersect

All of the following examples use the TPC-H datasets, which are an industry standard for various benchmarks. The data generators are open source and can be downloaded from GitHub. The structure of the data is available on the TPC-H website, or you can go straight to page 13 of the linked document. I have provided samples of the data files at the end of the posts.

DataSet – Read options

Before we get into the details of the actual DataFrame API, let’s understand some of the configurations for reading data from a simple CSV file. There are various options available to read files. Some of them …
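To make the idea of read options concrete, here is a minimal sketch of loading a CSV file with a few reader options set. It assumes the DataSet/DataFrame API in question is Apache Spark's, which the text above does not state explicitly. The file path, the application name and the particular options chosen (`header`, `inferSchema`, `sep`) are illustrative assumptions, not necessarily the ones this post goes on to describe.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ReadCsvSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a local Spark session is enough for experimenting.
    val spark = SparkSession.builder()
      .appName("read-csv-sketch") // hypothetical application name
      .master("local[*]")
      .getOrCreate()

    // Hypothetical path; point it at one of your own TPC-H sample files.
    val path = "data/customer.csv"

    val customers: DataFrame = spark.read
      .option("header", "true")      // treat the first line as column names
      .option("inferSchema", "true") // let Spark guess column types from the data
      .option("sep", ",")            // field delimiter; change it if the file uses another one
      .csv(path)

    customers.printSchema() // inspect the column names and inferred types
    customers.show(5)       // print the first five rows

    spark.stop()
  }
}
```

Inferring the schema makes Spark scan the data an extra time, so for larger files it may be preferable to declare the schema explicitly instead.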