Analyzing data comes with many challenges. In business analytics, these challenges or constraints can markedly affect the quality and timeliness of the analysis, as well as the gap between the expected and actual payoff from the analytical results.
Challenges of Analytical Data Processing-
1) Data Formats- Reading in complete data without losing any part (or metadata) and without adding superfluous details (that increase the scope). Technical constraints of data formats are relatively easy to navigate thanks to ODBC and to well documented, easily searchable syntax and language.
The costs of additional data augmentation (should we pay for additional credit bureau data to be appended?) and the time needed to store and process the data both matter: every extra column adds as many values as the dataset has rows, which becomes time-consuming if you are considering an extra 100 variables with a few million rows. Above all, business relevance and quality guidelines ensure that basic data input and massaging take up a considerable part of the whole analytical project timeline.
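As a rough illustration of the storage side of this cost, the sketch below estimates the memory needed for 100 extra columns. It assumes every added column is stored as a double (8 bytes per value) and takes "a few million rows" to mean three million; both figures are assumptions made for the example, not numbers from a real project.

```r
# Rough memory cost of appending extra numeric columns to a data frame.
# Assumption: every added column is a double (8 bytes per value).
n_rows <- 3e6          # hypothetical reading of "a few million rows"
n_extra_cols <- 100    # the extra 100 variables mentioned in the text
bytes_per_value <- 8

extra_bytes <- n_rows * n_extra_cols * bytes_per_value
extra_gb <- extra_bytes / 1024^3
cat(sprintf("Extra memory: about %.1f GB\n", extra_gb))

# Compare the per-column figure against an actual vector of that length.
one_col <- numeric(n_rows)
print(object.size(one_col), units = "MB")
```

Under these assumptions the extra columns alone need a little over 2 GB, before any copies R makes while processing them.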
2) Data Quality- Perfect data exists in a perfect world. The price of perfect information is one that a business will mostly never budget or wait for. Delivering inferences and results based on summaries of data that has missing, invalid, or outlier values embedded in it makes the role of the analyst just as important as whichever tool is chosen to remove outliers, replace missing values, or treat invalid data.
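A minimal sketch of two such treatments in base R follows. The income vector is made up for illustration, and median imputation and the 1.5 * IQR rule are only one common choice each; whether they are appropriate is exactly the judgement call the analyst has to make.

```r
# Hypothetical numeric variable with one missing value and one extreme value.
income <- c(42, 38, NA, 45, 40, 39, 400, 41)

# Count the missing values.
n_missing <- sum(is.na(income))

# Replace missing values with the median of the observed values.
income_filled <- ifelse(is.na(income), median(income, na.rm = TRUE), income)

# Flag outliers with the 1.5 * IQR rule.
q <- unname(quantile(income_filled, c(0.25, 0.75)))
fence <- 1.5 * (q[2] - q[1])
is_outlier <- income_filled < q[1] - fence | income_filled > q[2] + fence

# Show the flagged values so they can be reviewed before any removal.
income_filled[is_outlier]
```

Flagging rather than deleting keeps the decision about each outlier with the analyst and the business.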
3) Project Scope-
How much data? How much analytical detail versus high-level summary? What are the timelines for delivery and for refreshing the data analysis? What checks (statistical as well as business) are needed?
How easy is it to load and implement the new analysis in the existing information technology infrastructure? These are some of the outer parameters that can limit your analytical project scope, your analytical tool choice, and your processing methodology.
4) Output Results vis-a-vis stakeholder expectation management-
Stakeholders like to see results, not constraints, hypotheses, assumptions, p-values, or chi-square values. Output results need to be streamlined into a decision management process to justify the investment of human time and effort in an analytical project; choosing the analytical tool, training on it, and navigating its complexities and constraints are a subset of that investment. Good use of graphical display helps present results in a form more palatable to stakeholders, provided the graphics are done well.
E.g., Marketing wants more sales, so it needs a clear campaign to target certain customers via specific channels with specified collateral. To support this business judgement, business analytics needs to validate, cross-validate, and sometimes invalidate the business decision with clear, transparent methods and processes.
Given a dataset, the basic analytical steps that an analyst will perform with R are as follows. This is meant as a note for analysts at a beginner level with R.
Package-specific syntax
update.packages() #This updates all installed packages
install.packages("package1") #This installs a package locally, a one-time event; the package name must be quoted
library(package1) #This loads a specified package in the current R session, which needs to be done every R session
CRAN -> LOCAL HARD DISK -> R SESSION is the top-to-bottom hierarchy of package storage and invocation.
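The install-once, load-every-session pattern can be wrapped in a small helper. This is only a sketch: load_or_install is a hypothetical name introduced here, and the example call uses "stats" because it ships with R and so runs without a network connection; in practice you would pass the name of the CRAN package you need.

```r
# Hypothetical helper: install a package only if it is missing from the
# local disk, then load it into the current R session.
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)   # CRAN -> local hard disk (one-time)
  }
  library(pkg, character.only = TRUE)   # local hard disk -> R session
  invisible(pkg %in% loadedNamespaces())
}

# "stats" ships with R, so this call never triggers a download.
load_or_install("stats")

# List the packages attached to the current session.
search()
```

Passing character.only = TRUE is needed because the package name is held in a variable rather than typed directly.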
ls() #This lists all objects or datasets currently active in the R session
> names(assetsCorr) #This gives the names of variables within a dataframe
 "AssetClass" "LargeStocksUS" "SmallStocksUS"
 "CorporateBondsUS" "TreasuryBondsUS" "RealEstateUS"
 "StocksCanada" "StocksUK" "StocksGermany"
 "StocksSwitzerland" "StocksEmergingMarkets"
> str(assetsCorr) #gives complete structure of dataset
'data.frame': 12 obs. of 11 variables:
$ AssetClass : Factor w/ 12 levels "CorporateBondsUS",..: 4 5 2 6 1 12 3 7 11 9 …
$ LargeStocksUS : num 15.3 16.4 1 0 0 …
$ SmallStocksUS : num 13.49 16.64 0.66 1 0 …
$ CorporateBondsUS : num 9.26 6.74 0.38 0.46 1 0 0 0 0 0 …
$ TreasuryBondsUS : num 8.44 6.26 0.33 0.27 0.95 1 0 0 0 0 …
$ RealEstateUS : num 10.6 17.32 0.08 0.59 0.35 …
$ StocksCanada : num 10.25 19.78 0.56 0.53 -0.12 …
$ StocksUK : num 10.66 13.63 0.81 0.41 0.24 …
$ StocksGermany : num 12.1 20.32 0.76 0.39 0.15 …
$ StocksSwitzerland : num 15.01 20.8 0.64 0.43 0.55 …
$ StocksEmergingMarkets: num 16.5 36.92 0.3 0.6 0.12 …
> dim(assetsCorr) #gives dimensions: number of observations and number of variables
 12 11
str(dataset) gives the structure of the dataset (note that the structure shows the names of the variables within the dataset as well as its dimensions)
head(dataset,n1) gives the first n1 rows of the dataset while
tail(dataset,n2) gives the last n2 rows, where n1 and n2 are numbers and dataset is the name of the object (here the data frame being considered)
summary(dataset) gives you a brief summary of all variables while
describe(dataset) gives a detailed description of the variables (describe is not part of base R; it comes from an add-on package such as Hmisc or psych, which must be loaded first)
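To try these commands without preparing a file, the sketch below runs them on mtcars, an example data frame that ships with R. It stands in for the assetsCorr data shown earlier, which is not available here.

```r
# mtcars is a built-in example data frame (32 cars, 11 variables).
data(mtcars)

names(mtcars)         # variable names
dim(mtcars)           # number of rows, then number of columns
str(mtcars)           # names, types, and dimensions in one view

head(mtcars, 3)       # first 3 rows
tail(mtcars, 2)       # last 2 rows

summary(mtcars)       # brief summary of every variable
summary(mtcars$mpg)   # summary of a single variable
```

describe() is left out of the sketch because it needs an add-on package to be installed and loaded first.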
Simple graphics can be produced with functions such as hist(dataset$variable) and plot(dataset).
As you can see in the above cases, there are multiple ways to get even basic analysis of data in R. However, most of the commands are intuitively named (like hist for histogram, t.test for t test, plot for plot).
For detailed analysis throughout the scope of a project, business analytics users are advised to use multiple GUIs and multiple packages. Even for highly specific and specialized analytical tasks, it is worth checking for a GUI that incorporates the required package.
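The three functions just named can be tried on the built-in mtcars data frame. The choice of variables is purely illustrative; note that t.test runs a Welch two-sample test by default when given a formula with a two-level grouping variable.

```r
data(mtcars)

# Histogram of a single numeric variable.
hist(mtcars$mpg, main = "Miles per gallon", xlab = "mpg")

# Scatter plot of two numeric variables.
plot(mtcars$wt, mtcars$mpg, xlab = "Weight (1000 lbs)", ylab = "mpg")

# t test comparing mpg between automatic (am = 0) and manual (am = 1) cars.
tt <- t.test(mpg ~ am, data = mtcars)
tt$p.value   # the p-value alone; print(tt) shows the full report
```

When the script runs non-interactively, R writes the plots to a file (Rplots.pdf by default) instead of opening a window.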