Home » Posts tagged 'interested'
Tag Archives: interested
Here is an interview with Jason Kuo who works with SAP Analytics as Group Solutions Marketing Manager. Jason answers questions on SAP Analytics and it’s increasing involvement with R statistical language.
Ajay- What made you choose R as the language to tie important parts of your technology platform like HANA and SAP Predictive Analysis. Did you consider other languages like Julia or Python.
Jason- It’s the most popular. Over 50% of the statisticians and data analysts use R. With 3,500+ algorithms its arguably the most comprehensive statistical analysis language. That said,we are not closing the door on others.
Ajay- When did you first start getting interested in R as an analytics platform?
Jason- SAP has been tracking R for 5+ years. With R’s explosive growth over the last year or two, it made sense for us to dramatically increase our investment in R.
Ajay- Can we expect SAP to give back to the R community like Google and Revolution Analytics does- by sponsoring Package development or sponsoring user meets and conferences?
Will we see SAP’s R HANA package in this year’s R conference User 2012 in Nashville
Jason- Yes. We plan to provide a specific driver for HANA tables for input of the data to native R. This planned for end of 2012. We’ll then review our event strategy. SAP has been a sponsor of Predictive Analytics World for several years and was indeed a founding sponsor. We may be attending the year’s R conference in Nashville.
Ajay- What has been some of the initial customer feedback to your analytics expansion and offerings.
Jason- We have completed two very successful Pilots of the R Integration for HANA with two of SAP’s largest customers.
Jason has over 15 years of BI and Data Warehousing industry experience. Having worked at Oracle, Business Objects, and now SAP, Jason has been involved in numerous technical marketing roles involving performance management dashboards, information management, text analysis, predictive analytics, and now big data. He has a bachelor’s of science in operations research from the University of Michigan.
Many Data Quality Formats give problems when importing in your statistical software.A statistical software is quite unable to distingush between $1,000, 1000% and 1,000 and 1000 and will treat the former three as character variables while the third as a numeric variable by default. This issue is further compounded by the numerous ways we can represent date-time variables.
The good thing is for specific domains like finance and web analytics, even these weird data input formats are fixed, so we can fix up a list of handy data quality conversion functions in R for reference.
After much muddling about with coverting internet formats (or data used in web analytics) (mostly time formats without date like 00:35:23) into data frame numeric formats, I found that the way to handle Date-Time conversions in R is
The problem with this approach is you will get the value as a Date Time format (02/31/2012 04:00:45- By default R will add today’s date to it.) while you are interested in only Time Durations (4:00:45 or actually just the equivalent in seconds).
this can be handled using the as.difftime function
or to get purely numeric values so we can do numeric analysis (like summary)
(#Maybe there is a more elegant way here- but I dont know)
The kind of data is usually one we get in web analytics for average time on site , etc.
for factor variables
Slight problem is suppose there is data like 1,504 – it will be converted to NA instead of 1504
The way to solve this is use the nice gsub function ONLy on that variable. Since the comma is also the most commonly used delimiter , you dont want to replace all the commas, just only the one in that variable.
Now lets assume we have data in the form of % like 0.00% , 1.23%, 3.5%
again we use the gsub function to replace the % value in the string with (nothing).
If you simply do the following for a factor variable, it will show you the level not the value. This can create an error when you are reading in CSV data which may be read as character or factor data type.
An additional way is to use substr (using substr( and concatenate (using paste) for manipulating string /character variables.
iris$sp=substr(iris$Species,1,3) –will reduce the famous Iris species into three digits , without losing any analytical value.
The other issue is with missing values, and na.rm=T helps with getting summaries of numeric variables with missing values, we need to further investigate how suitable, na.omit functions are for domains which have large amounts of missing data and need to be treated.