Home » Posts tagged 'Data set'
Tag Archives: Data set
I almost missed this because of my vacation and traveling
Rapid Miner has a tonne of new stuff (Statuary Ethics Declaration- Rapid Miner has been an advertising partner for Decisionstats – see the right margin)
Great New Graphical Plotters
and some flashy work
and a great series of educational lectures
A Simple Explanation of Decision Tree Modeling based on Entropies
Description of some of the basics of decision trees. Simple and hardly any math, I like the plots explaining the basic idea of the entropy as splitting criterion (although we actually calculate gain ratio differently than explained…)
Logistic Regression for Business Analytics using RapidMiner
Same as above, but this time for modeling with logistic regression.
Easy to read and covering all basic ideas together with some examples. If you are not familiar with the topic yet, part 1 (see below) might help.
and lastly a new research project for collaborative data mining
e-LICO Architecture and Components
The goal of the e-LICO project is to build a virtual laboratory for interdisciplinary collaborative research in data mining and data-intensive sciences. The proposed e-lab will comprise three layers: the e-science and data mining layers will form a generic research environment that can be adapted to different scientific domains by customizing the application layer.
- Drag a data set into one of the slots. It will be automatically detected as training data, test data or apply data, depending on whether it has a label or not.
- Select a goal. The most frequent one is probably “Predictive Modelling”. All goals have comments, so you see what they can be used for.
- Select “Fetch plans” and wait a bit to get a list of processes that solve your problem. Once the planning completes, select one of the processes (you can see a preview at the right) and run it. Alternatively, select multiple (selecting none means selecting all) and evaluate them on your data in a batch.
The assistant strives to generate processes that are compatible with your data. To do so, it performs a lot of clever operations, e.g., it automatically replaces missing values if missing values exist and this is required by the learning algorithm or performs a normalization when using a distance-based learner.
You can install the extension directly by using the Rapid-I Marketplace instead of the old update server. Just go to the preferences and enter http://rapidupdate.de:8180/UpdateServer as the update URL
Of course Rapid Miner has been of the most professional open source analytics company and they have been doing it for a long time now. I am particularly impressed by the product map (see below) and the graphical user interface.
Just click on the products in the overview below in order to get more information about Rapid-I products.
An amazing example of R being used sucessfully in combination (and not is isolation) with other enterprise software is the add-ins functionality of JMP and it’s R integration.
See the following JMP add-ins which use R
JMP Add-in: Multidimensional Scaling using R
This add-in creates a new menu command under the Add-Ins Menu in the submenu R Add-ins. The script will launch a custom dialog (or prompt for a JMP data table is one is not already open) where you can cast columns into roles for performing MDS on the data table. The analysis results in a data table of MDS dimensions and associated output graphics. MDS is a dimension reduction method that produces coordinates in Euclidean space (usually 2D, 3D) that best represent the structure of a full distance/dissimilarity matrix. MDS requires that input be a symmetric dissimilarity matrix. Input to this application can be data that is already in the form of a symmetric dissimilarity matrix or the dissimilarity matrix can be computed based on the input data (where dissimilarity measures are calculated between rows of the input data table in R).
|Submitted by: Kelci Miclaus||Initiative: All|
|Application: Add-Ins||Analysis: Exploratory Data Analysis|
Chernoff Faces Add-in
One way to plot multivariate data is to use Chernoff faces. For each observation in your data table, a face is drawn such that each variable in your data set is represented by a feature in the face. This add-in uses JMP’s R integration functionality to create Chernoff faces. An R install and the TeachingDemos R package are required to use this add-in.
|Submitted by: Clay Barker||Initiative: All|
|Application: Add-Ins||Analysis: Data Visualization|
Support Vector Machine for Classification
By simply opening a data table, specifying X, Y variables, selecting a kernel function, and specifying its parameters on the user-friendly dialog, you can build a classification model using Support Vector Machine. Please note that R package ‘e1071′ should be installed before running this dialog. The package can be found from http://cran.r-project.org/web/packages/e1071/index.html.
|Submitted by: Jong-Seok Lee||Initiative: All|
|Application: Add-Ins||Analysis: Exploratory Data Analysis/Mining|
Penalized Regression Add-in
This add-in uses JMP’s R integration functionality to provide access to several penalized regression methods. Methods included are the LASSO (least absolutee shrinkage and selection operator, LARS (least angle regression), Forward Stagewise, and the Elastic Net. An R install and the “lars” and “elasticnet” R packages are required to use this add-in.
|Submitted by: Clay Barker||Initiative: All|
|Application: Add-Ins||Analysis: Regression|
MP Addin: Univariate Nonparametric Bootstrapping
This script performs simple univariate, nonparametric bootstrap sampling by using the JMP to R Project integration. A JMP Dialog is built by the script where the variable you wish to perform bootstrapping over can be specified. A statistic to compute for each bootstrap sample is chosen and the data are sent to R using new JSL functionality available in JMP 9. The boot package in R is used to call the boot() function and the boot.ci() function to calculate the sample statistic for each bootstrap sample and the basic bootstrap confidence interval. The results are brought back to JMP and displayed using the JMP Distribution platform.
|Submitted by: Kelci Miclaus||Initiative: All|
|Application: Add-Ins||Analysis: Basic Statistics|
All you contest junkies, R lovers and general change the world people, here’s a new contest to use R in a business application
REVOLUTION ANALYTICS LAUNCHES “APPLICATIONS OF R IN BUSINESS” CONTEST
$20,000 in Prizes for Users Solving Business Problems with R
PALO ALTO, Calif. – September 1, 2011 – Revolution Analytics, the leading commercial provider of R software, services and support, today announced the launch of its “Applications of R in Business” contest to demonstrate real-world uses of applying R to business problems. The competition is open to all R users worldwide and submissions will be accepted through October 31. The Grand Prize winner for the best application using R or Revolution R will receive $10,000.
The bonus-prize winner for the best application using features unique to Revolution R Enterprise – such as itsbig-data analytics capabilities or its Web Services API for R – will receive $5,000. A panel of independent judges drawn from the R and business community will select the grand and bonus prize winners. Revolution Analytics will present five honorable mention prize winners each with $1,000.
“We’ve designed this contest to highlight the most interesting use cases of applying R and Revolution R to solving key business problems, such as Big Data,” said Jeff Erhardt, COO of Revolution Analytics. “The ability to process higher-volume datasets will continue to be a critical need and we encourage the submission of applications using large datasets. Our goal is to grow the collection of online materials describing how to use R for business applications so our customers can better leverage Big Analytics to meet their analytical and organizational needs.”
To enter Revolution Analytics’ “Applications of R in Business” competition (more…)
Here is an interview with Dan Steinberg, Founder and President of Salford Systems (http://www.salford-systems.com/ )
Ajay- Describe your journey from academia to technology entrepreneurship. What are the key milestones or turning points that you remember.
Dan- When I was in graduate school studying econometrics at Harvard, a number of distinguished professors at Harvard (and MIT) were actively involved in substantial real world activities. Professors that I interacted with, or studied with, or whose software I used became involved in the creation of such companies as Sun Microsystems, Data Resources, Inc. or were heavily involved in business consulting through their own companies or other influential consultants. Some not involved in private sector consulting took on substantial roles in government such as membership on the President’s Council of Economic Advisors. The atmosphere was one that encouraged free movement between academia and the private sector so the idea of forming a consulting and software company was quite natural and did not seem in any way inconsistent with being devoted to the advancement of science.
Ajay- What are the latest products by Salford Systems? Any future product plans or modification to work on Big Data analytics, mobile computing and cloud computing.
Dan- Our central set of data mining technologies are CART, MARS, TreeNet, RandomForests, and PRIM, and we have always maintained feature rich logistic regression and linear regression modules. In our latest release scheduled for January 2012 we will be including a new data mining approach to linear and logistic regression allowing for the rapid processing of massive numbers of predictors (e.g., one million columns), with powerful predictor selection and coefficient shrinkage. The new methods allow not only classic techniques such as ridge and lasso regression, but also sub-lasso model sizes. Clear tradeoff diagrams between model complexity (number of predictors) and predictive accuracy allow the modeler to select an ideal balance suitable for their requirements.
The new version of our data mining suite, Salford Predictive Modeler (SPM), also includes two important extensions to the boosted tree technology at the heart of TreeNet. The first, Importance Sampled learning Ensembles (ISLE), is used for the compression of TreeNet tree ensembles. Starting with, say, a 1,000 tree ensemble, the ISLE compression might well reduce this down to 200 reweighted trees. Such compression will be valuable when models need to be executed in real time. The compression rate is always under the modeler’s control, meaning that if a deployed model may only contain, say, 30 trees, then the compression will deliver an optimal 30-tree weighted ensemble. Needless to say, compression of tree ensembles should be expected to be lossy and how much accuracy is lost when extreme compression is desired will vary from case to case. Prior to ISLE, practitioners have simply truncated the ensemble to the maximum allowable size. The new methodology will substantially outperform truncation.
The second major advance is RULEFIT, a rule extraction engine that starts with a TreeNet model and decomposes it into the most interesting and predictive rules. RULEFIT is also a tree ensemble post-processor and offers the possibility of improving on the original TreeNet predictive performance. One can think of the rule extraction as an alternative way to explain and interpret an otherwise complex multi-tree model. The rules extracted are similar conceptually to the terminal nodes of a CART tree but the various rules will not refer to mutually exclusive regions of the data.
Ajay- You have led teams that have won multiple data mining competitions. What are some of your favorite techniques or approaches to a data mining problem.
Dan- We only enter competitions involving problems for which our technology is suitable, generally, classification and regression. In these areas, we are partial to TreeNet because it is such a capable and robust learning machine. However, we always find great value in analyzing many aspects of a data set with CART, especially when we require a compact and easy to understand story about the data. CART is exceptionally well suited to the discovery of errors in data, often revealing errors created by the competition organizers themselves. More than once, our reports of data problems have been responsible for the competition organizer’s decision to issue a corrected version of the data and we have been the only group to discover the problem.
In general, tackling a data mining competition is no different than tackling any analytical challenge. You must start with a solid conceptual grasp of the problem and the actual objectives, and the nature and limitations of the data. Following that comes feature extraction, the selection of a modeling strategy (or strategies), and then extensive experimentation to learn what works best.
Ajay- I know you have created your own software. But are there other software that you use or liked to use?
Dan- For analytics we frequently test open source software to make sure that our tools will in fact deliver the superior performance we advertise. In general, if a problem clearly requires technology other than that offered by Salford, we advise clients to seek other consultants expert in that other technology.
Ajay- Your software is installed at 3500 sites including 400 universities as per http://www.salford-systems.com/company/aboutus/index.html What is the key to managing and keeping so many customers happy?
Dan- First, we have taken great pains to make our software reliable and we make every effort to avoid problems related to bugs. Our testing procedures are extensive and we have experts dedicated to stress-testing software . Second, our interface is designed to be natural, intuitive, and easy to use, so the challenges to the new user are minimized. Also, clear documentation, help files, and training videos round out how we allow the user to look after themselves. Should a client need to contact us we try to achieve 24-hour turn around on tech support issues and monitor all tech support activity to ensure timeliness, accuracy, and helpfulness of our responses. WebEx/GotoMeeting and other internet based contact permit real time interaction.
Ajay- What do you do to relax and unwind?
Dan- I am in the gym almost every day combining weight and cardio training. No matter how tired I am before the workout I always come out energized so locating a good gym during my extensive travels is a must. I am also actively learning Portuguese so I look to watch a Brazilian TV show or Portuguese dubbed movie when I have time; I almost never watch any form of video unless it is available in Portuguese.
Dan Steinberg, President and Founder of Salford Systems, is a well-respected member of the statistics and econometrics communities. In 1992, he developed the first PC-based implementation of the original CART procedure, working in concert with Leo Breiman, Richard Olshen, Charles Stone and Jerome Friedman. In addition, he has provided consulting services on a number of biomedical and market research projects, which have sparked further innovations in the CART program and methodology.
Dr. Steinberg received his Ph.D. in Economics from Harvard University, and has given full day presentations on data mining for the American Marketing Association, the Direct Marketing Association and the American Statistical Association. After earning a PhD in Econometrics at Harvard Steinberg began his professional career as a Member of the Technical Staff at Bell Labs, Murray Hill, and then as Assistant Professor of Economics at the University of California, San Diego. A book he co-authored on Classification and Regression Trees was awarded the 1999 Nikkei Quality Control Literature Prize in Japan for excellence in statistical literature promoting the improvement of industrial quality control and management.
His consulting experience at Salford Systems has included complex modeling projects for major banks worldwide, including Citibank, Chase, American Express, Credit Suisse, and has included projects in Europe, Australia, New Zealand, Malaysia, Korea, Japan and Brazil. Steinberg led the teams that won first place awards in the KDDCup 2000, and the 2002 Duke/TeraData Churn modeling competition, and the teams that won awards in the PAKDD competitions of 2006 and 2007. He has published papers in economics, econometrics, computer science journals, and contributes actively to the ongoing research and development at Salford.
I am still angry with THE netflix for 1 mill I lost out. No sweat! this time the money is 3 times as much, it is legit, and yes baby you can change the world, make it a better place and get rich.! see details below-http://www.heritagehealthprize.com/c/hhp/Data
HERITAGE HEALTH PRIZE DATA FILES
DATA FILES (CLICK TO DOWNLOAD)
You must accept this competition’s rules before you’ll be able to download data files.
IMPORTANT NOTE: The information provided below is intended only to provide general guidance to participants in the Heritage Health Prize Competition and is subject to the Competition Official Rules. Any capitalized term not defined below is defined in the Competition Official Rules. Please consult the Competition Official Rules for complete details.
Heritage Provider Network is providing Competition Entrants with deidentified member data collected during a forty-eight month period that is allocated among three data sets (the “Data Sets”). Competition Entrants will use the Data Sets to develop and test their algorithms for accurately predicting the number of days that the members will spend in a hospital (inpatient or emergency room visit) during the 12-month period following the Data Set cut-off date.
HHP_release2.zip contains the latest files, so you can ignore HHP_release1.zip. SampleEntry.CSV shows you how an entry should look.
Data Sets will be released to Entrants after registration on the Website according to the following schedule:
April 4, 2011 Claims Table – Y1 and DaysInHospital Table – Y2
May 4, 2011
All other Data Sets except Labs Table and Rx Table
The $3 million Heritage Health Prize opens to entries
By now, people have had a good chance to poke around the first portion of the data. Now the fun starts! HPN have released two more years’-worth of data, set the accuracy threshold and are opening up the competition to entries. The data are available from the Heritage Health Prize page. Good luck to all participants!
The Deloitte/FIDE Chess Ratings Competition results
ICDAR 2011 Competition Results
Revolution R Enterprise
As many of you know, Kaggle offers a free platform, Kaggle-in-Class, for instructors who want to host competitions for their students. For those interested in hearing more about the use of Kaggle-in-Class as a teaching tool, Susan Holmes and Nelson Ray from Stanford University share their experience in a webinar organized by the Consortium for the Advancement of Undergraduate Statistics Education.
- Data Mining to change the world of health care (decisionstats.com)
- Heritage Provider Network Announces the Heritage Health Prize Will Include $230,000 in Progress Prizes (prnewswire.com)
- $3.2M in prizes for predicting hospitalization (revolutionanalytics.com)
- Heritage Health Prize: Is $3 million enough to improve the U.S. health care system? (slate.com)