Tag Archives: decisionstats
I was picking up some funny activity on my web analytics, so to make it easier for readers, here is the entire Decisionstats WordPress XML file, zipped. You can download it, unzip it, and then open it in any WordPress reader to read at your leisure.
Updated: There seems to be unusual traffic activity on my poetry blog. To make it more convenient for readers, you can download that as a zipped WordPress XML file here-
I had a chance to dekko the new startup BigML https://bigml.com/ and was suitably impressed by the briefing and my own puttering around the site. Here is my review-
1) The website is very intuitively designed. You can create a dataset from an uploaded file in one click, and you can create a Decision Tree model in one click as well. I wish other cloud computing websites like Google Prediction API made their design as intuitive and easy to understand. Also, unlike Google Prediction API, the models are not black box models, but have a description which can be understood.
2) It includes some well known data sources for people trying it out. They were kind enough to offer 5 invite codes for readers of Decisionstats (if you want to check it out yourself, use the codes below the post; note they are one time only, so the first five get the invites).
BigML is still invite only but plans to get into open release soon.
3) Data Sources can only be by uploading files (csv) but they plan to change this hopefully to get data from buckets (s3? or Google?) and from URLs.
4) The one click operation to convert a data source into a dataset shows a histogram (distribution) of individual variables. The back end is Clojure, which the team explained made the most sense and fits well with Java. The good news (?) is you would never see the Clojure code at the back end. You can read about it at http://clojure.org/
As cloud computing takes off (someday) I expect Clojure's popularity to take off as well.
Clojure is a dialect of Lisp.
5) As of now decision trees are the only distributed algorithm, but they expect to roll out other machine learning stuff soon. Hopefully this includes regression (logit and linear) and k-means clustering. The trees are created and pruned in real time, which gives a slightly animated (and impressive) effect. And yes, model building is a one click operation.
The real-time, live pruning is really impressive, and I wonder how it could ever be replicated in desktop-based software, given its sheer interactive nature.
Making the model is just half the work. Creating predictions and scoring the model is what is really the money-earner. It is one click and customization is quite intuitive. It is not quite PMML compliant yet so I hope some Zemanta like functionality can be added so huge amounts of models can be applied to predictions or score data in real time.
If you are a developer/data hacker, you should check out this section too- it is quite impressive that the designers of BigML have planned for API access so early.
BigML.io gives you:
- Secure programmatic access to all your BigML resources.
- Fully white-box access to your datasets and models.
- Asynchronous creation of datasets and models.
- Near real-time predictions.
Note: For your convenience, some of the snippets below include your real username and API key.
Please keep them secret.
BigML.io conforms to the design principles of Representational State Transfer (REST). BigML.io is entirely HTTP-based.
BigML.io gives you access to four basic resources: Source, Dataset, Model and Prediction. You can create, read, update, and delete resources using the respective standard HTTP methods: POST, GET, PUT and DELETE.
All access to BigML.io must be performed over HTTPS
and https://bigml.com/developers/quick_start (I think an R package that uses JSON and RCurl would further enhance ease of use).
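To make the REST description above a little more concrete, here is a minimal Python sketch of how the four resources and the four HTTP methods line up. It only builds request objects and never sends them. The base URL, the path layout, and the `build_request` helper are my own assumptions for illustration, not BigML's documented endpoints, and authentication (username and API key) is left out entirely.

```python
import json
from urllib.request import Request

# Assumption: illustrative base URL and path layout, not BigML's documented API.
BASE_URL = "https://bigml.io"  # HTTPS only, as the text requires

# The four resources and the CRUD-to-HTTP mapping named in the text.
RESOURCES = {"source", "dataset", "model", "prediction"}
METHODS = {"create": "POST", "read": "GET", "update": "PUT", "delete": "DELETE"}


def build_request(action, resource, resource_id=None, payload=None):
    """Build (but do not send) an HTTPS request for one CRUD action.

    Hypothetical helper: the name, arguments and URL scheme are assumptions.
    """
    if resource not in RESOURCES:
        raise ValueError(f"unknown resource: {resource}")
    if action not in METHODS:
        raise ValueError(f"unknown action: {action}")
    if action != "create" and resource_id is None:
        raise ValueError(f"{action} needs a resource id")

    url = f"{BASE_URL}/{resource}"
    if resource_id is not None:
        url += f"/{resource_id}"

    # Only attach a JSON body when there is something to send.
    body = json.dumps(payload).encode("utf-8") if payload is not None else None
    headers = {"Content-Type": "application/json"} if body is not None else {}
    return Request(url, data=body, headers=headers, method=METHODS[action])
```

For example, `build_request("create", "dataset", payload={"source": "source/123"})` yields a POST request, while `build_request("delete", "model", resource_id="abc")` yields a DELETE. The identifiers in both calls are made up.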
Overall, a welcome addition that makes software in the realm of cloud computing and statistical computation/business analytics both easy to use and easy to deploy, with fail-safe mechanisms built in.
Check out https://bigml.com/ for yourself to see.
The invite codes are here. They are one time use only and the first five get the invites, so click and try your luck at machine learning on the cloud.
If you don't get an invite (or it is already used), just leave your email there and wait a couple of days to get approval.
I almost missed this because of my vacation and traveling
Rapid Miner has a tonne of new stuff (Statutory Ethics Declaration: Rapid Miner has been an advertising partner for Decisionstats; see the right margin)
Great New Graphical Plotters
and some flashy work
and a great series of educational lectures
A Simple Explanation of Decision Tree Modeling based on Entropies
Description of some of the basics of decision trees. Simple and hardly any math, I like the plots explaining the basic idea of the entropy as splitting criterion (although we actually calculate gain ratio differently than explained…)
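For readers who have not seen entropy used as a splitting criterion, here is a small Python sketch using the standard textbook definitions of entropy, information gain and gain ratio. As the lecture description itself notes, RapidMiner calculates gain ratio differently, so treat this as a generic illustration and not as RapidMiner's implementation. The toy labels are made up.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the child nodes."""
    total = len(parent)
    weighted = sum(len(c) / total * entropy(c) for c in children)
    return entropy(parent) - weighted


def gain_ratio(parent, children):
    """Information gain divided by split information (entropy of the partition sizes)."""
    total = len(parent)
    split_info = -sum(len(c) / total * log2(len(c) / total) for c in children if c)
    return information_gain(parent, children) / split_info if split_info > 0 else 0.0


# Toy data, invented for illustration: 4 "yes" and 4 "no" labels.
parent = ["yes"] * 4 + ["no"] * 4
pure_split = [["yes"] * 4, ["no"] * 4]  # each child holds a single class
useless_split = [["yes", "yes", "no", "no"], ["yes", "yes", "no", "no"]]  # children mirror the parent
```

A split that separates the classes perfectly removes all the entropy of the parent, while a split whose children have the same class mix as the parent gains nothing. That is the intuition the plots in the lecture are getting at.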
Logistic Regression for Business Analytics using RapidMiner
Same as above, but this time for modeling with logistic regression.
Easy to read and covering all basic ideas together with some examples. If you are not familiar with the topic yet, part 1 (see below) might help.
and lastly a new research project for collaborative data mining
e-LICO Architecture and Components
The goal of the e-LICO project is to build a virtual laboratory for interdisciplinary collaborative research in data mining and data-intensive sciences. The proposed e-lab will comprise three layers: the e-science and data mining layers will form a generic research environment that can be adapted to different scientific domains by customizing the application layer.
- Drag a data set into one of the slots. It will be automatically detected as training data, test data or apply data, depending on whether it has a label or not.
- Select a goal. The most frequent one is probably “Predictive Modelling”. All goals have comments, so you see what they can be used for.
- Select “Fetch plans” and wait a bit to get a list of processes that solve your problem. Once the planning completes, select one of the processes (you can see a preview at the right) and run it. Alternatively, select multiple (selecting none means selecting all) and evaluate them on your data in a batch.
The assistant strives to generate processes that are compatible with your data. To do so, it performs a lot of clever operations, e.g., it automatically replaces missing values if missing values exist and this is required by the learning algorithm or performs a normalization when using a distance-based learner.
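The kind of conditional preprocessing described above can be sketched in a few lines of Python. This is only a guess at the general idea: the function names, the two boolean flags, mean imputation and z-score normalization are all my assumptions, and the actual extension may well choose different operators.

```python
from statistics import mean, pstdev


def impute_missing(column):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in column]


def z_normalize(column):
    """Rescale a column to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(column), pstdev(column)
    if sigma == 0:
        return [0.0 for _ in column]  # constant column: nothing to scale
    return [(v - mu) / sigma for v in column]


def prepare(column, learner_needs_complete_data, learner_is_distance_based):
    """Apply only the steps the chosen learner requires (hypothetical flags)."""
    # Assumption: distance computations also need complete values.
    needs_complete = learner_needs_complete_data or learner_is_distance_based
    if needs_complete and any(v is None for v in column):
        column = impute_missing(column)
    if learner_is_distance_based:
        column = z_normalize(column)
    return column
```

With these definitions, `prepare([1.0, None, 3.0], True, False)` fills the gap with the mean of the observed values, and passing `True` for the last flag additionally rescales the column so that no single attribute dominates a distance calculation.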
You can install the extension directly by using the Rapid-I Marketplace instead of the old update server. Just go to the preferences and enter http://rapidupdate.de:8180/UpdateServer as the update URL
Of course, Rapid Miner has been one of the most professional open source analytics companies, and they have been doing it for a long time now. I am particularly impressed by the product map (see below) and the graphical user interface.
Just click on the products in the overview below in order to get more information about Rapid-I products.
Here is an interview with JJ Allaire, founder of RStudio. RStudio is the IDE that has overtaken other IDEs within the R community in terms of ease of use. On the eve of their latest product launch, JJ talks to DecisionStats about RStudio and more.
Ajay- So what is new in the latest version of RStudio and how exactly is it useful for people?
JJ- The initial release of RStudio as well as the two follow-up releases we did last year were focused on the core elements of using R: editing and running code, getting help, and managing files, history, workspaces, plots, and packages. In the meantime users have also been asking for some bigger features that would improve the overall work-flow of doing analysis with R. In this release (v0.95) we focused on three of these features:
Projects. R developers tend to have several (and often dozens) of working contexts associated with different clients, analyses, data sets, etc. RStudio projects make it easy to keep these contexts well separated (with distinct R sessions, working directories, environments, command histories, and active source documents), switch quickly between project contexts, and even work with multiple projects at once (using multiple running versions of RStudio).
Version Control. The benefits of using version control for collaboration are well known, but we also believe that solo data analysis can achieve significant productivity gains by using version control (this discussion on Stack Overflow talks about why). In this release we introduced integrated support for the two most popular open-source version control systems: Git and Subversion. This includes changelist management, file diffing, and browsing of project history, all right from within RStudio.
Code Navigation. When you look at how programmers work a surprisingly large amount of time is spent simply navigating from one context to another. Modern programming environments for general purpose languages like C++ and Java solve this problem using various forms of code navigation, and in this release we’ve brought these capabilities to R. The two main features here are the ability to type the name of any file or function in your project and go immediately to it; and the ability to navigate to the definition of any function under your cursor (including the definition of functions within packages) using a keystroke (F2) or mouse gesture (Ctrl+Click).
Ajay- What’s the product road map for RStudio? When can we expect the IDE to turn into a full fledged GUI?
JJ- Linus Torvalds has said that “Linux is evolution, not intelligent design.” RStudio tries to operate on a similar principle—the world of statistical computing is too deep, diverse, and ever-changing for any one person or vendor to map out in advance what is most important. So, our internal process is to ship a new release every few months, listen to what people are doing with the product (and hope to do with it), and then start from scratch again making the improvements that are considered most important.
Right now some of the things which seem to be top of mind for users are improved support for authoring and reproducible research, various editor enhancements including code folding, and debugging tools.
What you’ll see us do in a given release is work on a combination of frequently requested features, smaller improvements to usability and work-flow, bug fixes, and finally architectural changes required to support current or future feature requirements.
While we do try to base what we work on as closely as possible on direct user-feedback, we also adhere to some core principles concerning the overall philosophy and direction of the product. So for example the answer to the question about the IDE turning into a full-fledged GUI is: never. We believe that textual representations of computations provide fundamental advantages in transparency, reproducibility, collaboration, and re-usability. We believe that writing code is simply the right way to do complex technical work, so we’ll always look for ways to make coding better, faster, and easier rather than try to eliminate coding altogether.
Ajay -Describe your journey in science from a high school student to your present work in R. I noticed you have been very successful in making software products that have been mostly proprietary products or sold to companies.
Why did you get into open source products with RStudio? What are your plans for monetizing RStudio further down the line?
JJ- In high school and college my principal areas of study were Political Science and Economics. I also had a very strong parallel interest in both computing and quantitative analysis. My first job out of college was as a financial analyst at a government agency. The tools I used in that job were SAS and Excel. I had a dim notion that there must be a better way to marry computation and data analysis than those tools, but of course no concept of what this would look like.
From there I went more in the direction of general purpose computing, starting a couple of companies where I worked principally on programming languages and authoring tools for the Web. These companies produced proprietary software, which at the time (between 1995 and 2005) was a workable model because it allowed us to build the revenue required to fund development and to promote and distribute the software to a wider audience.
By 2005 it was however becoming clear that proprietary software would ultimately be overtaken by open source software in nearly all domains. The cost of development had shrunken dramatically thanks to both the availability of high-quality open source languages and tools as well as the scale of global collaboration possible on open source projects. The cost of promoting and distributing software had also collapsed thanks to efficiency of both distribution and information diffusion on the Web.
When I heard about R and learned more about it, I became very excited and inspired by what the project had accomplished. A group of extremely talented and dedicated users had created the software they needed for their work and then shared the fruits of that work with everyone. R was a platform that everyone could rally around because it worked so well, was extensible in all the right ways, and most importantly was free (as in speech) so users could depend upon it as a long-term foundation for their work.
So I started RStudio with the aim of making useful contributions to the R community. We started with building an IDE because it seemed like a first-rate development environment for R that was both powerful and easy to use was an unmet need. Being aware that many other companies had built successful businesses around open-source software, we were also convinced that we could make RStudio available under a free and open-source license (the AGPLv3) while still creating a viable business. At this point RStudio is exclusively focused on creating the best IDE for R that we can. As the core product gets where it needs to be over the next couple of years we’ll then also begin to sell other products and services related to R and RStudio.
In 1995 Joseph J. (JJ) Allaire co-founded Allaire Corporation with his brother Jeremy Allaire, creating the web development tool ColdFusion. In March 2001, Allaire was sold to Macromedia where ColdFusion was integrated into the Macromedia MX product line. Macromedia was subsequently acquired by Adobe Systems, which continues to develop and market ColdFusion.
After the sale of his company, Allaire became frustrated at the difficulty of keeping track of research he was doing using Google. To address this problem, he co-founded Onfolio in 2004 with Adam Berrey, former Allaire co-founder and VP of Marketing at Macromedia.
On March 8, 2006, Onfolio was acquired by Microsoft where many of the features of the original product are being incorporated into the Windows Live Toolbar. On August 13, 2006, Microsoft released the public beta of a new desktop blogging client called Windows Live Writer that was created by Allaire’s team at Microsoft.
Starting in 2009, Allaire has been developing a web-based interface to the widely used R technical computing environment. A beta version of RStudio was publicly released on February 28, 2011.
JJ Allaire received his B.A. from Macalester College (St. Paul, MN) in 1991.
RStudio is an integrated development environment (IDE) for R which works with the standard version of R available from CRAN. Like R, RStudio is available under a free software license. RStudio is designed to be as straightforward and intuitive as possible to provide a friendly environment for new and experienced R users alike. RStudio is also a company, and they plan to sell services (support, training, consulting, hosting) related to the open-source software they distribute.
Here is an interview with Dr Ingo Mierswa, CEO of Rapid-I, and Dr Simon Fischer, Head of R&D. Rapid-I makes the very popular software Rapid Miner, perhaps one of the earliest leading open source software packages in business analytics and business intelligence. It is quite easy to use and deploy, and with its extensions and innovations (including compatibility with R) it has continued to grow tremendously through the years.
In an extensive interview Ingo and Simon talk about the algorithms marketplace, extensions, big data analytics, Hadoop, mobile computing and the use of the graphical user interface in analytics.
Special thanks to Nadja from the Rapid-I communication team for helping coordinate this interview. (Statutory Blogging Disclosure: Rapid-I is a marketing partner with Decisionstats as per the terms in http://decisionstats.com/privacy-3/)
Ajay- Describe your background in science. What are the key lessons that you have learnt as a scientific researcher, and what advice would you give to new students today?
Ingo: My time as researcher really was a great experience which has influenced me a lot. I have worked at the AI lab of Prof. Dr. Katharina Morik, one of the persons who brought machine learning and data mining to Europe. Katharina always believed in what we are doing, encouraged us and gave us the space for trying out new things. Funnily enough, I never managed to use my own scientific results in any real-life project so far but I consider this as a quite common gap between science and the “real world”. At Rapid-I, however, we are still heavily connected to the scientific world and try to combine the best of both worlds: solving existing problems with leading-edge technologies.
Simon: In fact, during my academic career I have not worked in the field of data mining at all. I worked on a field some of my colleagues would probably even consider boring, and that is theoretical computer science. To be precise, my research was in the intersection of game theory and network theory. During that time, I have learnt a lot of exciting things, none of which had any business use. Still, I consider that a very valuable experience. When we at Rapid-I hire people coming to us right after graduating, I don’t care whether they know the latest technology with a fancy three-letter acronym – that will be forgotten more quickly than it came. What matters is the way you approach new problems and challenges. And that is also my recommendation to new students: work on whatever you like, as long as you are passionate about it and it brings you forward.
Ajay- How is the Rapid Miner Extensions marketplace moving along? Do you think there is scope for people to, say, create algorithms in a platform like R, and then offer that algorithm as an app for sale just like iTunes or Android apps?
Simon: Well, of course it is not going to be exactly like iTunes or Android apps are, because of the more business-orientated character. But in fact there is a scope for that, yes. We have talked to several developers, e.g., at our user conference RCOMM, and several people would be interested in such an opportunity. Companies using data mining software need supported software packages, not just something they downloaded from some anonymous server, and that is only possible through a platform like the new Marketplace. Besides that, the marketplace will not only host commercial extensions. It is also meant to be a platform for all the developers that want to publish their extensions to a broader community and make them accessible in a comfortable way. Of course they could just place them on their personal Web pages, but who would find them there? From the Marketplace, they are installable with a single click.
Ingo: What I like most about the new Rapid-I Marketplace is the fact that people can now get something back for their efforts. Developing a new algorithm is a lot of work, in some cases even more than developing a nice app for your mobile phone. It is completely accepted that people buy apps from a store for a couple of dollars and I foresee the same for sharing and selling algorithms instead of apps. Right now, people can already share algorithms and extensions for free; one of the next versions will also support selling of those contributions. Let’s see what’s happening next, maybe we will add the option to sell complete RapidMiner workflows or even some data pools…
Ajay- What are the recent features in Rapid Miner that support cloud computing, mobile computing and tablets? How do you think the landscape for Big Data (over 1 TB) is changing, and how is Rapid Miner adapting to it?
Simon: These are areas we are very active in. For instance, we have an In-Database-Mining Extension that allows the user to run their modelling algorithms directly inside the database, without ever loading the data into memory. Using analytic databases like Vectorwise or Infobright, this technology can really boost performance. Our data mining server, RapidAnalytics, already offers functionality to send analysis processes into the cloud. In addition to that, we are currently preparing a research project dealing with data mining in the cloud. A second project is targeted towards the other aspect you mention: the use of mobile devices. This is certainly a growing market, of course not for designing and running analyses, but for inspecting reports and results. But even that is tricky: When you have a large screen you can display fancy and comprehensive interactive dashboards with drill downs and the like. On a mobile device, that does not work, so you must bring your reports and visualizations very much to the point. And this is precisely what data mining can do – and what is hard to do for classical BI.
Ingo: Then there is Radoop, which you may have heard of. It uses the Apache Hadoop framework for large-scale distributed computing to execute RapidMiner processes in the cloud. Radoop has been presented at this year’s RCOMM and people are really excited about the combination of RapidMiner with Hadoop and the scalability this brings.
Ajay- Describe the Rapid Miner analytics certification program and what steps are you taking to partner with academic universities.
Ingo: The Rapid-I Certification Program was created to recognize professional users of RapidMiner or RapidAnalytics. The idea is that certified users have demonstrated a deep understanding of the data analysis software solutions provided by Rapid-I and how they are used in data analysis projects. Taking part in the Rapid-I Certification Program offers a lot of benefits for IT professionals as well as for employers: professionals can demonstrate their skills and employers can make sure that they hire qualified professionals. We started our certification program only about 6 months ago, and about 100 professionals have been certified so far.
Simon: During our annual user conference, the RCOMM, we have plenty of opportunities to talk to people from academia. We’re also present at other conferences, e.g. at ECML/PKDD, and we are sponsoring data mining challenges and grants. We maintain strong ties with several universities all over Europe and the world, which is something that I would not want to miss. We are also cooperating with institutes like the ITB in Dublin during their training programmes, e.g. by giving lectures, etc. Also, we are leading or participating in several national or EU-funded research projects, so we are still close to academia. And we offer an academic discount on all our products :-)
Ajay- Describe the global efforts in making Rapid Miner a truly international software including spread of developers, clients and employees.
Simon: Our clients already are very international. We have a partner network in America, Asia, and Australia, and, while I am responding to these questions, we have a training course in the US. Developers working on the core of RapidMiner and RapidAnalytics, however, are likely to stay in Germany for the foreseeable future. We need specialists for that, and it would be pointless to spread the development team over the globe. That is also owed to the agile philosophy that we are following.
Ingo: Simon is right, Rapid-I already is acting on an international level. Rapid-I now has more than 300 customers from 39 countries in the world, which is a great result for a young company like ours. We are of course very strong in Germany and also the rest of Europe, but also concentrate on more countries by means of our very successful partner network. Rapid-I continues to build this partner network and to recruit dynamic and knowledgeable partners. Extending and acting globally is definitely part of our strategic roadmap.
Dr. Ingo Mierswa is working as Chief Executive Officer (CEO) of Rapid-I. He has several years of experience in project management, human resources management, consulting, and leadership including eight years of coordinating and leading the multi-national RapidMiner developer team with about 30 developers and contributors world-wide. He wrote his PhD titled “Non-Convex and Multi-Objective Optimization for Numerical Feature Engineering and Data Mining” at the University of Dortmund under the supervision of Prof. Morik.
Dr. Simon Fischer is heading the research & development at Rapid-I. His interests include game theory and networks, the theory of evolutionary algorithms (e.g. on the Ising model), and theoretical and practical aspects of data mining. He wrote his PhD in Aachen where he worked in the project “Design and Analysis of Self-Regulating Protocols for Spectrum Assignment” within the excellence cluster UMIC. Before, he was working on the vtraffic project within the DFG Programme 1126 “Algorithms for large and complex networks”.
http://rapid-i.com/content/view/181/190/ tells you more on the various types of Rapid Miner licensing for enterprise, individual and developer versions.
(Note from Ajay- to receive an early edition invite to Radoop, click here http://radoop.eu/z1sxe)
Here is an announcement from Predictive Analytics World, the world's largest vendor neutral conference dedicated to Predictive Analytics alone. Decisionstats has been a blog partner of PAWCON since inception. This is cool stuff!