Tag Archives: R
NumFOCUS: The Python Statistical Community
I really liked the mature design and foundation of this charitable organization. While it is similar to FOAS in many ways (http://www.foastat.org/projects.html), I like the projects. They are excellent, and some of them deserve to be featured in the Journal of Statistical Software (since there is a separate R Journal), unless that journal wants to remain overtly R-focused.
In the same spirit, I think some non-Python projects should try to reach out to NumFOCUS (if it does not want to stay so Py-focused).
Here it is: NumFOCUS
NumFOCUS supports and promotes world-class, innovative, open source scientific software. Most individual projects, even the wildly successful ones, find the overhead of a non-profit to be too large for their community to bear. NumFOCUS provides a critical service as an umbrella organization which removes the burden from the projects themselves to raise money.
Money donated through NumFOCUS goes to sponsor things like:
 Coding sprints (food and travel)
 Technical fellowships (sponsored students and mentors to work on code)
 Equipment grants (to developers and projects)
 Conference attendance for students (to PyData, SciPy, and other conferences)
 Fees for continuous integration and other software engineering tools
 Documentation development
 Webpage hosting and bandwidth fees for projects
Core Projects
NumPy
NumPy is the fundamental package needed for scientific computing with Python. Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data types can be defined, which allows NumPy to seamlessly and speedily integrate with a wide variety of databases. Repositories for NumPy binaries: http://www.lfd.uci.edu/~gohlke/pythonlibs/#numpy; a variety of versions – http://sourceforge.net/projects/numpy/files/NumPy/; version 1.6.1 – http://sourceforge.net/projects/numpy/files/NumPy/1.6.1/.
SciPy
SciPy is open-source software for mathematics, science, and engineering. It is also the name of a very popular conference on scientific programming with Python. The SciPy library depends on NumPy, which provides convenient and fast N-dimensional array manipulation. The SciPy library is built to work with NumPy arrays, and provides many user-friendly and efficient numerical routines such as routines for numerical integration and optimization.
Matplotlib
A 2D plotting library for Python that produces high-quality figures for use in various hardcopy and interactive environments. matplotlib is compatible with Python scripts and the Python and IPython shells.
IPython
A high-quality open source Python shell that includes tools for high-level and interactive parallel computing.
SymPy
SymPy is a Python library for symbolic mathematics. It aims to become a full-featured computer algebra system (CAS) while keeping the code as simple as possible in order to be comprehensible and easily extensible. SymPy is written entirely in Python and does not require any external libraries.
Other Projects
Cython
Cython is a language based on Pyrex that makes writing C extensions for Python as easy as writing them in Python itself. Cython supports calling C functions and declaring C types on variables and class attributes, allowing the compiler to generate very efficient C code from Cython code.
pandas
pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
PyTables
PyTables is a package for managing hierarchical datasets, designed to efficiently and easily cope with extremely large amounts of data. PyTables is built on top of the HDF5 library, using the Python language and the NumPy package. It features a Pythonic interface combined with C/Cython extensions for the performance-critical parts of the code. This makes it a fast yet extremely easy-to-use tool for very large amounts of data. http://pytables.github.com/
scikit-image
A free, high-quality, peer-reviewed, volunteer-produced collection of algorithms for image processing.
scikit-learn
A module designed for scientific Python that provides accessible solutions to machine learning problems.
Scikits-Statsmodels
Statsmodels is a Python package that provides a complement to scipy for statistical computations including descriptive statistics and estimation of statistical models.
Spyder
An interactive development environment for Python that features advanced editing, interactive testing, debugging and introspection capabilities, as well as a numerical computing environment made possible through the support of IPython, NumPy, SciPy, and matplotlib.
Theano
Theano is a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multidimensional arrays efficiently.
Associated Projects
NumFOCUS is currently looking for representatives to enable us to promote the following projects. For information contact us at: info@NumFOCUS.org.
Sage
An open source mathematics software system that combines existing open-source packages into a Python-based interface.
NetworkX
NetworkX is a Python language software package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks.
Python(X,Y)
Free scientific and engineering development software used for numerical computations, and for analysis and visualization of data, using the Python programming language.
Iris for Big Data #rstats #bigdata
Quote of the Day
It is impossible to be a data scientist without knowing iris.
#Anonymous #Quotes
Revolution Analytics has been nice enough to provide both datasets and code for analyzing Big Data in R.
http://www.revolutionanalytics.com/subscriptions/datasets/
http://packages.revolutionanalytics.com/datasets/
The site was updated, so here are the new links.
While the dataset collection is still elementary, as an R instructor I find this list extremely useful. However, I wish they would look at some other repositories and make .xdf and “tidy” CSV versions. A little RODBC usage should help, and so would some descriptions. Maybe they should partner with Quandl, DataMarket, or Infochimps on this initiative rather than do it alone.
Overall this could become an R package (like a Big Data version of the famous datasets package in R).
But it is a nice and very useful effort.
Revolution R Datasets
 AirOnTime87to12/                                    09-Nov-2013 00:46
 AirOnTimeCSV2012/                                   09-Nov-2013 00:30
 AirOnTime2012.xdf                                   08-Nov-2013 18:08    190110335
 AirOnTime7Pct.xdf                                   08-Nov-2013 17:42    103317987
 AirlineData87to08.tar.gz                            03-May-2013 21:05      5521408
 AirlineData87to08.zip                               09-May-2013 14:59      1802240
 AirlineData87to08_11811.tar.gz                      08-Nov-2013 03:27   1428527359
 AirlineData87to08_83010.zip                         08-Nov-2013 06:37   1477052425
 AirlineDataSubsample.xdf                            08-Nov-2013 07:27    390789536
 Census5PCT2000.tar.gz                               08-Nov-2013 10:55    871208970
 Census5PCT2000.zip                                  08-Nov-2013 12:52    925929427
 CensusUS5Pct2000.xdf                                08-Nov-2013 21:27   1204906764
 ccFraud.csv                                         23-Apr-2013 20:57    291737157
 ccFraudScore.csv                                    23-Apr-2013 21:10    273848249
 ccFraudScore10_CreateLoadTableQuotedColumns.fas..>  23-Apr-2013 21:10          981
 ccFraud_CreateLoadTable_QuotedColumns.fastload      23-Apr-2013 21:10          984
 index.php.txt                                       09-May-2013 22:17         3983
 mortDefault.tar.gz                                  08-Nov-2013 12:59     61585580
 mortDefault.zip                                     08-Nov-2013 13:08     63968310
More code
http://blog.revolutionanalytics.com/2013/08/bigdatasetsforr.html
There is also a recent project, made by a student of mine, that uses the Revolution datasets and their blog posts.
Using ifelse in R for creating new variables #rstats #data #manipulation
The ifelse function is simple and powerful and can help in data manipulation within R. Here I create a categorical variable from specific values in a numeric variable:
> data(iris)
> iris$Type=ifelse(iris$Sepal.Length<5.8,"Small Flower","Big Flower")
> table(iris$Type)

  Big Flower Small Flower
          77           73
The parameters of ifelse are quite simple:
Usage
ifelse(test, yes, no)
Arguments
test: an object which can be coerced to logical mode.
yes: return values for true elements of test.
no: return values for false elements of test.
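Two points worth adding: ifelse() is fully vectorized (it evaluates the test on every element at once), and an NA in the test propagates into the result. A minimal sketch; the Size cut-offs 5.1 and 6.4 are arbitrary values chosen only for illustration:

```r
# ifelse() works element-wise on whole vectors; NA in the test gives NA out
data(iris)
iris$Type <- ifelse(iris$Sepal.Length < 5.8, "Small Flower", "Big Flower")
table(iris$Type)

# NA propagation:
ifelse(c(TRUE, FALSE, NA), "yes", "no")

# Nesting ifelse() gives more than two categories (cut-offs are arbitrary):
iris$Size <- ifelse(iris$Sepal.Length < 5.1, "Small",
              ifelse(iris$Sepal.Length < 6.4, "Medium", "Large"))
table(iris$Size)
```

For many categories, cut() is often a cleaner alternative to deeply nested ifelse() calls.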
Using R for Cricket Analysis #rstats #IPL
#Downloading the data for batting across all formats of cricket
library(XML)
url="http://stats.espncricinfo.com/ci/engine/stats/index.html?class=11;template=results;type=batting"
tables=readHTMLTable(url,stringsAsFactors = F)
#Note we wrote stringsAsFactors=F to avoid getting factor variables,
#since we will need to convert these variables to numeric variables
table2=tables$"Overall figures"
rm(tables)
#Creating new variables from Span
table2$Debut=as.numeric(substr(table2$Span,1,4))
table2$LastYr=as.numeric(substr(table2$Span,6,10))
table2$YrsPlayed=table2$LastYr-table2$Debut
#Creating new variables. In cricket a not-out score is denoted by *, which can cause a data quality error.
#This is treated by grepl for finding the * and gsub for removing it.
#Note the double \ to escape the regex character
table2$HSNotOut=grepl("\\*",table2$HS)
table2$HS2=gsub("\\*","",table2$HS)
#Creating a for loop (!) to convert variables to numeric variables
for (i in 3:17) {
  table2[, i] <- as.numeric(table2[, i])
}

And we see why Sachin Tendulkar is the best (by using ggplot via Deducer).
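The Span/HS cleaning steps above can be sanity-checked offline on a toy data frame; the column names mirror the scraped "Overall figures" table, but the two rows here are invented for illustration:

```r
# Toy rows shaped like the scraped table (values invented for illustration)
toy <- data.frame(Span = c("1989-2013", "1971-1987"),
                  HS   = c("248*", "189"),
                  stringsAsFactors = FALSE)

toy$Debut     <- as.numeric(substr(toy$Span, 1, 4))   # first year of Span
toy$LastYr    <- as.numeric(substr(toy$Span, 6, 10))  # last year of Span
toy$YrsPlayed <- toy$LastYr - toy$Debut
toy$HSNotOut  <- grepl("\\*", toy$HS)                 # * marks a not-out score
toy$HS2       <- as.numeric(gsub("\\*", "", toy$HS))  # numeric high score
toy
```

Running this on the real scraped table is the same code with toy replaced by table2.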
Also see
 http://decisionstats.com/2013/04/14/usingrforcricketanalysisrstats/
 http://decisionstats.com/2012/04/07/cricinfostatsgurudatabaseforstatisticalandgraphicalanalysi

Freakonomics Challenge
 Prove match fixing does not and cannot exist in IPL
 Create an ideal fantasy team
Using R for Cricket Analysis #rstats
ESPN Cricinfo is the best site for cricket data (you can see an earlier detailed post on the database here http://decisionstats.com/2012/04/07/cricinfostatsgurudatabaseforstatisticalandgraphicalanalysis/ ), and using the XML package in R we can easily scrape and manipulate the data.
Here is the code.
library(XML)
url="http://stats.espncricinfo.com/ci/engine/stats/index.html?class=1;team=6;template=results;type=batting"
#Note I can also break the url string and use the paste command to modify this url with parameters
tables=readHTMLTable(url)
tables$"Overall figures"
#Since I only got 50 results on each page, I look at the url of the next page
table1=tables$"Overall figures"
url="http://stats.espncricinfo.com/ci/engine/stats/index.html?class=1;page=2;team=6;template=results;type=batting"
tables=readHTMLTable(url)
table2=tables$"Overall figures"
#Now I need to join these two tables vertically
table3=rbind(table1,table2)

Note: I can also automate the web scraping. Now that the data is within R, we can use something like Deducer to visualize it.
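The "break the url string" idea can be sketched as follows. The page parameter pattern comes from the URLs above; sprintf() is used here as an alternative to paste0() for inserting the page number, and the network call itself (which needs a live connection) is shown commented out:

```r
# Build the per-page URLs instead of typing each one
base <- "http://stats.espncricinfo.com/ci/engine/stats/index.html?class=1;page=%d;team=6;template=results;type=batting"
urls <- sprintf(base, 1:2)   # pages 1 and 2
urls

# With library(XML) and a live connection, the pages can then be
# scraped and stacked in one step:
# library(XML)
# pages  <- lapply(urls, function(u) readHTMLTable(u)$"Overall figures")
# table3 <- do.call(rbind, pages)
```

Extending 1:2 to 1:n automates the scraping for any number of result pages.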
R 3.0 launched #rstats
The 3.0 Era for R starts today! Changes include better Big Data support.
Read the NEWS here
 install.packages() has a new argument quiet to reduce the amount of output shown.
 New functions cite() and citeNatbib() have been added, to allow generation of in-text citations from "bibentry" objects. A cite() function may be added to bibstyle() environments.
 merge() works in more cases where the data frames include matrices. (Wish of PR#14974.)
 sample.int() has some support for n >= 2^31: see its help for the limitations. A different algorithm is used for sample.int(n, size, replace = FALSE, prob = NULL) for n > 1e7 and size <= n/2. This is much faster and uses less memory, but does give different results.
 list.files() (aka dir()) gains a new optional argument no.. which allows "." and ".." to be excluded from listings.
 Profiling via Rprof() now optionally records information at the statement level, not just the function level.
 available.packages() gains a "license/restricts_use" filter which retains only packages for which installation can proceed solely based on packages which are guaranteed not to restrict use.
 File ‘share/licenses/licenses.db’ has some clarifications, especially as to which variants of ‘BSD’ and ‘MIT’ is intended and how to apply them to packages. The problematic licence ‘Artistic-1.0’ has been removed.
 The breaks argument in hist.default() can now be a function that returns the breakpoints to be used (previously it could only return the suggested number of breakpoints).
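The new breaks-as-a-function behaviour of hist() can be seen without plotting; the break function here (five equally spaced points) is just an arbitrary illustration:

```r
# Since R 3.0.0, breaks may be a function returning the breakpoints themselves
x <- c(1, 2, 2, 3, 5, 8, 13)
h <- hist(x, breaks = function(v) seq(min(v), max(v), length.out = 5),
          plot = FALSE)
h$breaks   # the breakpoints returned by the function
```

Before 3.0.0, a function passed as breaks could only suggest the number of breakpoints, not the breakpoints themselves.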
LONG VECTORS
This section applies only to 64-bit platforms.
 There is support for vectors longer than 2^31 – 1 elements. This applies to raw, logical, integer, double, complex and character vectors, as well as lists. (Elements of character vectors remain limited to 2^31 – 1 bytes.)
 Most operations which can sensibly be done with long vectors work: others may return the error ‘long vectors not supported yet’. Most of these are because they explicitly work with integer indices (e.g. anyDuplicated() and match()) or because other limits (e.g. of character strings or matrix dimensions) would be exceeded or the operations would be extremely slow.
 length() returns a double for long vectors, and lengths can be set to 2^31 or more by the replacement function with a double value.
 Most aspects of indexing are available. Generally double-valued indices can be used to access elements beyond 2^31 – 1.
 There is some support for matrices and arrays with each dimension less than 2^31 but total number of elements more than that. Only some aspects of matrix algebra work for such matrices, often taking a very long time. In other cases the underlying Fortran code has an unstated restriction (as was found for complex svd()).
 dist() can produce dissimilarity objects for more than 65536 rows (but for example hclust() cannot process such objects).
 serialize() to a raw vector is unlimited in size (except by resources).
 The C-level function R_alloc can now allocate 2^35 or more bytes.
 agrep() and grep() will return double vectors of indices for long vector inputs.
 Many calls to .C() have been replaced by .Call() to allow long vectors to be supported (now or in the future). Regrettably several packages had copied the non-API .C() calls and so failed.
 .C() and .Fortran() do not accept long vector inputs. This is a precaution as it is very unlikely that existing code will have been written to handle long vectors (and the R wrappers often assume that length(x) is an integer).
 Most of the methods for sort() work for long vectors.
 rank(), sort.list() and order() support long vectors (slowly except for radix sorting).
 sample() can do uniform sampling from a long vector.
PERFORMANCE IMPROVEMENTS
 More use has been made of R objects representing registered entry points, which is more efficient as the address is provided by the loader once only when the package is loaded. This has been done for packages base, methods, splines and tcltk: it was already in place for the other standard packages. Since these entry points are always accessed by the R entry points they do not need to be in the load table, which can be substantially smaller and hence searched faster. This does mean that .C/.Fortran/.Call calls copied from earlier versions of R may no longer work – but they were never part of the API.
 Many .Call() calls in package base have been migrated to .Internal() calls.
 solve() makes fewer copies, especially when b is a vector rather than a matrix.
 eigen() makes fewer copies if the input has dimnames.
 Most of the linear algebra functions make fewer copies when the input(s) are not double (e.g. integer or logical).
 A foreign function call (.C() etc.) in a package without a PACKAGE argument will only look in the first DLL specified in the ‘NAMESPACE’ file of the package rather than searching all loaded DLLs. A few packages needed PACKAGE arguments added.
 The @<- operator is now implemented as a primitive, which should reduce some copying of objects when used. Note that the operator object must now be in package base: do not try to import it explicitly from package methods.
SIGNIFICANT USER-VISIBLE CHANGES
 Packages need to be (re)installed under this version (3.0.0) of R.
 There is a subtle change in behaviour for numeric index values 2^31 and larger. These never used to be legitimate and so were treated as NA, sometimes with a warning. They are now legal for long vectors so there is no longer a warning, and x[2^31] <- y will now extend the vector on a 64-bit platform and give an error on a 32-bit one.
 It is now possible for 64-bit builds to allocate amounts of memory limited only by the OS. It may be wise to use OS facilities (e.g. ulimit in a bash shell, limit in csh) to set limits on overall memory consumption of an R process, particularly in a multi-user environment. A number of packages need a limit of at least 4GB of virtual memory to load. 64-bit Windows builds of R are by default limited in memory usage to the amount of RAM installed: this limit can be changed by command-line option --max-mem-size or setting environment variable R_MAX_MEM_SIZE.