Using R for Cricket Analysis #rstats

ESPN Cricinfo is the best site for cricket data (see an earlier detailed post on the database at https://decisionstats.com/2012/04/07/cricinfo-statsguru-database-for-statistical-and-graphical-analysis/ ), and with the XML package in R we can easily scrape and manipulate that data.

Here is the code.

library(XML)
url="http://stats.espncricinfo.com/ci/engine/stats/index.html?class=1;team=6;template=results;type=batting"
#Note I can also break the url string and use paste command to modify this url with parameters
tables=readHTMLTable(url)
tables$"Overall figures"

#Now see this- since I only got 50 results in each page, I look at the url of next page

table1=tables$"Overall figures"
url="http://stats.espncricinfo.com/ci/engine/stats/index.html?class=1;page=2;team=6;template=results;type=batting"
tables=readHTMLTable(url)
table2=tables$"Overall figures"

#Now I need to join these two tables vertically

table3=rbind(table1,table2)

Note: the web scraping can also be automated.
Now that the data is in R, we can use something like Deducer to visualize it.
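One way the automation could look is sketched below. It rebuilds the URL with paste0() for each page and stacks the results with rbind(). The helper names build_url and scrape_pages are made up for this sketch, and it assumes that page=1 returns the same table as the first URL above (which has no page parameter) and that the site still serves these tables in the same layout.

```r
# Base URL copied from the code above; only the page number changes
base_url <- "http://stats.espncricinfo.com/ci/engine/stats/index.html"

# Hypothetical helper: build the URL for a given results page
build_url <- function(page) {
  paste0(base_url, "?class=1;page=", page,
         ";team=6;template=results;type=batting")
}

# Hypothetical helper: read the "Overall figures" table from each page
# and join the pages vertically (needs the XML package and a network connection)
scrape_pages <- function(pages) {
  tables <- lapply(pages, function(p) {
    XML::readHTMLTable(build_url(p))[["Overall figures"]]
  })
  do.call(rbind, tables)
}

# Example call, left commented out because it hits the live site:
# all_pages <- scrape_pages(1:2)
```

Keeping the URL builder separate from the download step makes it easy to check the generated URLs before scraping anything.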


Topic Models in R- search documents for similarity by frequency

From the marvelous Journal of Statistical Software, ignored by mainstream corporatia but beloved of academia, here is one more interesting and very timely paper.

It can be used to grade students' homework, catch terrorists (as in plagiarists), or spot search engine spam linkers. Enjoy!

View this document on Scribd
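To give a feel for "similarity by frequency", here is a minimal base R sketch. It is not a topic model and not the method of the paper; it is a much simpler stand-in that compares documents by the cosine similarity of their word counts. The three toy documents are invented for illustration.

```r
# Hypothetical toy documents, invented for this sketch
docs <- c(a = "the cat sat on the mat",
          b = "the cat sat on the mat today",
          c = "stock prices fell sharply")

# Split each document into lower-case words
tokens <- strsplit(tolower(docs), "[^a-z]+")
vocab <- sort(unique(unlist(tokens)))

# Term-frequency matrix: one row per document, one column per word
tf <- t(sapply(tokens, function(w) table(factor(w, levels = vocab))))

# Cosine similarity between every pair of documents
norms <- sqrt(rowSums(tf^2))
cos_sim <- (tf %*% t(tf)) / (norms %o% norms)

round(cos_sim, 2)
```

Documents a and b share almost all their words, so their similarity is close to 1, while c shares none with either and scores 0. A near-copy of a homework answer would show up the same way.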

Why search optimization can make you like Rebecca Black

A highly optimized blog post or piece of web content can get you a lot of attention, just like Rebecca Black’s video (provided it passes the new quality-metric changes in the search engine).

But if the underlying content is weak, or based on a shoddy understanding of the subject, it can attract lots of horrid comments and ensure that bad word of mouth spreads about the content or about you, despite your hard work.

An example of this is copy-and-paste journalism, especially in technology circles, where even a website or blog with a bigger PageRank can get away with scraping or stealing content from a lower-ranked website (or many websites) after adding a cursory “expert comment”. This is also true when someone who is basically a corporate communication specialist (or PR, public relations, person) is given a technical text and encouraged to write about it without completely understanding it.

A mild technical defect in the search engine algorithm is that it does not seem to pay attention to when the content was published, so the copying website or blog can actually pass as fresher content even if it has practically 90% of the same words. The second flaw is over-punishment, or manual punishment, of excessive linking: this can encourage search-optimization-minded people to hoard links or discourage trackbacks.

A free internet is one which promotes free sharing of content and does not encourage stealing or un-authorized scraping or content copying. Unfortunately current search engine optimization can encourage scraping and content copying without paying too much attention to origin of the words.

In addition, the analytical rigor with which search algorithms search your inbox (as in searching all emails for a keyword) or media-rich sites (like YouTube) is on a different level of quality altogether. The chances of garbage results are much higher when searching media content and/or emails.

R Commercial Software

Revolution Analytics

http://www.revolutionanalytics.com/

Download- http://www.revolutionanalytics.com/downloads/

XL Solutions

http://www.experience-rplus.com/

Download- http://www.experience-rplus.com/down.asp

Information Builders

http://www.informationbuilders.com/products/webfocus/PredictiveModeling

Blue Reference- Inference for R

http://inferenceforr.com/default.aspx

Download- http://inferenceforr.com/freetrial/default.aspx

R for Excel

http://www.statconn.com/

Download- http://rcom.univie.ac.at/download.html

Also integrates R with Word, Open Office and Excel with Scilab

Quick-R and Statmethods.net


I was searching for some basic syntax in R (basically cross tabs and density plots) and I came across the Quick R site.

http://www.statmethods.net/

It's really a nice site for R beginners and anyone trying to remember some syntax.

R syntax can be very simple: a histogram is just hist(), a boxplot is just boxplot(), and a t-test is just t.test(x) for a numeric vector x.

Here is an example from the site-

http://www.statmethods.net/graphs/density.html

# Simple Histogram
hist(mtcars$mpg)

# Colored Histogram with Different Number of Bins
hist(mtcars$mpg, breaks=12, col="red")

# Add a Normal Curve (Thanks to Peter Dalgaard)
x <- mtcars$mpg
h<-hist(x, breaks=10, col="red", xlab="Miles Per Gallon",
   main="Histogram with Normal Curve")
xfit<-seq(min(x),max(x),length=40)
yfit<-dnorm(xfit,mean=mean(x),sd=sd(x))
yfit <- yfit*diff(h$mids[1:2])*length(x)
lines(xfit, yfit, col="blue", lwd=2)

Histograms can be a poor method for determining the shape of a distribution because they are so strongly affected by the number of bins used.
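A quick way to see the effect of the bin count, sketched here with the same mtcars data, is to compute two histograms without drawing them and compare the counts. The values 5 and 12 for breaks are arbitrary choices for illustration; R treats breaks as a suggestion and picks "pretty" cut points near that number.

```r
x <- mtcars$mpg

# Same data, two different bin suggestions; plot = FALSE only computes the bins
few  <- hist(x, breaks = 5,  plot = FALSE)
many <- hist(x, breaks = 12, plot = FALSE)

# Cut points and counts per bin for each version
few$breaks
few$counts
many$breaks
many$counts
```

Both versions contain all 32 cars, but the apparent shape of the distribution depends on where the cut points fall.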

KERNEL DENSITY PLOTS

Kernel density plots are usually a much more effective way to view the distribution of a variable. Create the plot using plot(density(x)) where x is a numeric vector.

# Kernel Density Plot
d <- density(mtcars$mpg) # returns the density data
plot(d) # plots the results

# Filled Density Plot
d <- density(mtcars$mpg)
plot(d, main="Kernel Density of Miles Per Gallon")
polygon(d, col="red", border="blue")

COMPARING GROUPS VIA KERNEL DENSITY

The sm.density.compare( ) function in the sm package allows you to superimpose the kernel density plots of two or more groups. The format is sm.density.compare(x, factor) where x is a numeric vector and factor is the grouping variable.

# Compare MPG distributions for cars with
# 4, 6, or 8 cylinders
library(sm)
attach(mtcars)

# create value labels
cyl.f <- factor(cyl, levels= c(4,6,8),
  labels = c("4 cylinder", "6 cylinder", "8 cylinder"))

# plot densities
sm.density.compare(mpg, cyl, xlab="Miles Per Gallon")
title(main="MPG Distribution by Car Cylinders")

# add legend via mouse click
colfill<-c(2:(2+length(levels(cyl.f))))
legend(locator(1), levels(cyl.f), fill=colfill)
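If the sm package is not installed, or the script has to run without a mouse click for locator(), roughly the same picture can be sketched in base R. This is an alternative under my own assumptions, not the Quick-R code: the colours and the "topright" legend position are arbitrary choices, and the curves use density() defaults, so they may differ slightly from what sm draws.

```r
# Density of mpg within each cylinder group (built-in mtcars data)
groups <- split(mtcars$mpg, mtcars$cyl)
dens <- lapply(groups, density)

# Common axis limits so that no curve is clipped
xlim <- range(sapply(dens, function(d) range(d$x)))
ylim <- c(0, max(sapply(dens, function(d) max(d$y))))

# Empty plot, then one line per group
plot(NULL, xlim = xlim, ylim = ylim,
     xlab = "Miles Per Gallon", ylab = "Density",
     main = "MPG Distribution by Car Cylinders")
cols <- seq_along(dens) + 1
for (i in seq_along(dens)) lines(dens[[i]], col = cols[i], lwd = 2)

# Legend at a fixed position instead of a mouse click
legend("topright", legend = paste(names(dens), "cylinder"), fill = cols)
```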


The Quick-R site is not as exhaustive as http://cran.r-project.org/doc/manuals/R-intro.html but it is much simpler and easier to follow.

The site was created by Robert I. Kabacoff, Ph.D., who is working on a book called “R in Action”. In his words:

I have received numerous requests for a hardcopy version of this site, so over the past year I have been writing a book that takes the material here and significantly expands upon it. If you are interested, early access is available.

If you have not been to that website, I recommend it highly (though the tagline or logo of R for SAS/SPSS/Stata users seems a bit familiar)-http://www.statmethods.net/index.html

Quick-R

for SAS/SPSS/Stata Users


How to balance your online advertising and your offline conscience


I recently found an interesting example of a website that makes a lot of money and yet is much more efficient than any free or non-profit site. It is called Ecosia.

If you see a website that wants to balance administrative costs plus have a transparent way to make the world better- this is a great example.

http://ecosia.org/how.php

HOW IT WORKS
You search with Ecosia.

Perhaps you click on an interesting sponsored link.

The sponsoring company pays Bing or Yahoo for the click.

Bing or Yahoo gives the bigger chunk of that money to Ecosia.

Ecosia donates at least 80% of this income to support WWF’s work in the Amazon.

If you like what we’re doing, help us spread the word!

Key facts about the park:

World’s largest tropical forest reserve (38,867 square kilometers, or about the size of Switzerland)
Home to about 14% of all amphibian species and roughly 54% of all bird species in the Amazon – not to mention large populations of at least eight threatened species, including the jaguar
Includes part of the Guiana Shield containing 25% of world’s remaining tropical rainforests – 80 to 90% of which are still pristine
Holds the last major unpolluted water reserves in the Neotropics, containing approximately 20% of all of the Earth’s water
One of the last tropical regions on Earth vastly unaltered by humans
Significant contributor to climatic regulation via heat absorption and carbon storage

http://ecosia.org/statistics.php

They claim to have donated 141,529.42 EUR !!!

http://static.ecosia.org/files/donations.pdf

Now suppose you are the web admin of a very popular website like Wikipedia.

One way to meet server costs is to say openly: hey, I need to balance my costs, so I need some money.

The other way is to use online advertising.

I started mine with Google AdSense.

Cost per mille (CPM, payment per thousand impressions) gives you a very low conversion compared to contacting an ad sponsor directly.

But it's a great data experiment, as you can monitor:

which companies are likely to be advertised on your site (assume Google knows more about its algorithms than you will)

which formats (banner, text or Flash) have what kind of conversion rates

what the expected payoff rates are from various keywords or companies (business intelligence software, predictive analytics software and statistical computing software are similar but have different expected returns, if you remember your eco class)

Now, based on the above data, you know the minimum baseline to expect from a private advertiser compared with a public, crowd-sourced search engine one (like Google or Bing).

Let's say you have 100,000 page views monthly, and assume one out of every 1,000 page views leads to a click. Say the advertiser pays you $1 for every click (that is, per 1,000 impressions).

Then your expected revenue is $100. But if your clicks are priced at $2.50 each and your click-through rate is now 3 out of 1,000 impressions (both very moderate increases that can be achieved by basic placement optimization of ad type, graphics etc.), your new revenue is $750.

Be a good Samaritan: you decide to share some of this with your audience, say 4 Amazon books per month (or 1 free Amazon book per week). That gives you a cost of $200, and leaves you with some $550.
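The arithmetic above can be checked in a few lines of R. The function name ad_revenue is made up for this sketch; the numbers are the ones used in the text.

```r
# Expected monthly revenue = page views x click-through rate x payment per click
ad_revenue <- function(views, ctr, pay_per_click) {
  views * ctr * pay_per_click
}

baseline  <- ad_revenue(100000, 1 / 1000, 1)     # 1 click per 1,000 views at $1
optimized <- ad_revenue(100000, 3 / 1000, 2.5)   # 3 clicks per 1,000 views at $2.50

# Money left after spending $200 on books for readers
leftover <- optimized - 200

c(baseline = baseline, optimized = optimized, leftover = leftover)
```

Changing the click-through rate or the price per click in the calls shows how sensitive the monthly figure is to each assumption.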

Wait! It doesn't end there. Adam Smith's invisible hand moves on.

You say, hmm, let me put $100 towards an annual paper-writing contest with a $1000 prize, donate $200 to One Laptop per Child (or to the Amazon rainforest, or to Haiti, etc.), pay $100 for your upgraded server hosting, and put $350 into online advertising: say $200 for search engines and $150 for Facebook.

Woah!

Month 1 should see more people visiting you for the first time. If you have a good return rate (returning visitors as a percentage) and a low bounce rate (visits of less than 5 seconds), your traffic should see at least a 20% jump in new arrivals and 5-10% in long-term arrivals. Ignoring bounces, within three months you will have one of the following:

1) An interesting case study on the statistics of online and social media advertising, tangible motivations for increasing community response, and some good data for study

2) Hopefully, better cost management of your server expenses

3) Very hopefully, a positive cash flow

You could even set a percentage and share the monthly (or, better, annual) figures with your readers and advertisers.

Go ahead, change the world!

The key paradigms here are:

sharing your traffic and revenue openly with everyone

donating to a suitable cause

helping increase awareness of that cause

basing donations on fixed percentages rather than absolute numbers, to ensure your site and cause are sustained for years.


The auto-suggest link/tags for WP.com blogs

WordPress.com blogs have a great new option for generating tags and links, and thus improving search engine optimization for posts.

Just go to Users, then Personal Settings, and check the options shown. That's it: every time you write a post, it suggests links and tags. Links are helpful for your readers (like Wikipedia links to explain dense technical jargon, or associated websites). Tags help to classify your content so that all visitors to the website, including spiders, search engines and your readers, can search it better.

The bad thing is that I need to go back to all 1025 posts on this site and auto-generate tags for the archives! Oh well. Great collaboration between Zemanta and Automattic on this new feature.
