Google stuck on Gears
Google has launched support for Droid, the mobile operating system, but forgot to include support for its own browser, Chromium. At least if you can support Internet Explorer and Firefox for Gears, surely you can add support for Gears in Chromium. Maybe with an ad or two.
Since Al Gore invented the internet and sits as a consultant for the California boys, maybe he can also advise them on the antitrust investigations with Apple (cough).
Redlining in Internet Access and notes on Regression Models
This is the definition of redlining. Citation: the ad-free Wikipedia.
Redlining is the practice of denying, or increasing the cost of, services such as banking, insurance, access to jobs, access to health care, or even supermarkets to residents in certain, often racially determined, areas. The term “redlining” was coined in the late 1960s by community activists in Chicago. It describes the practice of marking a red line on a map to delineate the area where banks would not invest; later the term was applied to discrimination against a particular group of people (usually by race or sex) no matter the geography.
As of today, redlining in financial services is outlawed by the Fair Credit Lending Act, which prohibits using variables in regression models that end up redlining districts. However, as late as 2005, redlining was used in auto insurance through suitably disguised zip9 variables (I handled data for 55 million American citizens and 88 million accounts for a major North American automotive insurance provider as part of an offshoring contract from Atlanta, GA in 2005).
Redlining exists today in internet access through informal arrangements between internet service providers who carve up territories and districts. Internet access redlining is still not illegal. This is especially true in Austin (I traveled there as a consultant last year) and Knoxville, Tennessee, where I still study as a grad student.
Neither are the suitably proprietary insurance and health care claim-denial models used to minimize litigation risk. Litigation risk minimization is the next level of retail logistic regression modeling, just like the predictive modeling used by political consultants during elections.
Open Source Webinar with AsterData
Learn how to make money from open source databases, some business intelligence and more business analytics in this webinar here.
FTC disclaimer (even though it is one day before the rules for bloggers come into effect):
AsterData is an advertiser on this blog. See the ad on right.
Google published MapReduce in 2004 as a way to crunch big data faster.
Google is not an advertiser nor partner on this site. They are busy with mobile phones and advertising (like the TV series Mad Men.)
And yes, Sergey Brin needs to finish his PhD too.
Ponder This: IBM Research
Ponder This Challenge:
What is the minimal number, X, of yes/no questions needed to find the smallest (but more than 1*) divisor of a number between 2 and 166 (inclusive)?
We are asking for the exact answer in two cases:
In the worst case, i.e., what is the smallest number X for which we can guarantee finding it in no more than X questions?
On average, i.e., assuming that the number was chosen in uniform distribution from 2 to 166 and we want to minimize the expected number of questions.
* For example, the smallest divisor of 105 is 3, and of 103 is 103.
Update (11/05): You should find the exact divisor without knowing the number and answering “prime” is not a valid
Citation-
http://domino.research.ibm.com/Comm/wwwr_ponder.nsf/pages/index.html
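One way to explore the challenge above numerically is sketched below. It assumes that any yes/no question about the hidden number is allowed, so a questioning strategy is a binary tree whose leaves are the possible smallest divisors; under that assumption the worst case is bounded by the number of distinct answers and the average case by a Huffman tree over their frequencies. The function names are my own, and this is an illustration of the reasoning, not the official solution.

```python
import heapq
from collections import Counter
from fractions import Fraction
from math import ceil, log2


def smallest_divisor(n):
    """Return the smallest divisor of n that is greater than 1."""
    d = 2
    while d * d <= n:
        if n % d == 0:
            return d
        d += 1
    return n  # n is prime, so it is its own smallest divisor


def huffman_weighted_depth(weights):
    """Sum of weight * depth over the leaves of an optimal (Huffman) binary tree."""
    heap = list(weights)
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        a = heapq.heappop(heap)
        b = heapq.heappop(heap)
        total += a + b  # every leaf under this merge gains one more question
        heapq.heappush(heap, a + b)
    return total


# How often each answer occurs when the number is uniform on 2..166.
divisor_counts = Counter(smallest_divisor(n) for n in range(2, 167))

# Worst case: a binary tree needs enough depth to hold every distinct answer.
worst_case = ceil(log2(len(divisor_counts)))

# Average case: expected depth of the Huffman tree, kept as an exact fraction.
average = Fraction(huffman_weighted_depth(divisor_counts.values()),
                   sum(divisor_counts.values()))

print(len(divisor_counts), worst_case, average, float(average))
```

Keeping the average as a `Fraction` matters because the challenge asks for the exact answer, not a rounded one.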
A maths challenge by the boys in Blue above, and also in employment news: the parent company of SPSS is opening a centre of advanced analytics right here in Washington, D.C.
WASHINGTON - 10 Nov 2009: IBM (NYSE: IBM) today announced the opening of the sixth in a network of analytics solution centers – this one dedicated to helping federal agencies and other public sector organizations extract actionable insights from their data.
The new IBM Analytics Solution Center in Washington, D.C., will draw on the expertise of more than 400 IBM professionals. These will include IBM researchers, experts in advanced software platforms, and consultants with deep industry knowledge in areas such as transportation, social services, public safety, customs and border management, revenue management, defense, logistics, healthcare and education. IBM also plans to add an additional 100 professionals, through retraining or new hiring, as demand grows.
Weak Security in Internet Databases for Statisticians
A year ago, while working as a virtual research assistant to Dr Vincent Granville (of Analyticbridge.com, who signed my recommendation form for the University of Tennessee), I helped download almost 22000 records covering almost all the statisticians and economists of the world. This included databases like the American Statistical Association and the Royal Society (ASA, ACME, RS etc).
After joining the University of Tennessee, I emailed a sample of the code and the database I had to two professors (one a fellow of ASA and the other an expert in internet protocols) to make it an academic paper, except they did not know any journal or professor who knew stuff on data scraping.
I am publishing this now in the hope that they have plugged the gap before someone gets that kind of database and exploits it for spamming or commercial misuse.
The weak link was that once you were in the database using a valid login and password, you could use automated HTML capture to do a lot of data scraping using the iMacros Firefox plugin. Since the logins were done on Christmas Eve and during year end, this also used the fact that admins were likely to overlook their analytics logs (if they had software like Clicky or were preserving logs).
Here is the code that was used for scraping the whole database for ASA (note: the scraping was not used by me; it was sent to Dr Granville, and this was an academic research project).
See complete code here- http://docs.google.com/View?id=dcvss358_335dg2xmdcp
1) Use Firefox Browser ( or Download from http://www.mozilla.com/en-US/firefox/ )
2) Install IMacros from https://addons.mozilla.org/en-US/firefox/addon/3863
3) Use the following code, paste in a notepad file and save as “macro1.iim”.
VERSION BUILD=6111213 RECORDER=FX
' Note the ' prefix denotes commented out code
'AUTOMATED ENTRY INTO WEBSITE IN CORRECT POSITION
TAB T=1
'URL GOTO=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
'TAG POS=1:TEXT FORM=NAME:frmLogin ATTR=NAME:txtUser CONTENT=USERNAME
'SET !ENCRYPTION NO
'TAG POS=1:PASSWORD FORM=NAME:frmLogin ATTR=NAME:txtPassword CONTENT=USERPASSWORD
'TAG POS=1:SUBMIT FORM=NAME:frmLogin ATTR=NAME:btnSubmit&&VALUE:Login
'TAG POS=1 ATTR=ID:el34
'ENTER FORM INPUTS
'TAG POS=1 FORM=NAME:frmSearch ATTR=NAME:txtState CONTENT=%CA
'TAG POS=1:TEXT FORM=NAME:frmSearch ATTR=NAME:txtName CONTENT=b
'TAG POS=1:SUBMIT FORM=NAME:frmSearch ATTR=NAME:btnSubmit&&VALUE:Submit
SET !ERRORIGNORE YES
SET !EXTRACT_TEST_POPUP NO
SET !LOOP 1
SET !ERRORIGNORE YES
SET !EXTRACT_TEST_POPUP NO
TAG POS=1 ATTR=TXT:Name
TAG POS=R{{!LOOP}} ATTR=HREF:* EXTRACT=HREF
SET !VAR1 {{!EXTRACT}}
'PROMPT {{!EXTRACT}}
URL GOTO={{!VAR1}}
TAG POS=1 ATTR=TXT:Name
TAG POS=R1 ATTR=TXT:* EXTRACT=TXT
TAG POS=1 ATTR=TXT:Email
TAG POS=R1 ATTR=TXT:* EXTRACT=TXT
'PROMPT {{!EXTRACT}}
BACK
SAVEAS FOLDER=* FILE=*
4) Run the code after logging in and after entering the search inputs: a name (use a single-letter wildcard, say a) and a state from the drop-down.
5) Click Submit to get the number of records.
6) Click the iOpus iMacros button next to the address bar in Firefox and load the macro file above.
7) Run the macro (click the Run Loop button, from 1 to X, where X is the number of records returned in step 5).
Repeat steps 4 to 7 until a single state (which is the group-by variable here) is complete.
8) Go to C:\Documents and Settings\admin\My Documents\iMacros\Downloads (check this path under iMacros settings and options in your installation).
9) Rename the file index as “state.csv”
10) Open CSV file
11) Use the following Office 2003 Macro to clean the file
Sub Macro1()
'
' Macro1 Macro
' Macro recorded 12/22/2008 by ajay
'
' Keyboard Shortcut: Ctrl+q
'
    Cells.Select
    Selection.Replace What:="#NEWLINE#", Replacement:="", LookAt:=xlPart, _
        SearchOrder:=xlByRows, MatchCase:=False, SearchFormat:=False, _
        ReplaceFormat:=False
    Columns("B:B").Select
    Selection.TextToColumns Destination:=Range("B1"), DataType:=xlDelimited, _
        TextQualifier:=xlDoubleQuote, ConsecutiveDelimiter:=True, Tab:=True, _
        Semicolon:=False, Comma:=False, Space:=False, Other:=False, FieldInfo _
        :=Array(Array(1, 9), Array(2, 1)), TrailingMinusNumbers:=True
    Columns("C:C").Select
    Selection.TextToColumns Destination:=Range("C1"), DataType:=xlDelimited, _
        TextQualifier:=xlDoubleQuote, ConsecutiveDelimiter:=True, Tab:=True, _
        Semicolon:=False, Comma:=False, Space:=False, Other:=False, FieldInfo _
        :=Array(Array(1, 9), Array(2, 1)), TrailingMinusNumbers:=True
    Columns("B:B").ColumnWidth = 23.71
    Columns("A:A").EntireColumn.AutoFit
    ActiveWindow.SmallScroll Down:=9
    ActiveWorkbook.Save
End Sub
12) If you have Office 2007, use the Record Macro feature to create your own macro in your Personal Macro Workbook: replace all #NEWLINE# with an empty string (using Ctrl+H), then use Text to Columns on column 2 and column 3 with type Delimited, click Next, tick the check box to treat consecutive delimiters as one, click Next, and choose not to import the first column (by selecting that column).
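For readers without Excel, the same cleaning idea can be sketched in Python. This is only an illustration under assumptions: the sample string is made up, the function name `clean_field` is hypothetical, and it mirrors the macro's three choices (drop the #NEWLINE# marker, treat runs of tabs as one delimiter, skip the first piece) on a single field, not on a whole workbook.

```python
import re

MARKER = "#NEWLINE#"  # marker that the spreadsheet macro removes


def clean_field(text, skip_first=True):
    """Drop the marker, split on runs of tabs, and optionally skip the first piece.

    Splitting on one-or-more tabs mimics treating consecutive delimiters as one;
    skip_first mimics choosing not to import the first column.
    """
    without_marker = text.replace(MARKER, "")
    pieces = [p.strip() for p in re.split(r"\t+", without_marker) if p.strip()]
    return pieces[1:] if skip_first else pieces


# Hypothetical sample of one extracted cell, not real scraped data.
sample = "Name\t\tJane Doe#NEWLINE#"
print(clean_field(sample))
print(clean_field(sample, skip_first=False))
```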
13) To append lots of files into 1 file use the following R Commands
Download R from www.r-project.org
> setwd("C:\\Documents and Settings\\admin\\My Documents\\iMacros\\Downloads")
Note this is the same folder as in Step 8 above
> list.files(path = ".", pattern = NULL, all.files = FALSE, full.names = FALSE,
+ recursive = FALSE, ignore.case = FALSE)
The R output is something like below
> list.files(path = ".", pattern = NULL, all.files = FALSE, full.names = FALSE,
+ recursive = FALSE, ignore.case = FALSE)
 [1] "Automation Robot – Documents – Office Live Workspace" "Book1.xls"
 [3] "cala.csv" "calb.csv"
 [5] "calc.csv" "cald.csv"
 [7] "cale.csv" "calf.csv"
 [9] "calg.csv" "calh.csv"
[11] "cali.csv" "calj.csv"
[13] "calk.csv" "call.csv"
[15] "calm.csv" "caln.csv"
[17] "calo.csv" "calp.csv"
[19] "calq.csv" "calr.csv"
[21] "cals.csv" "calt.csv"
[23] "calu.csv" "calv.csv"
[25] "calw.csv" "calx.csv"
[27] "caly.csv" "calz.csv"
[29] "cola.csv" "colac.csv"
[31] "colad.csv" "colae.csv"
[33] "colaf.csv" "colag.csv"
[35] "coloa.csv" "colob.csv"
[37] "index" "login"
> file.append("coloa.csv","colob.csv")
[1] TRUE
> file.append("coloa.csv","colac.csv")
[1] TRUE
> file.append("coloa.csv","colad.csv")
[1] TRUE
> file.append("coloa.csv","colae.csv")
[1] TRUE
> file.append("coloa.csv","colaf.csv")
[1] TRUE
> file.append("coloa.csv","colag.csv")
[1] TRUE
> file.append("cala.csv","calb.csv")
[1] TRUE
> file.append("cala.csv","calc.csv")
[1] TRUE
> file.append("cala.csv","cald.csv")
[1] TRUE
> file.append("cala.csv","cale.csv")
[1] TRUE
> file.append("cala.csv","calf.csv")
[1] TRUE
> file.append("cala.csv","calg.csv")
[1] TRUE
> file.append("cala.csv","calh.csv")
[1] TRUE
> file.append("cala.csv","cali.csv")
[1] TRUE
> file.append("cala.csv","calj.csv")
[1] TRUE
> file.append("cala.csv","calk.csv")
[1] TRUE
> file.append("cala.csv","call.csv")
[1] TRUE
> file.append("cala.csv","calm.csv")
[1] TRUE
> file.append("cala.csv","caln.csv")
[1] TRUE
> file.append("cala.csv","calo.csv")
[1] TRUE
> file.append("cala.csv","calp.csv")
[1] TRUE
> file.append("cala.csv","calq.csv")
[1] TRUE
> file.append("cala.csv","calr.csv")
[1] TRUE
> file.append("cala.csv","cals.csv")
[1] TRUE
> file.append("cala.csv","calt.csv")
[1] TRUE
> file.append("cala.csv","calu.csv")
[1] TRUE
> file.append("cala.csv","calv.csv")
[1] TRUE
> file.append("cala.csv","calw.csv")
[1] TRUE
> file.append("cala.csv","calx.csv")
[1] TRUE
> file.append("cala.csv","caly.csv")
[1] TRUE
> file.append("cala.csv","calz.csv")
[1] TRUE
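Typing one append call per file gets tedious, so here is a minimal Python sketch of the same append step done by file-name prefix. It is an assumption-laden illustration: the helper name `append_csvs`, the target name `california.csv` and the tiny demo files are all made up, and unlike the R calls it writes into a new target file instead of growing the first file in place.

```python
import shutil
import tempfile
from pathlib import Path


def append_csvs(folder, prefix, target_name):
    """Concatenate every <prefix>*.csv in folder into target_name, in sorted order.

    Returns the names of the files that were appended. Files are copied byte for
    byte, so a source file that lacks a trailing newline will run into the next.
    """
    folder = Path(folder)
    target = folder / target_name
    # Build the list before opening the target, and never append the target to itself.
    sources = sorted(p for p in folder.glob(f"{prefix}*.csv") if p.name != target_name)
    with target.open("wb") as out:
        for src in sources:
            with src.open("rb") as handle:
                shutil.copyfileobj(handle, out)
    return [p.name for p in sources]


# Demo on throwaway files; the names only imitate the per-letter files above.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "cala.csv").write_text("a,1\n")
    (Path(tmp) / "calb.csv").write_text("b,2\n")
    (Path(tmp) / "cola.csv").write_text("c,3\n")
    used = append_csvs(tmp, "cal", "california.csv")
    merged = (Path(tmp) / "california.csv").read_text()

print(used)
print(merged)
```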
ACTUAL EXECUTION TIME REVISED MACRO
This uses multiple tabs (using TAB T=1 and TAB T=2) to switch between tabs. Thus you can search for a big name in Tab 1, while Tab 2 holds the details of the table components (here Name and Email, positioned relatively).
The loop is executed with the Loop button in iMacros.
VERSION BUILD=6111213 RECORDER=FX
TAB T=1
SET !LOOP 1
' Sets the initial value of the loop to start from 1
SET !ERRORIGNORE YES
' Sets errors to be ignored (as in cases when Email is not present) so the rest of the code resumes
SET !EXTRACT_TEST_POPUP NO
' Disables popups. Note popups are useful while creating the code, but they slow down execution.
TAG POS=1 ATTR=TXT:Name
TAG POS=R{{!LOOP}} ATTR=HREF:* EXTRACT=HREF
' Note here the extracted value is the link (HREF) positioned at row {{!LOOP}} (R1 on the first pass), using the text Name (in strong) as the reference
SET !VAR1 {{!EXTRACT}}
' Passes the value of the extract to the variable VAR1
TAB T=2
' Switches to a second tab in Firefox within the same window
URL GOTO={{!VAR1}}
' Goes to the new URL (which is the link of the table constituent, referenced by its name)
TAG POS=1 ATTR=TXT:Name
TAG POS=R1 ATTR=TXT:* EXTRACT=TXT
' Extracts Name
TAG POS=1 ATTR=TXT:Email
TAG POS=R1 ATTR=TXT:* EXTRACT=TXT
' Extracts Email
'ONDIALOG POS=1 BUTTON=OK CONTENT=
' Commented-out section, used when Firefox gives a message to resubmit the data
TAB T=1
' Back to Tab 1, where the search form inputs are present
'BACK
' Commented out: instead of using BACK in the same tab, we move across tabs to avoid submitting the search again and again
SAVEAS FOLDER=* FILE=*
' Downloads the data into the default folder, in the default format (file)
If you are interested in knowing more you can see the Google Docs
http://docs.google.com/View?id=dcvss358_335dg2xmdcp
The declining market for Telecommunication Churn Models
Users of predictive analytics in the telecom sector can look into an interesting side effect of the iPhone and AT&T agreement. With Google also jumping into the market with its Droid, the new norm in telecom agreements is locked-in contracts for consumers. While telecom regulators permit this as fair to competition, it also means that there is very little churn within these locked-in contracts. This leads to further savings for the telecom provider, allowing it to earn higher profits and even share them through price decreases, and thus the traditional bugbear of telecom analytics, churn modeling, is slowly losing importance to plain vanilla reporting or better data mining, dashboard-like solutions. Lower churn also means lower spending on analytics software to predict churn.
As competition within the 3G mobile market ramps up due to Google’s entry and exclusive licensing with partners, the trend toward reduced churn due to locked-in customers will likely increase. Even existing mobile providers can offer discounts to lock in customers against switching (especially in mobile markets like India, where I have personally interacted with large players like Bharti, and China, which has an even bigger mobile market).
Ergo, a lower need to buy software that predicts churn.
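The argument can be put as back-of-envelope arithmetic. The sketch below assumes a deliberately simple linear model that is mine, not the post's: the revenue a churn model retains equals customers times churn rate times the share of churners it saves times the value of a customer. Every number in it is a made-up placeholder, chosen only to show that the payoff shrinks in step with the churn rate.

```python
def churn_model_value(customers, annual_churn_rate, share_saved, value_per_customer):
    """Revenue retained per year by a churn model under a simple linear assumption."""
    return customers * annual_churn_rate * share_saved * value_per_customer


# Hypothetical inputs: 1,000,000 subscribers, the model saves 10% of would-be
# churners, and each retained customer is worth 500 a year.
for rate in (0.30, 0.15, 0.05):
    retained = churn_model_value(1_000_000, rate, 0.10, 500)
    print(f"churn rate {rate:.0%}: model retains about {retained:,.0f} per year")
```

If the yearly cost of the churn software stays fixed while this figure falls, the case for buying it weakens, which is the point made above.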
See Below Image from TeraData’s Churn Model.
Twitter Cloud and a note on Cloud Computing
That’s what I use twitter for. If you have a twitter account you can follow me here
http://twitter.com/decisionstats
A couple of weeks ago I accidentally deleted many followers using a Twitter app called Refollow. I was trying to clean up the people I follow and checked the wrong tick box, so please, if you feel I unfollowed you, it was a mistake. Seriously.
On cloud computing and Google: rumours are emerging that Google’s push for cloud computing is meant to turn desktop computing into IBM-like mainframe computing. Except that there are too many players this time. Where are the Department of Justice and antitrust? Does Amazon currently qualify as too big in cloud computing?
Or the rumours could be spread by Microsoft, Apple, Amazon, competitors, etc. Geeks are like that sometimes.
Creating Customized Packages in SAS Software
It seems there is a little-known component called SAS/TOOLKIT that enables you to create customized SAS commands.
I am still trying to find actual usage of this software, but it can basically be used to create additional customization in SAS. The price is reportedly 12000 USD a year for the Toolkit, but academics could be encouraged to write theses or projects on newer algorithms using standard SAS discounting. In addition, there is no licensing constraint as of now on reselling your customized SAS algorithm (but check with Cary, NC or www.sas.com on this before you go ahead and develop).
So if you have an existing R package (with open source) and someone wants to port it to the SAS language or SAS software, they can simply use the SAS Toolkit to port the algorithm (R algorithms, to my knowledge, are mostly open), or maybe even license it. Specific instances are graphics, Hmisc, Pl.ier, or even lattice and clustering (like mclust) packages.
Citation-http://www.sas.com/products/toolkit/index.html
SAS/TOOLKIT®
SAS/TOOLKIT software enables you to write your own customized SAS procedures (including graphics procedures), informats, formats, functions (including IML and DATA step functions), CALL routines, and database engines in several languages including C, FORTRAN, PL/I, and IBM assembler.
SAS Procedures
A SAS procedure is a program that interfaces with the SAS System to perform a given action. The SAS System provides services to the procedure such as:
- statement processing
- data set management
- memory allocation
SAS Informats, Formats, Functions, and CALL Routines (IFFCs)
You can use SAS/TOOLKIT software to write your own SAS informats, formats, functions, and CALL routines in the same choice of languages: C, FORTRAN, PL/I, and IBM assembler. Like procedures, user-written functions and CALL routines add capabilities to the SAS System that enable you to tailor the system to your site’s specific needs. Many of the same reasons for writing procedures also apply to writing SAS formats and CALL routines.
SAS/TOOLKIT Software and PROC FORMAT
You may wonder why you should use SAS/TOOLKIT software to create user-written formats and informats when base SAS software includes PROC FORMAT. SAS/TOOLKIT software enables you to create formats and informats that perform more than the simple table lookup functions provided by the FORMAT procedure. When you write formats and informats with SAS/TOOLKIT software, you can do the following:
- assign values according to an algorithm instead of looking up a value in a table.
- look up values in a Database to assign formatted values.
Writing a SAS IFFC
The routines you are most likely to use when writing an IFFC perform the following tasks:
- provide a mechanism to interface with functions that are already written at your site
- use algorithms to implement existing programs
- handle problems specific to the SAS environment, such as missing values.
SAS Engines
SAS engines allow data to be presented to the SAS System so it appears to be a standard SAS data set. Engines supplied by SAS Institute consist of a large number of subroutines, all of which are called by the portion of the SAS System known as the engine supervisor.
However, with SAS/TOOLKIT software, an additional level of software, the engine middle-manager, simplifies how you write your user-written engine.
An Engine versus a Procedure
To process data from an external file, you can write either an engine or a SAS procedure. In general, it is a good idea to implement data extraction mechanisms as procedures instead of engines. If your applications need to read most or all of a data file, you should consider creating a procedure, but if they need random access to the file, you should consider creating an engine.
Writing SAS Engines
When you write an engine, you must include in your program a prescribed set of routines to perform the various tasks required to access the file and interact with the SAS System. These routines:
- open and close the data set
- obtain information about variables
- provide information about an external file or database
- read and write observations.
In addition, your program uses several structures defined by the SAS System for storing information needed by the engine and the SAS System. The SAS System interacts with your engine through the SAS engine middle-manager.
Using the USERPROC Procedure
Before you run your grammar, procedure, IFFC, or engine, use SAS/TOOLKIT software’s USERPROC procedure.
- For grammars, the USERPROC procedure produces a grammar function.
- For procedures, IFFCs, and engines, the USERPROC procedure produces a program constants object file, which is necessary for linking all of the compiled object files into an executable module.
Compile and link the output of PROC USERPROC with the SAS System so that the system can access the procedure, IFFC, or engine when a user invokes it.
Using User-Written Procedures, IFFCs, and Engines
After you have created a SAS procedure, IFFC, or engine, you need to tell the SAS System where to find the module in order to run it. You can store your executable modules in any appropriate library. Before you invoke the SAS System, use operating system control language to specify the fileref SASLIB for the directory or load library where your executables are stored. When you invoke the SAS System and use the name of your procedure, IFFC, or engine, the SAS System checks its own libraries first and then looks in the SASLIB library for a module with that name.
Debugging Capabilities
The TLKTDBG facility allows you to obtain debug information concerning SAS routines called by your code, and works with any of the supported programming languages. You can turn this facility on and off without having to recompile or relink your code. Debug messages are sent to the SAS log. In addition to the SAS/TOOLKIT internal debugger, the C language compiler used to create your extension to the SAS System can be used to debug your program. The SAS/C Compiler, the VMS Compiler, and the dbx debugger for AIX can all be used.
NOTE: SAS/TOOLKIT software is used to develop procedures, IFFCs, and engines. Users do not need to license SAS/TOOLKIT software to run procedures developed with the software.
March 2008
Level B support is effective beginning January 1, 2008 until December 31, 2009.
March 2005
The SAS/C and SAS/C++ compiler and runtime components are reclassified as SAS Retired products for z/OS, VM/ESA and cross-compiler platforms. SAS has no plans to develop or deliver a new release of the SAS/C product.
The SAS/C and SAS/C++ family of products provides a versatile development environment for IBM zSeries® and System/390® processors. Enhancements and product features for SAS/C 7.50F include support for z/Architecture instructions and 64-bit addressing, IEEE floating-point, C99 math library and a number of C++ language enhancements and extensions. The SAS/C runtime library, optimizer and debugging environments have been updated and enhanced to fully support the breadth of C/C++ 64-bit addressing, IEEE and C++ product features.
Finally, the SAS/C and SAS/C++ 7.50.06 Cross-compiler products for Windows, Linux, Solaris and Aix incorporate the same enhancements and features that are provided with SAS/C and SAS/C++ 7.50F for z/OS.
Also see- http://support.sas.com/kb/15/647.html
News on R Commercial Development -Rattle- R Data Mining Tool
R RANT: While the European R Core leadership, led by the Great Dane, Peter Dalgaard, focuses on the small picture and is virtually handing the whole commercial side to Prof Nie and David Smith at Revo Computing, other smaller package developers have refused to be treated as cheap R and D developers for enterprise software. How are the book sales coming along, Prof Peter? Any plans to write another R book, or are you done with writing your version of Mathematica (Ref: Newton)? Running the R Core project team must be so hard that I recommend the Tarantino movie “Inglorious B…” for Herr Doktors. END
I believe that individual R package creators like Prof Harrell (Hmisc) or Hadley Wickham (plyr) deserve a share of the royalties or REVENUE earned by Revolution Computing, or ANY software company that uses R.
On this note, some updated news on Rattle, the data mining tool created by Dr Graham Williams. Once again R development is taken ahead by the Down Under chaps while the Big Guys thrash out the road map across the Pond.
Data Mining Resources
Citation -http://datamining.togaware.com/
Rattle is a free and open source data mining toolkit written in the statistical language R using the Gnome graphical interface. It runs under GNU/Linux, Macintosh OS X, and MS/Windows. Rattle is being used in business, government, research and for teaching data mining in Australia and internationally. Rattle can be purchased on DVD (or made available as a downloadable CD image) as a standalone installation for $450USD ($560AUD), using one of the following payment buttons.
The free and open source book, The Data Mining Desktop Survival Guide (ISBN 0-9757109-2-3) simply explains the otherwise complex algorithms and concepts of data mining, with examples to illustrate each algorithm using the statistical language R. The book is being written by Dr Graham Williams, based on his 20 years research and consulting experience in machine learning and data mining. An electronic PDF version is available for a small fee from Togaware ($40AUD/$35USD to cover costs and ongoing development);
Other Resources
- The Data Mining Software Repository makes available a collection of free (as in libre) open source software tools for data mining
- The Data Mining Catalogue lists many of the free and commercial data mining tools that are available on the market.
- The Australasian Data Mining Conferences are supported by Togaware, which also hosts the web site.
- Information about the Pacific Asia Knowledge Discovery and Data Mining series of conferences is also available.
- A Data Mining course is taught at the Australian National University.
- See also the Canberra Analytics Practise Group.
- A Data Mining Course was held at the Harbin Institute of Technology Shenzhen Graduate School, China, 6 December – 13 December 2006. This course introduced the basic concepts and algorithms of data mining from an applications point of view and introduced the use of R and Rattle for data mining in practise.
- A Data Mining Workshop was held over two days at the University of Canberra, 27-28 November, 2006. This course introduced the basic concepts and algorithms for data mining and the use of R and Rattle.
Using R for Data Mining
The open source statistical programming language R (based on S) is in daily use in academia and in business and government. We use R for data mining within the Australian Taxation Office. Rattle is used by those wishing to interact with R through a GUI.
R is memory based so that on 32bit CPUs you are limited to smaller datasets (perhaps 50,000 up to 100,000, depending on what you are doing). Deploying R on 64bit multiple CPU (AMD64) servers running GNU/Linux with 32GB of main memory provides a powerful platform for data mining.
R is open source, thus providing assurance that there will always be the opportunity to fix and tune things that suit our specific needs, rather than rely on having to convince a vendor to fix or tune their product to suit our needs.
Also, by being open source, we can be sure that the code will always be available, unlike some of the data mining products that have disappeared (e.g., IBM’s Intelligent Miner).
See earlier interview-
http://decisionstats.wordpress.com/2009/01/13/interview-dr-graham-williams/
Holiday Fun: Analyzing Facebook Privacy for Ads
So you got a Facebook ID and ticked it in a hurry AND added in your work info. Bad Choice. Even small advertisers like me ( with 225 fans for Decisionstats) can see aggregate numbers of work info BEFORE even advertising.
This can lead to hilarious results-
See Screenshots below- AND note the numbers
1) 400 US females > age 18 work at IBM, SAP, Oracle or Microsoft AND are interested in Women

2) 2940 US females or males > age 18 work at IBM, SAP, Oracle or Microsoft AND are interested in Women

3) 480 US females > age 18 work at IBM, SAP, Oracle or Microsoft AND are interested in Men AND are married

4) 440 US males > age 18 work at IBM, SAP, Oracle or Microsoft AND are interested in Men

5) 40 US males > age 18 work at IBM, SAP, Oracle or Microsoft AND are interested in Men AND are married

Interested in males/females while giving out your work info AND your marital status. I hope these are, ahem, false positives, but seriously, do you think these are violations of privacy or not?
PS: I decided not to advertise after seeing the, err, statistics.
PPS: This is meant to showcase lax ad-related privacy for professionals rather than any individual preference or judgment.
PAWS goes to SF
Conference: message on the LinkedIn group of Decisionstats
Predictive Analytics World, Feb 16-17 in San Francisco
The agenda for Predictive Analytics World – Feb. 16-17 2010 in San Francisco – has been posted: http://www.pawcon.com/sanfrancisco/2010/agenda_overview.php
February’s PAW covers hot topics and advanced methods such as social data, uplift modeling (net lift), text mining, massively parallel analytics, in-cloud deployment, and innovative applications that benefit organizations in new and creative ways.
Be sure to register by December 18 for the Super Early Bird to save $400 off the Regular Price:
http://www.predictiveanalyticsworld.com/register.php
And take an additional $50 off the Super Early Bird with discount code: LIN150
Below is some more info – let me know if you have any questions.
-Eric Siegel, Conference Chair
———–
PAW-2010 includes 25 sessions across two tracks, so you can witness how predictive analytics is applied at 1-800-FLOWERS, Amazon.com, AT&T, BBC, Canadian Automobile Association, Charles Schwab, Continental Airlines, Deutsche Postbank, Google, Group RCI, IBM, PASSUR Aerospace, PayPal (eBay), Sun Microsystems, U.S. Army, Visa, Walmart Financial Services, and Younoodle, plus special examples from the U.S. government agencies CBP, NCMI, NGIC, NSA, and SSA.
Keynote speakers include Kim Larsen, Director Advanced Analytics at Charles Schwab, Andreas S. Weigend, Ph.D., Former Chief Scientist at Amazon.com, and Program Chair Eric Siegel, Ph.D., President of Prediction Impact and former Columbia University professor.
Predictive Analytics World is the business-focused event for predictive analytics professionals, managers and commercial practitioners, covering today’s commercial deployment of predictive analytics, across industries and across software vendors.
For more information, including three pre- and post-event workshops:
http://www.predictiveanalyticsworld.com
Thoughts on WPS, SAS , R
Just as unexpected market segments decided the Betamax and VHS debate, I find that the small business segment is totally compatible with lower-priced software and self-development kits.
What is interesting about the WPS and SAS suit is that WPS now offers development for people wanting to write and extend their own code. That makes the SAS language theoretically as extensible as R packages.
http://www.teamwpc.co.uk/products/wps/modules/sdk
Develop Bespoke Language Items
Anyone with a familiarity of Assembler, C or C++ programming languages, can use the WPS SDK module to create bespoke custom language items for use by WPS.
Once you have created and compiled your own custom language items, you can freely distribute them to anybody who uses WPS on the same platform.
Category of Language Item Support
Below is a list of the type of language item that can be developed with WPS SDK.
Language Item    Comment
Informats        Supported
Formats          Supported
Functions        Supported
Call Routines    Supported
The ability to create the four language items indicated as supported in the table above is known as IFFC support (IFFC is derived from the first letter of the four language items).
Dependencies and Usage
The WPS Core module is required to use WPS SDK.
WPS SDK can be used on any of the supported platforms.
A standard third party C, C++ compiler and/or assembler are required to create the custom language items.
Once a language item has been compiled, neither WPS SDK nor the third party compiler is required. Only WPS Core is needed to run the created language item.
WPS SDK and the language items that are created, can only be used with WPS versions 2.4 or higher.
On the negative side, I am not sure who WPS is. The organizational structure is very secretive. I think that's okay for a small private company, but a true competitor to SAS may not want to lie in the shadows. I think some of it is due to the history of aggressive litigation in this field, and whoever wins the case will end up creating a precedent.
Also, some of R's functionality comes from the design of its command-line programming interfaces and the modular structure of its packages. Think of it this way: what if SAS decided to license each proc individually rather than bundling procs into a new piece of software? That would simply eliminate the not-so-successful procs and also give SAS much faster feedback from customers.
For the small business segment, offering on-demand SAS (or SAS SaaS!) could help reconcile SAS licensing with fears of revenue cannibalization. So if you have fewer than 10 employees and less than 1 million in revenue, you go to SAS on the web, hosted on Amazon EC2 or on the 70 million data centre investment (called cloud computing by SAS Institute in Feb 2009).
An alternative is to offer SAS Learning Edition but ONLY on the web through a Citrix server. This enables tracking usage as most academics rarely need big data capability.
Another alternative would be to track usage of individual language items like procs and macros (if it can be enabled programmatically, using a frequency analysis of the logs and anonymous remote submission of the counts), or to do a web analysis of the SAS Online Doc.
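The frequency analysis of logs mentioned above could look roughly like the sketch below. It is only an illustration, not anything SAS or WPS ships: it assumes the logs contain the usual "NOTE: PROCEDURE <NAME> used" lines, and the function names `count_procs` and `summarize_usage` are made up for this example. Only procedure names and counts are kept, so nothing about the user's data or code leaves the machine.

```python
import re
from collections import Counter

# Assumption: each proc step leaves a log line such as
# "NOTE: PROCEDURE FREQ used (Total process time):".
PROC_NOTE = re.compile(r"^NOTE: PROCEDURE (\w+) used", re.MULTILINE)


def count_procs(log_text):
    """Count how often each procedure appears in one log (hypothetical helper)."""
    return Counter(name.upper() for name in PROC_NOTE.findall(log_text))


def summarize_usage(logs):
    """Add up the counts over many logs; only names and counts are retained."""
    total = Counter()
    for text in logs:
        total.update(count_procs(text))
    return dict(total)


sample_log = (
    "NOTE: PROCEDURE FREQ used (Total process time):\n"
    "      real time           0.03 seconds\n"
    "NOTE: PROCEDURE MEANS used (Total process time):\n"
    "NOTE: PROCEDURE FREQ used (Total process time):\n"
)

usage = summarize_usage([sample_log])
print(usage)
```

The resulting dictionary is the kind of anonymous count that could be submitted remotely; how it would be sent, and to whom, is left open here.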
What I find incredible is that SAS documentation is copyrighted by SAS and yet freely available, when ideally all SAS papers and documentation should be accessible only to SAS license holders, maybe through some secure channel.
Curiously, I am more than happy to try SAS Enterprise Guide (see below); however, it could use a bit of redesign toward a JMP- or Apple-like interface.
How to be a BAD blogger?
Here are some tips on being a BAD blogger. This assumes that -
- you are intelligent enough to know what you are speaking about ( NO-STUPID CLAUSE),
- are otherwise an interesting person in your offline life,
- have a good story to tell about yourself, your product or your company ( NO BORING CLAUSES),
- can spell-check (mostly) (NOT LAZY CLAUSE),
- can create a free account on wordpress.com or have access to a website where you can post material (NOT LAZY AND STUPID CLAUSES)
- AND otherwise have a desire to try and be a good blogger.
Step 1
Credibility
On the Internet everyone is an experienced expert in something.
Ways to wreck credibility-
- Offer ads from Adsense before your blog traffic crosses an average of 100 and a maximum of 200 visitors a day (not views).
- Take offers like free travel, books and software from people, products and companies, don't disclose them, and pump them up with flattering reviews.
- Scratch the back of a fellow blog monkey. Also known as: you praise me in your blog, I will praise you in mine, and we think we fooled everyone that we are just networking.
- Use shock words and images to differentiate.
- Offer ads from non-Adsense advertisers before your traffic crosses an average of 500 and a maximum of 1000 visitors a day (not views).
- Have only ONE advertiser and offer PRIME placement to news of it AND IGNORE corporate rivals completely.
- Claim to know people intimately whom you only know via Facebook Mafia Wars.
- Offer stuff to guest bloggers and forget to follow up on the promise.
- Spam people on email and tell them how you are spamming them to HELP them with NEW stuff.
- Take money from sponsors, and free content from people. Call it aggregation and community. Pocket all the money.
- Accept advertising from pornography. Claim you did not know what it was.
- Give tips on hacking websites. What goes around will never come around, right?
That should wreck your credibility completely. To build up your credibility , do the reverse of the above.
Hard Work
Hard work never killed anyone, but try to blog on boring stuff. Or on politics, guns, gays and religion (preferably at the same time).
- Post a stupid picture of yourself on the about page and tell yourself people don’t care about photos anyway.
- Touch up your photo with Adobe Photoshop, or post an image of yourself 10 years younger (or 10 pounds thinner).
- Choose a bad theme. Like Violet background and yellow font.
- Post images of your kids or your vacation in a professional blog OR /AND post images of your computer or conferences in a personal blog.
- DO NOT SPELL CHECK.
- Use HTML 4.0. Pretend that CSS is a hit TV show.
- Pretend SEO, Tags and Categories are for others. DO NOT make it easy to search your blog.
WRITING
Coleridge was a drug addict. Poe was an alcoholic. Marlowe was killed by a man whom he was treacherously trying to stab. Pope took money to keep a woman’s name out of a satire then wrote a piece so that she could still be recognized anyhow. Chatterton killed himself. Byron was accused of incest. Do you still want to be a writer – and if so, why?
Bennett Cerf (from http://koti.mbnet.fi/pasenka/quotes/q-writ.htm#Writing%20is%20hell)
- Write on politics and guns on a tech blog, or technology on a politics blog.
- Write disjointed sentences in a hurry and claim it’s okay, people won’t notice anyway.
- Write only in text without ANY Images.
- Write 5 posts a day. Or write once in 5 weeks.
- Never explore VIDEO or AUDIO in your blog. Podcasts are for frozen peas.
- Have an ego bigger than your talent. Write about it.
- Be an expert in social media without crossing 1.5 years of blogging, 25,000 unique visitors, or 100,000 views on the Internet. Twitter followers and LinkedIn connections don’t count. Facebook fans don’t count either.
- Generally make an ass of yourself by not editing or not proof reading your posts.
This should generally make sure that you become a BAD blogger, your blog traffic never crosses into two digits a day and you get back to work on your day job which you are probably good at.
If you do that, tell everyone blogs don’t matter in the 2010’s just as websites never mattered in the 1990’s, or Novels in the 1980’s, or TV in the 1950’s or Talking Pictures in the 1930’s.
Yup.
Born in the USA?
Here is some econometric searching I did using Google Public Data, Wolfram Alpha and the Bureau of Labor Statistics.
United States
| Data Series | May 2009 | June 2009 | July 2009 | Aug 2009 | Sept 2009 | Oct 2009 |
|---|---|---|---|---|---|---|
| Unemployment Rate (1) | 9.4 | 9.5 | 9.4 | 9.7 | 9.8 | 10.2 |
| Change in Payroll Employment (2) | -303 | -463 | -304 | -154 | (P) -219 | (P) -190 |
| Average Hourly Earnings (3) | 18.53 | 18.54 | 18.59 | 18.66 | (P) 18.67 | (P) 18.72 |
| Consumer Price Index (4) | 0.1 | 0.7 | 0.0 | 0.4 | 0.2 | 0.3 |
| Producer Price Index (5) | 0.2 | 1.7 | (P) -1.0 | (P) 1.7 | (P) -0.6 | (P) 0.3 |
| U.S. Import Price Index (6) | 1.7 | 2.7 | (R) -0.6 | (R) 1.5 | (R) 0.2 | (R) 0.7 |

Footnotes: (1) In percent, seasonally adjusted. Annual averages are available for Not Seasonally Adjusted data. (2) Number of jobs, in thousands, seasonally adjusted. (3) For production and nonsupervisory workers on private nonfarm payrolls, seasonally adjusted. (4) All items, U.S. city average, all urban consumers, 1982-84=100, 1-month percent change, seasonally adjusted. (5) Finished goods, 1982=100, 1-month percent change, seasonally adjusted. (6) All imports, 1-month percent change, not seasonally adjusted. (R) Revised. (P) Preliminary.

| Data Series | 3rd Qtr 2008 | 4th Qtr 2008 | 1st Qtr 2009 | 2nd Qtr 2009 | 3rd Qtr 2009 |
|---|---|---|---|---|---|
| Employment Cost Index (1) | 0.6 | 0.6 | 0.3 | 0.4 | 0.4 |
| Productivity (2) | -0.1 | 0.8 | 0.3 | 6.9 | 9.5 |

Footnotes: (1) Compensation, all civilian workers, quarterly data, 3-month percent change, seasonally adjusted. (2) Output per hour, nonfarm business, quarterly data, percent change from previous quarter at annual rate, seasonally adjusted.
Also included are the average wages of teachers and the average hourly wages in some offshoring-prone occupations:
http://www.bls.gov/oes/2008/may/oes_nat.htm#b25-0000
http://www.bls.gov/oes/2008/may/oes_nat.htm#b11-0000
and
http://www.google.com/publicdata?ds=usunemployment&met=unemployment_rate&idim=state:ST370000:ST540000:ST510000&tdim=true
WHAT THEY PAY TEACHERS (MAY 2008)
Education, Training, and Library Occupations: Wage Estimates

| Occupation Code | Occupation Title | Employment (1) | Median Hourly | Mean Hourly | Mean Annual (2) | Mean RSE (3) |
|---|---|---|---|---|---|---|
| 25-0000 | Education, Training, and Library Occupations | 8,451,250 | $21.26 | $23.30 | $48,460 | 0.5 % |
| 25-1011 | Business Teachers, Postsecondary | 69,690 | (4) | (4) | $77,340 | 1.0 % |
| 25-1021 | Computer Science Teachers, Postsecondary | 32,520 | (4) | (4) | $74,050 | 1.0 % |
| 25-1022 | Mathematical Science Teachers, Postsecondary | 45,710 | (4) | (4) | $68,130 | 0.9 % |
| 25-1031 | Architecture Teachers, Postsecondary | 6,430 | (4) | (4) | $75,450 | 1.9 % |
| 25-1032 | Engineering Teachers, Postsecondary | 32,070 | (4) | (4) | $90,070 | 1.1 % |
| 25-1041 | Agricultural Sciences Teachers, Postsecondary | 10,000 | (4) | (4) | $77,770 | 1.6 % |
| 25-1042 | Biological Science Teachers, Postsecondary | 51,930 | (4) | (4) | $83,270 | 2.7 % |
WHAT THEY PAY THEMSELVES
Management Occupations: Wage Estimates

| Occupation Code | Occupation Title | Employment (1) | Median Hourly | Mean Hourly | Mean Annual (2) | Mean RSE (3) |
|---|---|---|---|---|---|---|
| 11-0000 | Management Occupations | 6,152,650 | $42.15 | $48.23 | $100,310 | 0.2 % |
| 11-1011 | Chief Executives | 301,930 | $76.23 | $77.13 | $160,440 | 0.5 % |
| 11-1021 | General and Operations Managers | 1,697,690 | $44.02 | $51.91 | $107,970 | 0.2 % |
| 11-1031 | Legislators | 64,650 | (4) | (4) | $37,980 | 1.1 % |
and JOBS PRONE TO SHORTAGE /OFFSHORING
Computer and Mathematical Science Occupations: Wage Estimates

| Occupation Code | Occupation Title | Employment (1) | Median Hourly | Mean Hourly | Mean Annual (2) | Mean RSE (3) |
|---|---|---|---|---|---|---|
| 15-0000 | Computer and Mathematical Science Occupations | 3,308,260 | $34.26 | $35.82 | $74,500 | 0.3 % |
| 15-1011 | Computer and Information Scientists, Research | 26,610 | $47.10 | $48.51 | $100,900 | 1.1 % |
| 15-1021 | Computer Programmers | 394,230 | $33.47 | $35.32 | $73,470 | 0.6 % |
| 15-1031 | Computer Software Engineers, Applications | 494,160 | $41.07 | $42.26 | $87,900 | 0.4 % |
| 15-1032 | Computer Software Engineers, Systems Software | 381,830 | $44.44 | $45.44 | $94,520 | 0.5 % |
| 15-1041 | Computer Support Specialists | 545,520 | $20.89 | $22.29 | $46,370 | 0.3 % |
| 15-1051 | Computer Systems Analysts | 489,890 | $36.30 | $37.90 | $78,830 | 0.4 % |
| 15-1061 | Database Administrators | 115,770 | $33.53 | $35.05 | $72,900 | 0.8 % |
| 15-1071 | Network and Computer Systems Administrators | 327,850 | $31.88 | $33.45 | $69,570 | 0.3 % |
| 15-1081 | Network Systems and Data Communications Analysts | 230,410 | $34.18 | $35.50 | $73,830 | 0.4 % |
| 15-1099 | Computer Specialists, All Other | 191,780 | $36.13 | $36.54 | $76,000 | 0.5 % |
| 15-2011 | Actuaries | 18,220 | $40.77 | $46.14 | $95,980 | 1.4 % |
| 15-2021 | Mathematicians | 2,770 | $45.75 | $45.65 | $94,960 | 1.7 % |
| 15-2031 | Operations Research Analysts | 60,860 | $33.17 | $35.68 | $74,220 | 0.8 % |
| 15-2041 | Statisticians | 20,680 | $34.91 | $35.96 | $74,790 | 1.5 % |
| 15-2091 | Mathematical Technicians | 1,100 | $18.46 | $20.24 | $42,100 | 2.7 % |
| 15-2099 | Mathematical Science Occupations, All Other | 6,600 | $26.44 | $31.55 | $65,630 | 4.3 % |
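
To make the comparison across the three wage tables concrete, here is a small sketch that expresses each group's mean annual wage as a multiple of the education group's. The figures are copied from the May 2008 tables above; the dictionary and the helper name `ratio_to` are just illustrative choices, and the selection of rows is mine.

```python
# Mean annual wages in USD, copied from the May 2008 BLS tables above.
mean_annual = {
    "Education, Training, and Library Occupations": 48_460,
    "Computer and Mathematical Science Occupations": 74_500,
    "Management Occupations": 100_310,
    "Chief Executives": 160_440,
}


def ratio_to(base, wages):
    """Express every wage as a multiple of the base group's wage (hypothetical helper)."""
    base_wage = wages[base]
    return {title: round(wage / base_wage, 2) for title, wage in wages.items()}


ratios = ratio_to("Education, Training, and Library Occupations", mean_annual)
for title, multiple in ratios.items():
    print(f"{title}: {multiple:.2f}x")
```

On these numbers, management occupations average roughly twice the education group's mean annual wage, and chief executives a little over three times.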
UNEMPLOYED IN THE USA (above)
BY STATE (below)
16 million people out of work, give or take a million.
How can America pay unemployment benefits to 5.6 million people,
keep another 10 million unemployed
and another 10 million only partially employed,
and still claim aggregate cost savings from offshoring jobs?
M2009 Interview Peter Pawlowski AsterData
Here is an interview with Peter Pawlowski, who is the MTS for Data Mining at Aster Data. I ran into Peter at the AsterData booth during M2009, and followed up with an email interview. Also included is a presentation that he co-authored.
Ajay- Describe your career in Science leading up till today.
Ajay- How is life working at Aster Data- what are the challenges and the great stuff
Ajay- Do you think Universities offer adequate preparation for in demand skills like Mapreduce, Hadoop and Business Intelligence
Ajay- Describe some of the recent engineering products that you have worked with at Aster
Ajay- All BI companies claim to crunch data the fastest at the lowest price at highest quality as per their marketing brochure- How would you validate your product’s performance scientifically and transparently.
SAS and JMP : Visual Data Discovery
While R packagers have a lot to be proud of in R's graphics packages, the truth of the matter is that the lack of a GUI, even for graphical analysis, makes it harder to adopt R’s powerful graphics for statistical analysis. By contrast, SAS and JMP have been combined in the SAS Visual Data Discovery environment.
I really liked the GUI of JMP (which is very rich in stats testing), and combined with the powerful desktop data handling capabilities of SAS, this is clearly an outstanding effort to create terrific graphics (see below).
Note the combination of the two: great graphics WITH a GUI. In R, the GUI that comes closest to matching JMP is R Commander, but its graphical capabilities are kept basic, as it is not meant to replace the beloved Kommand prompt
(maybe an expanded plugin for graphics or hexbin would help).
It would be interesting to see an on-demand, EC2 cloud-hosted version of Visual Data Discovery by SAS (with JMP as the front end), even as a limited six-month pilot targeted at the SMB segment. Or a Salesforce.com application that integrates Salesforce.com data with the tests and standard procedures in SAS and JMP.
Note of Discontent: the JMP website is terrible. It has a different font from the SAS website (they could at least use the same CSS) and overall is the worst part of the otherwise excellently elegant JMP. I hope they upgrade their website soon (they haven't done it this year, at least).
Screenshot Citation-
http://www.sas.com/technologies/analytics/statistics/datadiscovery/index.html





