Home » Posts tagged 'Spreadsheet'

Tag Archives: Spreadsheet

Interviews and Reviews: More R #rstats

I got interviewed on moving on from Excel to R in Human Resources (HR) here at http://www.hrtecheurope.com/blog/?p=5345

“There is a lot of data out there and it’s stored in different formats. Spreadsheets have their uses but they’re limited in what they can do. The spreadsheet is bad when getting over 5000 or 10000 rows – it slows down. It’s just not designed for that. It was designed for much higher levels of interaction.

In the business world we really don’t need to know every row of data, we need to summarise it, we need to visualise it and put it into a powerpoint to show to colleagues or clients.”

And a more recent interview with my fellow IIML mate, and editor at Analytics India Magazine

http://analyticsindiamag.com/interview-ajay-ohri-author-r-for-business-analytics/

AIM: Which R packages do you use the most and which ones are your favorites?

AO: I use R Commander and Rattle a lot, and I use the dependent packages. I use car for regression, and forecast for time series, and many packages for specific graphs. I have not mastered ggplot though but I do use it sometimes. Overall I am waiting for Hadley Wickham to come up with an updated book to his ecosystem of packages as they are very formidable, completely comprehensive and easy to use in my opinion, so much I can get by the occasional copy and paste code.

 

A surprising review at R- Bloggers.com /Intelligent Trading

http://intelligenttradingtech.blogspot.in/2012/10/book-review-r-for-business-analytics.html

The good news is that many of the large companies do not view R as a threat, but as a beneficial tool to assist their own software capabilities.

After assisting and helping R users navigate through the dense forest of various GUI interface choices (in order to get R up and running), Mr. Ohri continues to handhold users through step by step approaches (with detailed screen captures) to run R from various simple to more advanced platforms (e.g. CLOUD, EC2) in order to gather, explore, and process data, with detailed illustrations on how to use R’s powerful graphing capabilities on the back-end.

Do you want to write a review too? You can visit the site here

http://www.springer.com/statistics/book/978-1-4614-4342-1

 

Reading Google Docs using R Curl

 

and finally I can download my Google spreadsheet file using-

require(RCurl)

download.file(url=”http://curl.haxx.se/ca/cacert.pem”, destfile=”cacert.pem”)

url=”https://docs.google.com/spreadsheet/pub?key=0AtYMMvghK2ytcldUcWNNZTltcXdIZUZ2MWU0R1NfeWc&output=csv”

b <- getURL(url,cainfo=”cacert.pem”)

write.table(b,quote = FALSE, sep = “,”,file=”test.csv”)

 

Previously (for past 3 5 7 hours)-

The codes at http://blog.revolutionanalytics.com/2011/09/using-google-spreadsheets-with-r-an-update.html dont work thanks to the SSL authentication issue. and the packages at [http://www.omegahat.org/RGoogleDocs/] and [https://r-forge.r-project.org/projects/rgoogledata/] are missing in action. 

So I mixed the codes at http://blog.revolutionanalytics.com/2009/09/how-to-use-a-google-spreadsheet-as-data-in-r.html and http://www.brocktibert.com/blog/2012/01/19/358/

and I get this error while using http://thebiobucket.blogspot.in/2012/03/r-function-to-read-data-from-google.html#more

Error in read.table(file = file, header = header, sep = sep, quote = quote, :
more columns than column names

 

Google Refine

An interesting data cleaning software from Google at

https://code.google.com/p/google-refine/

From the page at

https://code.google.com/p/google-refine/wiki/UserGuide

The Basics

First, although Google Refine might start out looking like a spreadsheet program (Microsoft Excel, Google Spreadsheets, etc.), don’t expect it to work like a spreadsheet program. That’s almost like expecting a database to work like a text editor.

Google Refine is NOT for entering new data one cell at a time. It is NOT for doing accounting.

Google Refine is for applying transformations over many existing cells in bulk, for the purpose of cleaning up the data, extending it with more data from other sources, and getting it to some form that other tools can consume.

To use Google Refine, think in big patterns. For example, to spot errors, think

  • Show me every row where the string length of the customer’s name is longer than 50 characters (because I suspect that the customer’s address is mistakenly included in the name field)
  • Show me every row where the contract fee is less than 1 (because I suspect the fee was entered in unit of thousand dollars rather than dollars)
  • Show me every row where the description field (scraped from some web site) contains “&” (because I suspect it wasn’t decoded properly)

To edit data, think

  • For every row where the contract fee is less than 1, multiply the fee by 1000.
  • For every row where the customer name contains a comma (it has been entered as “last_name, first_name”), split the name by the comma, reverse the array, and join it back with a space (producing “first_name last_name”)

To specify patterns, use filters and facets. Typically, you create a filter or facet on a particular column. For example, you can create a numeric facet on the “contract fee” column and adjust its range selector to select values less than 1. If the default facet doesn’t do what you want, you can configure it (by clicking “change” on the facet’s header). For example, you can create a text facet with on the same “contract fee” column with this expression:

  value < 1

It will show 2 choices: true and false. Just select true. Then, invoke the Transform command on that same column and enter the expression

  value * 1000

That Transform command affects only rows where the “contract fee” cell contains a value less than 1.

You can use several filters and facets together. Only rows that are selected by all facets and filters will be shown in the data table. For example, say you have two text facets, one on the “contract fee” column with the expression

  value < 1

and another on the “state” column (with the default expression). If you select “true” in the first facet and “Nevada” in the second, then you will only see rows for contracts in Nevada with fees less than 1.

Analogies

Databases

If you have programmed databases before (performing SQL queries), then what Google Refine works should be quite familiar to you. Creating filters and facets and selecting something in them is like performing this SELECT statement:

  SELECT *
  WHERE ... constraints determined by selection in facets and filters ...

And invoking the Transform command on a column while having some filters and facets selected is like performing this UPDATE statement

  UPDATE whole_table SET column_X = ... expression ...
  WHERE ... constraints determined by selection in facets and filters ...

The difference between Google Refine and databases is that the facets show you choices that you can select, whereas databases assume that you already know what’s in the data.

 

The Latest GUI for R- BioR

Once more a spanking new shiny software -

Bio7 is a integrated development environment for ecological modelling based on the Rich-Client-Platformconcept of the Java IDE Eclipse. The Bio7 platform contains several perspectives which arrange several views for a special purpose useful for the development and analysis of ecological models. One special perspective bundles a feature rich GUI (Graphical User Interface) for the statistical software R.
For the bidirectional communication between Java and R the Rserve application is used (as a backend to evaluate R code and transfer data from and to Java).
The Bio7 R perspective (see figure below) is divided into a R-Shell view on the left side (conceptual the R side) and a Table view on the right side (conceptual the Java side).
Data can be imported to a spreadsheet, edited and then transferred to the R workspace. Vice versa data from R can be transferred to a sheet of the Table view and then exported e.g. to an Excel or OpenOffice file.

and

General:

Built upon Eclipse 3.6.1.

Now works with the latest Java version! (Windows version bundled with the latest JRE release).

Removed the Soil perspective (now soils can be modeled with ImageJ (float precision). Active images can be displayed in the 3D discrete view (new example available).

Removed the database perspective and the plant layer. You can now built any discrete models without any plant layer.

Removed several controls in the Control view. Added the “Custom Controls” view. In addition ported the Swing component of the Time panel to Swt.

Deleted the avi to swf converter in the ImageJ menu.

Now patterns can be saved with opened Java editor source. If this file is reopened and dragged on Bio7 the pattern is loaded, the source is compiled and the setup method (if available) is executed. In this way model files can be used for presentations ->drag, setup and run. The save actions are located in the Speadsheet view toolbar.

More options available to disable panel painting and recording of values (if not needed for speed!).

New Setup button in the toolbar of Bio7 to trigger a compiled setup method if available.

Removed the load and save pattern buttons from the toolbar of Bio7. Discrete patterns can now be stored with the available action in the spreadsheet view menu.

New P2 Update Manager available in Bio7.

Updated the Janino Compiler.

New HTML perspective added with a view which embeds the TinyMC editor.

New options to disable painting operations for the discrete panels.

New option to explicitly enable scripts at startup (for a faster startup).

Quadgrid (Hexgrid)

Only states are now available which can be created in the “Spreadsheet” view menu easily. Patterns can be stored and restored as usual but are now stored in an *.exml file.

New method to transfer the quadgrid pattern as a matrix to R.

New method to transfer the population data of all quadgrid states to R.

ImageJ:

Update to the latest version (with additional fixes).

Fixed a bug to rename the image.

Thumbnail browser can now open images recursevely(limited to 1000 pics), the magnifiyng glass can be disabled, too.

Plugins can be installed dynamically with a drag and drop operation on the ImageJ view or toolbar (as known from ImageJ).

Installed plugins now extend the plugin menu as submenus or subsubmenus (not finished yet!).

Plugins can now be created with the Java editor. New Bio7 Wizard available to create a plugin template.

Compiled Java files can be added to a *.jar file with a new available action in the Navigator view (if you rightclick on the files in the Navigator). In this way ImageJ plugins can be packaged in a *.jar.

Floweditor:

Fixed a repaint bug in the debug mode of a flow (now draws correctly the active shape in the flow).

Resize with Strg+Scrollwheel works again.

Comments with more than one line works again.

New Test action to verify connections in a flow.

Debug mode now shows all executed Shapes.

Integrated more default tests (for the verification of a regular flow).

A mouse-click now deletes colored shapes in a flow (e.g. in debug mode).

Points panel:

Integrated (dynamic) Voronoi, Delauney visualization (with area and clip to rectangle action).

Points coordinates can now be set in double precision.

Transfer of point coordinates to R now in double precision.

Bio7 Table:

New import and export of Excel 2007 OOXML.

Row headers can now be resized with the mouse device.

R:

Updated R (2.12.1) and Rserve (0.6.3) to the latest version.

New help action in the R-Shell view.

New action to display help for R specific commands in the embedded Bio7 browser (which opens automatically).

New Key actions to copy the selected variable names to the expression dialog (c=cocatenate (+), a=add (,)).

New action to transfer character or numeric vectors horizontally or vertically in an opened spread (Table view) at selection coordinates.

Empty spaces in the filepath are now allowed under Windows if Rserve is started with a system shell or the RGUI (for the tempfile select a location in the Preferences dialog which is writeable) is started.This works also for the RGUI action.

Improved the search for the “Install packages” action (option “Case Sensitive” added).

API:

New API methods available!

And:

Many fixes since the last version!

 

Installation

Important information:

A certain firewall software can corrupt the Bio7 *.zip file (as well as other files).
Please ensure that you have downloaded a functioning Bio7 1.5 version. In addition it is also reported that a certain antivirus software detects the bundled R software (on Windows) as malware. Often the R specific “open.exe” is detected as malware. Please use a different scanner to make sure that the software is not infected if you have any doubts. For more details see:

http://r.789695.n4.nabble.com/trojan-at-current-development-version-td3244348.html

 

PySpread Magic

Python logo

Image via Wikipedia

Just working with PySpread- and worked on a 1 million by 1 million spreadsheet- Python sure looks promising for the way ahead for stat computing ( you need to

sudo apt-get install python-numpy python-rpy python-scipy python-gmpy wxpython*,

cd to the untarred bz2 file from http://pyspread.sourceforge.net/download.html,  (like

:~/Downloads$ cd pyspread-0.1.2

:~/Downloads/pyspread-0.1.2

sudo python setup.py install

)

http://pyspread.sourceforge.net/

by Martin Manns

 

about Pyspread is a cross-platform Python spreadsheet application. It is based on and written in the programming language Python.

Instead of spreadsheet formulas, Python expressions are entered into the spreadsheet cells. Each expression returns a Python object that can be accessed from other cells. These objects can represent anything including lists or matrices.

Pyspread screenshot
features
  • Three dimensional grid with up to 85,899,345 rows and 14,316,555 columns (64 bit systems, depends on row height and column width). Note that a million cells require about 500 MB of memory.
  • Complex data types such as lists, trees or matrices within a single cell.
  • Macros for functionalities that are too complex for a single Python expression.
  • Python module access from each cell, which allows:
    • Arbitrary size rational numbers (via gmpy),
    • Fixed point decimal numbers for business calculations, (via the decimal module from the standard library)
    • Advanced statistics including plotting functions (via RPy)
    • Much more via <your favourite module>.
  • CSV import and export
  • Clipboard access
Pyspread screenshot

warning The concept of pyspread allows doing everything from each cell that a Python script can do. This powerful feature has its drawbacks. A spreadsheet may very well delete your hard drive or send your data via the Internet. Of course this is a non-issue if you sandbox properly or if you only use self developed spreadsheets.

Since this is not the case for everyone (see discussion at lwn.net), a GPG signature based trust model for spreadsheet files has been introduced. It ensures that only your own trusted files are executed on loading. Untrusted files are displayed in safe mode. You can approve a file manually. Inspect carefully.

 

Here comes PySpread- 85,899,345 rows and 14,316,555 columns

A Bold GNU Head

Image via Wikipedia

Whats new/ One more open source analytics package. Built like a spreadsheet with an ability to import a million cells-

From http://pyspread.sourceforge.net/index.html

about Pyspread is a cross-platform Python spreadsheet application. It is based on and written in the programming language Python.

Instead of spreadsheet formulas, Python expressions are entered into the spreadsheet cells. Each expression returns a Python object that can be accessed from other cells. These objects can represent anything including lists or matrices.

Pyspread screenshot
features In pyspread, cells expect Python expressions and return Python objects. Therefore, complex data types such as lists, trees or matrices can be handled within a single cell. Macros can be used for functions that are too complex for a single expression.

Since Python modules can be easily used without external scripts, arbitrary size rational numbers (via gmpy), fixed point decimal numbers for business calculations, (via the decimal module from the standard library) and advanced statistics including plotting functions (via RPy) can be used in the spreadsheet. Everything is directly available from each cell. Just use the grid

Data can be imported and exported using csv files or the clipboard. Other forms of data exchange is possible using external Python modules.

In  order to simplify sparse matrix editing, pyspread features a three dimensional grid that can be sized up to 85,899,345 rows and 14,316,555 columns (64 bit-systems, depends on row height and column width). Note that importing a million cells requires about 500 MB of memory.

The concept of pyspread allows doing everything from each cell that a Python script can do. This may very well include deleting your hard drive or sending your data via the Internet. Of course this is a non-issue if you sandbox properly or if you only use self developed spreadsheets. Since this is not the case for everyone (see the discussion at lwn.net), a GPG signature based trust model for spreadsheet files has been introduced. It ensures that only your own trusted files are executed on loading. Untrusted files are displayed in safe mode. You can trust a file manually. Inspect carefully.

Pyspread screenshot

requirements Pyspread runs on Linux, Windows and *nix platforms with GTK+ support. There are reports that it works with MacOS X as well. If you would like to contribute by testing on OS X please contact me.

Dependencies

Highly recommended for full functionality

  • PyMe >=0.8.1, Note for Windows™ users: If you want to use signatures without compiling PyMe try out Gpg4win.
  • gmpy >=1.1.0 and
  • rpy >=1.0.3.
maturity Pyspread is in early Beta release. This means that the core functionality is fully implemented but the program needs testing and polish.

and from the wiki

http://sourceforge.net/apps/mediawiki/pyspread/index.php?title=Main_Page

a spreadsheet with more powerful functions and data structures that are accessible inside each cell. Something like Python that empowers you to do things quickly. And yes, it should be free and it should run on Linux as well as on Windows. I looked around and found nothing that suited me. Therefore, I started pyspread.

Concept

  • Each cell accepts any input that works in a Python command line.
  • The inputs are parsed and evaluated by Python’s eval command.
  • The result objects are accessible via a 3D numpy object array.
  • String representations of the result objects are displayed in the cells.

Benefits

  • Each cell returns a Python object. This object can be anything including arrays and third party library objects.
  • Generator expressions can be used efficiently for data manipulation.
  • Efficient numpy slicing is used.
  • numpy methods are accessible for the data.

Installation

  1. Download the pyspread tarball or zip and unzip at a convenient place
  2. In case you do not have it already get and install Python, wxpython and numpy
If you want the examples to work, install gmpy, R and rpy
Really do check the version requirements that are mentioned on http://pyspread.sf.net
  1. Get install privileges (e.g. become root)
  2. Change into the directory and type
python setup.py install
Windows: Replace “python” with your Python interpreter (absolute path)
  1. Become normal user again
  2. Start pyspread by typing
pyspread
  1. Enjoy

Links

Next on Spreadsheet wishlist-

a MSI bundle /Windows Self Installer which has all dependencies bundled in it-linking to PostGresSQL ;) etc

way to go Mr Martin Manns

mmanns < at > gmx < dot > net

LibreOffice News and Google Musings

Tux, the Linux penguin

Image via Wikipedia

Official Bloggers on LibreOffice- http://planet.documentfoundation.org/

Note- for some strange reason I continue to be on top ranked LibreOffice blogs- maybe because I write more on the software itself than on Oracle politics or coffee spillovers.

LibreOffice Beta 2  is ready and I just installed it on Windows 7 – works nice- and I somehow think open Office and Google needs an  example to stop being so scary on cautioning—— hey,hey it’s a  beta – (do you see Oracle saying this release is a beta or Windows saying hey this Windows Vista is a beta for Windows 7- No right?)-

see screenshot of solver in  LibreOffice spreadsheet -works just fine.

We cant wait for Chromium OS and LibreOffice integration (or Google Docs-LibreOffice integration)  so Google starts thinking on those lines (of course

Google also needs to ramp up Google Storage and Google Predict API- but dude are you sure you wanna take on Amazon, Oracle and MS and Yahoo and Apple at the same time. Dear Herr Schmidt- Last German Guy who did that ,  ended up in a bunker in Berlin. (Ever since I had to pay 50 euros as Airline Transit fee -yes Indian passport holders have to do that in Germany- I am kind of non objective on that issue)

Google Management is busy nowadays thinking of trying to beat Facebook -hint -hint-

-buy out the biggest app makers of Facebook apps and create an api for Facebook info download and upload into Orkut –maybe invest like an angel in that startup called Diaspora http://www.joindiaspora.com/) see-

Back to the topic (and there are enough people blogging on Google should or shouldnt do)

-LibreOffice aesthetically rocks! It has a cool feel.

More news- The Wiki is up and awaits you at http://wiki.documentfoundation.org/Documentation

And there is a general pow-wow scheduled at http://www.oookwv.de/ for the Open Office Congress (Kongress)

As you can see I used the Chrome Extension for Google Translate for an instant translation from German into English (though it still needs some work,  Herr Translator)

Back to actually working on LibreOffice- if Word and Powerpoint is all you do- save some money for Christmas and download it today from
Follow

Get every new post delivered to your Inbox.

Join 843 other followers