MS Excel’s Science SNAFUs — Bug or User Error?

Posted on August 26, 2016 by Richi Jennings in Main, Office, SQL Server with 0 Comments

Excel genetics errors

Users warned to use the right tool for the job

Microsoft Excel causes problems in genetic research. That’s the claim of three researchers from an Australian institute, who discovered almost 20% of data sets contained errors introduced by Excel.

The problem is Excel being “clever”—guessing the type of data it’s being asked to import. Unfortunately, some gene names look like dates. And other common data looks like floating-point numbers.

But so what? You don’t do genetics research, but what can we learn? In today’s IT Newspro, we look at the implication for IT pros—and it’s not good.

Your humble newswatcher curated these news nuggets for your entertainment. Not to mention: Fix the No Man’s Sky problem

What’s the craic? Aunty BBC’s anonymous scribblers say

Researchersclaim that the spreadsheet software automatically converts the names of certain genes into dates. [They] claimed the problem is present in “[20%] of papers” that collated data in Excel.

“Excel is able to display data and text in many different ways. Default settings are intended to work in most day-to-day scenarios,” a spokeswoman for [Microsoft said]. “Excel offers a wide range of options, which customerscan use to change the way their data is represented.”

The Excel gene renaming issuewas first cited by the scientific community back in 2004.The problem has “increased at an annual rate of 15%” over the past five years.

Please, let’s not call it Excelgate, OK? So Simon Sharwood snarkily says Excel hell:

The Baker IDI Heart & Diabetes Institute [published] a paper titled Gene name errors are widespread.It’s not hard to change the default format of Excel cells.Much of the problemis therefore between scientists’ ears, rather thanExcel.

The paper offers two workarounds. One is to use Google Sheets.The authors also cooked up scripts to find Excel errorson Sourceforge.

It’s not just mis-identifying gene names as dates. As noted by Peter Gothard—More ham-fisted fun with everyone’s favourite spreadsheet:

RIKEN identifiers are automatically convertedinto floating point numbers.For example from accession 2310009E13 to 2.31E+13.

The problem is obvious: you don’t mess with genetics. [It can lead] to Jurassic Park.We could be looking at an army of giant Clippysif we’re not too careful.

Bad for academia, but what about enterprise IT? A collection of commentators comment thuswise:

What surprises me most is that YYYY-MM-DD still isn’t the commonly used/accepted format. I work in a [European] company that does a lot of business with US companies.DD-MM-YY v. MM-DD-YY format ****ups have resulted in problems [very] often. And even now there are many formswhere DD-MM-YY is the only accepted format.

Another example: exported lists of usernames and passwords.The passwords were 8-character random hex strings. It worked fine 99.9% of the time – except for ones which happened to be like “123456e8” which Excel had munged into floating-point.

60,000 distinct 14-digit ID numbersall coming up as 5.42342E14 and when reformatted as numberas 54234200000000. You mean that’s not what was desired?

In mywork I get spammed by Excel worksheets.
1. Excelsheets with minutes from meetings! (use word)
2. Excelsheets with Rack Diagrams (use Rackbuilder)
3. Excelsheets with asset inventories (use some sort of Database)

Don’t even get me started on phone numbers [in] international format—with a ‘+’ at the beginning. Excel helpfully [sees] that as a formula.Clippy is dead, and come back to haunt us as a spreadsheet poltergeist.


So what do the researchers have to say for themselves? G’day, Mark Ziemann, Yotam Eren, and Assam El-Osta—Gene name errors are widespread:

Inadvertent gene symbol conversion is problematic.Supplementary [Excel] files are an important resourcethat are frequently reused.There is no way to permanently deactivate automatic conversionin MS ExcelLibreOffice Calc or Apache OpenOffice Calc. [But] Google Sheets did not convert any gene names to dates or numbers.

There are undoubtedly many more instances oferrors in journals outside of the 18 we screened.Conversion errorsshould be easy to avoid if researchers, reviewers, editorial staff and database curators remain vigilant.

Excel considered harmful. Sky falling. Film at 11. But “This isn’t an Excel error,” comments cschulz:

It is an operator error.Any data submitted for publication should be double-checkedThis study indicates lack of diligence [that] should have been caught ahead of publication.

It must be time for a “get off my lawn” riposte. Hear hearme0 out:

This problem increasing 15% is [a] result of a 15% increase in non-thinkers and dumb*****.There is no excuse!Sadly, that’s how the youth is today.

But is it really the users’ fault? No: PEBKAC, says WaffleMonster:

I do fault Excel itself because these errors are pervasive. They could have better structured the data importsor asked users for more feedback or have the import do a pass over the entire datasets checking for outliers that may suggest a different type.

When a critical mass is “doing it wrong,” it becomes pointlessto point fingers at users. Tools are supposed to be useful.

Buffer Overflow

More great links from Petri, IT Unity, Thurrott and abroad:


And Finally

Is No Man’s Sky a huge disappointment? Perhaps you missed the “secret” bit?
[warning: some naughty language]

You have been reading IT Newspro by Richi Jennings, who curates the best bloggy bits, finest forums, and weirdest websites… so you don’t have to. Hatemail may be directed to @RiCHi or [email protected]. Ask your doctor before reading. Your mileage may vary. E&OE.

Main image credit: Christoph Bock (cc:by-sa)


Tagged with , , , , , ,