Tuesday, March 20, 2018

HATEOAS is useless? Or not used enough?

See Why HATEOAS is useless and what that means for REST.

The article provides a background leading up to these observations:
  • There are very few good tools to create a REST API using this style
  • There are no clients widely used to consume these types of APIs
The "useless" in the title is more like "not used enough."

There's a multi-part conclusion that may be more helpful if it's fleshed out further. For now, however, it appears that the big problems center around:
  • You still need to write Open Api Specifications (OAS, f/k/a Swagger). I don't think this is bad. The blog post makes it sound like a problem. I think it's essential.
  • You need to put versioning somewhere. The path is less than idea. I'm big on the Accept header containing application-specific MIME types. For example, application/vnd.com.your-name-here.app.json+v1. This doesn't strike me as a problem, either.
  • The whole approach is "closer to RPC than some REST lovers like to admit." I think this point revolves around the way JSON-RPC or SOAP involves some overheads above basic HTTP that are unhelpful. I don't think the "closer to RPC" follows logically from the lack of tooling for HATEOS, but it certainly could be true that a badly-done API might involve too many of the wrong kinds of overheads.
I think there's a hidden strawman here. The "automatic discovery" idea. I don't think this idea makes a lick of sense. Some people think it's implied (or required) by REST, and any failure to provide for fully-automated semantically rich discovery of an API is some kind of failure.

I don't think full semantic discovery is possible or even desirable. 
  • It's not possible because of the problem of assigning names and meanings to resources and verbs in an end-point. The necessary details can only be exposed with a semantically complete ontology and complex SPARQL queries into the ontology to find resources and end-points. 
  • It's not desirable because we replace a human-focused OAS with a complete ontology that has to be rigorously defined, and tested to be sure that all kinds of automated discovery algorithms can understand the provided details. And none of this addresses the actual application, it's all rich, detailed meta-description of the application.
I don't see why we're trying to replace people. API discovery is actually kind of hard. The resources, their relationships, and the verbs for getting or updating those resources involves an essentially difficult knowledge capture and dissemination problem. 

Friday, March 9, 2018

Python Interviews

The #Python Interviews book is out. Mike Driscoll interviewed a bunch of Python experts. And me, too. get 30% off the Amazon paperback version of the book using the code 30PYTHONhttps://goo.gl/5A3uhq

Here's a flavor of how this went:

Driscoll: So how did you end up becoming an author of Python books?
Lott: Most roles in my career more or less just happened to me, but becoming a writer was a conscious decision.
In this case, I had decided that there could be value in teaching the Python language and the associated software engineering skills. I started to collect notes for a book in 2002. By 2010, I had tried self-publishing several books on Python.
When Stack Overflow started, I was an early participant. There were many interesting Python questions. The questions showed gaps where more information was needed about Python specifically and software engineering in general. Over a few years, I answered thousands of questions about Python and somehow built up a large reputation.

Monday, March 5, 2018

Python Interviews -- Coming Soon from Packt

See https://www.packtpub.com/web-development/python-interviews

I'm honored.

I'll be studying what the other folks have to say in here. Being in the Python community means respecting other's views. And that means understanding them.

This looks like fun because it isn't *deeply* technical, it's about people and technology.

Tuesday, January 30, 2018

The SQL-based relational database isn't perfection? Whoa if true

Yes, there are people for whom document databases (and the file system) are confusing and weird.

I was sent this: Relational Algebra Is the Root of SQL Problems which is really brilliant and provides some helpful concrete examples of stuff SQL is really bad at.

The accompanying email was filled with nonsense about how important and world-changing SQL was.

I can't disagree. Back when disk was very expensive and very small, the SQL-based join strategies where essential for micro-managing every bit of data. Literally. Every Bit.

And then we would denormalize the structure for performance reasons. Because we always knew the SQL was terrible at a fairly large number of things.

Those days are behind us. We can now chose to use a document database, and make our lives simpler. Storage is relatively inexpensive, and the labor to normalize and denormalize data doesn't create significant value. The need to write stored procedures to turn a single conceptual operation into a bunch of inserts and updates was a symptom that this wasn't the best approach.

I've had many "But what about..." conversations regarding document databases.

"What about ad-hoc queries in SQL?"

- Do you really do these without writing a Python script or creating a Pandas dataframe? I doubt it. But. If you really think you'll do this, most document stores either support a modified SQL or Javascript. And yes, you hate Javascript, duly noted. I hate SQL, so we're even there.

"What about joins?"

- It's a space-saving technique. We don't need the overheads to save the space. The "update anomalies" still require careful design, and may lead to some decomposition of data into multiple documents. But the ruthless normalization shouldn't be seen as a requirement.

"What about the schema?"

- It's brittle and schema migration creates a lot of low-value labor. We can use Python JSONSchema to validate documents. See NoSQL Database doesn't Mean No Schema.

Transactional v. Analytical

It requires some care to understand the distinction between "transactional" and "analytical" uses for data. While folks try to leverage this distinction, it's a spectrum not a distinction.

A lot of data collection is a simple sequence of event documents. These have no sensible state change, so they're not really transactional. They are often created by concurrent processes where locking prevents corruption, so transactions *seem* helpful. Except, of course, the file system writes can be trivially sharded by process ID and then unified later. And all document databases serialize document writes from multiple client processes, so there's no value to writing a relational database.

Some data operations are properly stateful. By normalizing our tables, moving from consistent state to consistent state is made complex. Which requires a defined transaction as a work-around. And don't get me started on replication and two-phase commit as yet another layer of complexity on top of transactions.

A document database allows us to skip over 1NF. We can think of a document as being a row in a table where the data types are complex data structures involving mappings, sequences, strings, numbers, booleans, and nulls. (See JSON Schema.) A lot of multi-step SQL transactions are operations on several children of a common parent. If the parent was persisted as a single document, there wouldn't be multiple operations, an atomic MongoDB update operation can make complex rewrites to a complex document.

We can contrive a design where state changes must be coordinated and the data cannot be colocated in a single document. It's not difficult to stipulate enough requirements to make single documents difficult. The presence of these contrived requirement, however, doesn't suddenly invalidate document datastores for transactional data. In the SQL world, the idea of long-running and reversible long-running transactions has always been a horrible problem. Allowing stacked "undo" for the user means either creating a chain of Memento objects that can recover previous state, or having numerous flags and indicators on each record, allowing the state to be reversed. Some design problems are really hard. And the SQL model seems to make them harder.

The core ACID concepts of always consistent is -- in practice -- nonsense. As soon as we have to consider "isolation levels" and "read consistency" it becomes clear that there is no consistent state unless all transactions and queries are serialized via exclusive "whole database" locking. Competent DBA's know that long-running analytic queries performed concurrently with transactional updates can't use locking, and must tolerate inconsistencies in the database.

It's common practice to do data extracts so that analytic queries aren't working against the (inconsistent) transactional data. In this case, the frequency of extracts is the timing of "eventual consistency" promised by the BASE concept.

Bottom Line: Relational ACID rules are almost always broken in practice by read consistency rules and extracts to analytic databases. Analytical data is always based on eventual consistency expectations. The batch extracts means "eventually" is measured in hours. A document data store can often create consistency in milliseconds. (MongoDB primary failure, voting, and secondary promotion to primary relies on a 10-second heartbeat, so it takes time to discover and repair.)


A second email detailed their amazement (Amazing! Wow! Unbelievable! You Must Inform The World Of This!) that analytic processing of data is actually faster and simpler using the file system. The very idea of HDFS was so amazing that they were amazed.

Somehow, the idea of the raw filesystem as being really, really fast was the source of much amazement.

I'm glad they're making an effort to catch up. I'm glad they're seeing the relational model as a bad choice that has a limited number of use cases. Mostly, relational databases are useful for an organization can't write API's to handle the integrity issues.

To SQL or NoSQL? That's the database question | Ars Technica

Tuesday, January 23, 2018

PyCon 2018 Program Committee

I was "volunteered" by a colleague to help the program committee for PyCon 2018. I rarely think of myself as qualified for this kind of thing. Yes. I have six books on Python (with a seventh on the way) but the PSF folks are brilliant and dedicated and hard-working, and I'm just a slob.

Yes, I do get to help the community by reviewing almost 700 individual proposals. Some good. Some really good. Some which we *must* hear. 

The collateral benefit? 

Side reading.

My browser history is filled with things I hadn't known existed. 

Next time, I need to get started *before* the deadline so I can have a little more interaction with the authors. There were a few outlines where we could only discuss the possibility of making a change if the proposal was accepted. 

In particular, there seemed to be a *lot* of Machine Learning-Bayesian-Deep Learning-Recommender-Data Science pitches that had abbreviated outlines. They tend to all look alike to someone who's not an expert. Five bullet points: the author's background, the problem domain, ML (or modeling or whatever), a Jupyter notebook showing the results, and a conclusion.  Providing some distinct angle to the pitch (other than the problem domain) might help me understand them more fully. It seemed best to defer to the consensus on these.

I've been learning to live with my personal bias against meta-talks about building community. A presentation on community building at a community event seems redundant to me. But that doesn't mean they're not thorough, articulate talks that will be useful to others. Since I have a seat at the table, I'm biased. The Python tie-in feels weak, but our code of conduct (Open, Considerate, Respectful) means PyCon really is the place for more of this. Most importantly, they're objectively solid talks. (And -- as a member of the the over-represented old male nerd class, I do need to listen more.)

It's been enlightening. And the conference will rock.

Tuesday, January 16, 2018

I've decided on Windows -- Please help justify my choice

After many words, the email chain I received netted out to this:
  1. I can't teach myself data science on my crappy old Windows machine.
  2. I've decided to get a new Windows machine. Here are the specs.
My response was a mixture of incredulity and bafflement.

It appears that two things happened while I wasn't paying attention.
  1. Apple ceased to exist.
  2. The cloud ceased to exist.
I'm aware that many people think the Apple alternative is a non-starter. They are sure that after 40+ years selling computes, Apple is doomed, and we'll only have Windows on the desktop. Seriously.

Some people have farcical explanations for why Apple Cannot be Taken Seriously.

In this specific instance, there was a large investment in Python and Java that somehow couldn't be rebuild in Mac OS X. Details were explicitly not provided. Which is a way of saying there were no tangible "requirements" for this upgrade. Just specification numbers.

Import note. None of this involved "data" or "science." That was the baffling part. No objective measurement of anything. No list of software titles. No projects. No dataset sizes. Nothing.

The anti-cloud argument was even stranger than the anti-Mac argument.

Somehow, a super-large AWS server -- let's say it was a x1.16xlarge -- being used an hour a day (365 day*1 hr/day*$1.82/hr = $664) was deemed *more* expensive that a 64Gb 6-core home-based machine that would sit idle 23 hours each day.

The best part of $664/yr being *more* expensive?

Expert Judgement.

No "data". No "science". No measurement. No supporting details.

I wish I'd kept the email describing how someone who knew something said something about pricing.  It was marvelous Highest Paid Person's Opinion nonsense.

AFAIK, they were using 8,766 hours per year to compare AWS computing vs. at-home computing. This meant that an m5.4xlarge should be considered as costing $1,939 each year. Presumably because they'd never shut it off.

It included terms like "half-way decent performance."

There's a depth of wrongness to this that's hard to characterize beyond no "data" and no "science".