When computer models designed by tech giants Alibaba and Microsoft this month surpassed humans for the first time in a reading-comprehension test, both companies celebrated the success as a historic milestone.

Luo Si, the chief scientist for natural-language processing at Alibaba's AI research unit, struck a poetic note, saying, "Objective questions such as 'what causes rain' can now be answered with high accuracy by machines."

Teaching a computer to read has for decades been one of artificial intelligence's holiest grails, and the feat seemed to signal a coming future in which AI could understand words and process meaning with the same fluidity humans take for granted every day.

But computers aren't there yet, and aren't even really that close, said AI experts who reviewed the test results. Instead, the accomplishment highlights not just how far the technology has progressed, but also how far it still has to go.

"It's a large step" for the companies' marketing "but a small step for humankind," said Oren Etzioni, chief executive of the Allen Institute for Artificial Intelligence, an AI research group funded by Microsoft co-founder Paul Allen.

"These systems are brittle, in that small changes to paragraphs result in very bad behavior" and misunderstandings, Etzioni said. And when it comes to, say, drawing conclusions from two sentences or understanding implied ideas, the models lag even further behind: "These kind of implications that we do naturally, without even thinking about it, these systems don't do."

The test involved Stanford University's Question Answering Dataset, a collection of more than 100,000 questions that has become one of the AI world's top battlegrounds for testing how machines read and comprehend.

The models are given short paragraphs taken from more than 500 Wikipedia pages spanning a range of subjects, including Jacksonville, Fla.; economic inequality; and the Black Death. Fed a paragraph about Super Bowl 50, for instance, the models are then asked which musicians headlined the halftime show.
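The Stanford dataset's structure makes the task concrete: every answer is a literal span of the given passage, stored by character offset. A minimal sketch in Python (field names follow the dataset's public JSON layout; the passage text here is a stand-in, not an actual Wikipedia excerpt):

```python
# A SQuAD-style record: each answer is a verbatim span of the context,
# located by character offset. Field names mirror the public dataset's
# JSON layout; the passage is an illustrative stand-in.
record = {
    "context": "The British rock group Coldplay headlined the Super Bowl 50 halftime show.",
    "qas": [{
        "question": "Which group headlined the halftime show?",
        "answers": [{"text": "Coldplay", "answer_start": 23}],
    }],
}

# Recover the answer by slicing the context at the stored offset.
ans = record["qas"][0]["answers"][0]
span = record["context"][ans["answer_start"]:ans["answer_start"] + len(ans["text"])]
print(span)  # Coldplay
```

Because the answer is always a span of the provided context, a model never has to compose an answer of its own, only locate one.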

The first test in August 2016, of a model created by researchers at Singapore Management University, lagged behind a measure of human performance: people on crowdsourced systems, such as Amazon's Mechanical Turk, who earned money for taking surveys or completing small tasks. But after dozens of subsequent tests, researchers this month submitted proof that their models had narrowly and finally beaten the humans: an 82.6 for Microsoft Research Asia's models, compared with the humans' 82.3.

As both Microsoft and the Chinese tech powerhouse Alibaba claimed first-in-AI victories, a flood of glowing media reports followed, positing that AI could not just read better than humans but would also, as Luo Si said in a statement, decrease "the need for human input in an unprecedented way."

Microsoft said it is using similar models in its Bing search engine, and Alibaba said its technology could be used for "customer service, museum tutorials and online responses to medical inquiries."

But AI experts say the test is far too limited to compare with real reading. The answers aren't generated from understanding the text, but from the system finding patterns and matching terms in the same short passage. The test was done only on cleanly formatted Wikipedia articles, not the wide-ranging corpus of books, news articles and billboards that fill most humans' waking hours.

Adding gibberish that a human would easily ignore into the passages often confused the AI, making it spit out wrong results. And every passage was guaranteed to include the answer, preventing the models from having to process concepts or reason with other ideas.
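The experts' point about pattern matching can be caricatured in a few lines: pick the passage sentence that shares the most words with the question. This toy sketch (a hypothetical illustration, not any team's actual model) gets the Super Bowl question right for the wrong reason, and a single appended distractor sentence, of the kind the gibberish tests used, flips its answer:

```python
def best_sentence(passage, question):
    """Toy reader: return the passage sentence sharing the most words
    with the question. A caricature of term matching, not a real model."""
    q_words = set(question.lower().rstrip("?").split())
    sentences = [s.strip() for s in passage.split(".") if s.strip()]
    return max(sentences, key=lambda s: len(q_words & set(s.lower().split())))

passage = ("Coldplay headlined the Super Bowl 50 halftime show. "
           "The game was played in Santa Clara, California.")
question = "Who headlined the Super Bowl 50 halftime show?"
print(best_sentence(passage, question))  # word overlap finds the right sentence

# A distractor that parrots the question's words hijacks the match,
# even though it names a made-up band.
distractor = (" The fictional Quuxtones were the band who headlined the "
              "Super Bowl 50 halftime show in an old video game.")
print(best_sentence(passage + distractor, question))  # now picks the distractor
```

Nothing in the matcher knows what "headlined" means; it only counts shared words, which is why surface-level edits that no human would stumble over can swing its output.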

Stephen Merity, a research scientist who works on language AI at cloud-computing giant Salesforce, said it was an "amazing achievement" but added that calling it superhuman was "madness." "There's no built-in ability for the model to determine or signal that it thinks the paragraph is insufficient to answer the question," he said. "It'll always spit you back something."

Even Pranav Rajpurkar, a Stanford AI researcher who helped design the Stanford test, said there remains "actually quite a big jump" before machines can truly read and understand.

"The goal has always been to get to human-level performance, and it's been inching closer and closer there," Rajpurkar said.

The real miracle of reading comprehension, AI experts said, is in reading between the lines: connecting concepts, reasoning with ideas and understanding implied messages that aren't specifically outlined in the text.

In those realms, AI is still very much a work in progress. Computer models tested by the Winograd Schema Challenge, which asks them to comprehend the meaning of vague sentences that a human would nevertheless understand, have shown mixed results. Merity offered one example that today's AI systems might still struggle to comprehend: the difference between a car "filled with gas," "filled with petrol" and "filled with oranges."

AI researchers said they're eager to push on to new challenges of comprehension beyond basic Wiki-reading: The Allen Institute, for example, is training AI to answer SAT-style math problems and middle-school-level science questions.

But AI experts said people should be less concerned about losing their jobs to machines that thoughtfully read passages about the rain, or anything else.

"Technically it's an accomplishment, but it's not like we have to begin worshiping our robot overlords," said Ernest Davis, a New York University professor of computer science and longtime AI researcher.

"When you read a passage, it doesn't come out of the clear blue sky: It draws on a lot of what you know about the world," Davis said. "We really need to deal much more deeply with the problem of extracting the meaning of a text in a rich sense. That problem is still not solved."