Partnership between Phoenix and Open Terms Archive: sourcing document text

Hi, everybody!

As part of an ongoing effort to improve the data quality in Phoenix, we will be gradually transitioning to an implementation that sources document text from Open Terms Archive (OTA), another admirable open-source project.

We’re hoping this will help address the problem of keeping document texts clean and up to date in Phoenix.

Phoenix and OTA store substantially similar data, but OTA’s data, i.e. terms text, is of a much higher quality.

OTA collects and monitors terms using a more advanced system of crawlers. The collected terms are cleaned of unrelated content, combined if scattered across several pages, and normalized if stored in PDFs.

In OTA, contributions are maintained collaboratively and openly by volunteers and, in some cases, researchers. As with Phoenix, OTA’s code and processes are completely open source and available on GitHub (Open Terms Archive).

We will be testing a portion of this implementation in the coming weeks. Using Facebook as an example, Phoenix will source Facebook’s Privacy Policy from OTA’s archive.

If a service’s terms are sourced from OTA, as Facebook’s will be, the service page on Phoenix will say so. This should not affect annotations or the ongoing contribution of points.

If a certain service’s terms are not yet stored by OTA, Phoenix will default to the terms fetched and stored by Phoenix crawlers and contributed by Phoenix users.
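For anyone who prefers to see the fallback rule as code, here is a minimal sketch in Python. It is only an illustration of the logic described above, not the actual Phoenix implementation: the names `DocumentText` and `resolve_document_text`, and the use of plain dictionaries to stand in for the OTA archive and the Phoenix store, are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Mapping, Optional, Tuple

# Hypothetical key: (service name, document type), e.g. ("Facebook", "Privacy Policy").
DocumentKey = Tuple[str, str]


@dataclass
class DocumentText:
    service: str
    document_type: str
    text: str
    source: str  # "ota" or "phoenix"; could drive the notice on the service page


def resolve_document_text(
    service: str,
    document_type: str,
    ota_archive: Mapping[DocumentKey, str],
    phoenix_store: Mapping[DocumentKey, str],
) -> Optional[DocumentText]:
    """Prefer the OTA text; fall back to the text stored by Phoenix."""
    key = (service, document_type)

    # First choice: the terms as stored by OTA (stand-in dictionary here).
    ota_text = ota_archive.get(key)
    if ota_text:
        return DocumentText(service, document_type, ota_text, "ota")

    # Fallback: the terms fetched by Phoenix crawlers or contributed by users.
    phoenix_text = phoenix_store.get(key)
    if phoenix_text:
        return DocumentText(service, document_type, phoenix_text, "phoenix")

    # Neither source has this document.
    return None


if __name__ == "__main__":
    ota = {("Facebook", "Privacy Policy"): "Text from the OTA archive"}
    phoenix = {
        ("Facebook", "Privacy Policy"): "Text from a Phoenix crawler",
        ("Example Service", "Terms of Service"): "Text from a Phoenix crawler",
    }
    for key in [("Facebook", "Privacy Policy"), ("Example Service", "Terms of Service")]:
        doc = resolve_document_text(key[0], key[1], ota, phoenix)
        print(key, "->", doc.source if doc else "not found")
```

Recording the source alongside the text is one way the service page could tell readers where a given document came from.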

The longer-term goal is to integrate Phoenix and OTA more fully, so that Phoenix users can contribute document terms directly to the OTA archives. Those contributions would then be monitored by OTA’s crawlers, broadening OTA’s archives and improving the data quality of Phoenix.

I hope that makes sense. Please let us know if you have any questions or doubts.

