Is the XPath breaking the Documents?

Dr_Jeff · October 10, 2020, 8:43pm

Is it just me, or does the text of newly added documents only show up when no XPath is set?

With XPath: https://edit.tosdr.org/documents/2895

Without XPath: Terms of Service; Didn't Read - Phoenix

Peepo · October 11, 2020, 9:46pm

The crawler can extract documents without an XPath, but if you look, there’s a lot of extra, unneeded stuff in the document. That’s because it just takes the entire page. This wouldn’t normally be a problem, but if a site changes its layout even a little bit, advertisements included, the document would need a re-crawl every time. The document without the XPath, for example, contains “SPECIAL OFFER: BUY A NEW iPHONE AT 0% APR. LEARN MORE.” Annotating a document full of unneeded content is also annoying, and a clean document simply looks better. I fixed the Spotify document that you listed by adding an XPath, so it should work now.

Also, there are several Spotify privacy policies in ToS;DR. ToSBack handles the archives, so a new document isn’t required if the terms change. If you notice the terms are out of date, just re-crawl the document and it should update automatically. If that doesn’t work, usually the site has moved its terms to a different URL or XPath.

Re-crawling is better because existing points/annotations are kept, and if the re-crawled document still contains the same quote, the point keeps its status. So if a quote was approved and the terms change but the highlighted quote didn’t, the point is carried over and stays approved in the updated document. If a quote did change, then as of right now the point goes to “draft”.
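
For anyone curious, here is a rough Python sketch of the two ideas above: scoping the extracted text with an XPath, and carrying a point's status over after a re-crawl. It is not the actual ToS;DR crawler code. The sample page, the `terms` id, and the function names `extract_text` and `carry_over_status` are all made up for illustration. It also relies on the standard library's `xml.etree.ElementTree`, which only handles well-formed markup and a limited subset of XPath, so a real crawler would need a proper HTML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical page: an advertising banner followed by the actual terms.
PAGE = """<html><body>
<div id="banner">SPECIAL OFFER: BUY A NEW iPHONE AT 0% APR. LEARN MORE.</div>
<div id="terms"><h1>Terms of Use</h1><p>You must be 13 or older to use the service.</p></div>
</body></html>"""


def extract_text(page, xpath=None):
    """Return the page text, limited to the node matched by xpath if one is given."""
    root = ET.fromstring(page)
    node = root if xpath is None else root.find(xpath)
    if node is None:
        # The site probably changed its layout, so the XPath needs updating.
        raise ValueError(f"XPath {xpath!r} matched nothing")
    # Join all text fragments and collapse runs of whitespace.
    return " ".join(" ".join(node.itertext()).split())


def carry_over_status(quote, old_status, new_text):
    """Keep the old status if the quote survives the re-crawl, else fall back to draft."""
    return old_status if quote in new_text else "draft"


whole_page = extract_text(PAGE)                         # includes the banner
terms_only = extract_text(PAGE, ".//div[@id='terms']")  # banner is excluded

print(whole_page)
print(terms_only)
print(carry_over_status("You must be 13 or older", "approved", terms_only))
print(carry_over_status("You must be 16 or older", "approved", terms_only))
```

Without an XPath the banner text ends up in the document, so any change to the ad would change the crawled text. With the XPath, only the terms are kept, and a quote that is still present word for word keeps its status.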