Web scraping is a complicated legal subject involving more than a dozen different laws. As such, it’s not possible to summarize all the legal issues associated with web scraping in just a few hundred words. That said, you can read the bolded sections here and the infographics in just a few minutes and learn the basics.
The good news for web scrapers is that the trend has been toward greater permissiveness with web scraping. The bad news is that the trend is not uniform across legal jurisdictions, and the overall history of web-scraping litigation has not been kind to scrapers.
[This article was last updated on November 19, 2020]
There are a few websites online that purport to answer the question of “whether web scraping is legal.” And way too many of those websites, with unwavering confidence and a complete absence of caution, provide clear and concise answers to that question that are laughably and dangerously false.
One such website claims to have a “three-part test” to determine whether web scraping is legal.
But a web scraper could follow that test and still violate dozens of state and federal laws—even potentially finding themselves in jail. The blog post would be certifiable legal malpractice—except it wasn’t written by a lawyer in the first place.
With the increasing importance of data collection from privately owned websites, web scraping has grown from a niche enterprise to a bona fide industry in the last decade.
As a lawyer (and a python programmer) who has clients in the web-scraping space, I figured the internet was long overdue for a practitioner’s guide to web-scraping law that was actually grounded in the law. With that, I took the time to read every web-scraping case and scholarly journal on the subject published in the last ten years.
What you see here is the end-product of that research.
2. A Brief Overview of Web-Scraping Laws in the US
Congress never drafted a law (and probably never will) to help web scrapers know which web scraping practices are legal and which are not. If we’re being honest, most members of Congress probably couldn’t tell you what web scraping is.
But frequently enough, web scraping becomes the source of very real business and personal disputes. Which means that courts have been forced to resolve those disputes using judicial frameworks designed for other purposes. Usually, the way this works is that someone’s website is scraped. That person or company hires an attorney, who then writes a cease and desist letter telling the web scraper to stop. Then, either the web scraper stops, or they don’t. If they don’t, then the lawyer often files a lawsuit alleging all sorts of legal claims.
Since no law directly applies to web scraping, and courts aren’t inclined to invent new laws, plaintiffs’ lawyers have been forced to get creative in trying to explain to courts why web scraping is a violation of existing laws. In so doing, lawyers have attempted to shoehorn a wide range of legal theories and frameworks into web-scraping litigation.
Perhaps none of this should come as a surprise to web scraping practitioners. But I suspect that few web scraping experts realize just how many different laws have been applied in web-scraping legal cases. To give a sense of just how wide of a net lawyers have cast on this issue, I figured that it might be worthwhile to list some of the laws that have been litigated against scrapers in just the last decade.
Here is a non-exhaustive list of legal claims levied against web scrapers over the last ten years.
- Violation of the Computer Fraud and Abuse Act (“CFAA”)
- Violation of California Penal Code Section 502
- Breach of Contract
- Copyright Infringement
- Trademark Infringement, Lanham Act Violations
- Trade Secret Misappropriation / Theft of Trade Secrets
- Violation of the RICO Organized Crime Statute
- Violation of the Digital Millennium Copyright Act
- Trespass to Chattels
- False Advertising
- Unjust Enrichment
- Tortious Interference with a Contract
- Tortious Interference with Prospective Economic Advantage
- Violation of the Stored Communications Act
- Violation of the Can-Spam Act
I’ve listed these laws here for two reasons. First, to get your attention. And second, to underscore the fact that there is no “three-step process” to determine whether your web scraping practice is legal. Each of these laws has its own criteria for determining whether someone has violated it, which means there are more than a dozen sets of multi-step analyses that could be required to determine whether someone’s web scraping violates existing law.
3. A Concise Summary of Current Web-Scraping Law
In the previous section, I listed the ways you can violate the law by web scraping. In this section, I’m going to try to summarize the converse: the types of web scraping practices that are legal.
In most jurisdictions in the United States, web scraping is usually legal when all of the following are true:
- The data is publicly available and not protected by an access or authentication barrier (hiQ Labs v. LinkedIn Corp. 2019);
- The data is not copyrighted (Compulife Software, Inc. v. Newman 2020), or, if it is, the scraping qualifies as fair use (Authors Guild v. Google, Inc. 2015);
- The scraper does not misuse or misappropriate the trademarks or other intellectual property of the scraped website (Compulife Software, Inc. v. Newman 2020);
- The website does not expressly prohibit the scraping of data through binding terms and conditions (Craigslist v. Instamotor 2017);
- The totality of the scraping does not constitute the theft or misappropriation of a trade secret (Compulife Software, Inc. v. Newman 2020);
- The scraping activity does not burden the website or server in a way that damages or negatively impacts the website being scraped, its property, or its business (QVC v. Resultly 2016);
- The scraping activity does not unjustly enrich the scraper at the expense of the scraped website (Infogroup, Inc. v. Database LLC 2015);
- The scraping activity does not violate any person’s privacy rights (Cooper v. Slice Technologies, Inc. 2017);
- The scraping activity does not interfere with the contractual relationships of the scraped website (Infogroup, Inc. v. Database LLC 2015);
- The scraping activity does not interfere with the prospective business relationships of the scraped website (DHI Group, Inc. v. Kent 2017); and
- The scraping activity is not used to engage in deceptive or misleading communications with the public about the scraped data (Craigslist v. Instamotor 2017).
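A few of these conditions have a direct engineering counterpart. The QVC v. Resultly factor in particular (not burdening the scraped server) is partly within a scraper's control: throttle your request rate and honor robots.txt. Below is a minimal sketch in Python; the interval value and the helper names are illustrative choices of this example, not legal thresholds from any of the cases above.

```python
import time
import urllib.robotparser


class Throttle:
    """Enforce a minimum delay between successive requests to one host."""

    def __init__(self, min_interval_s, clock=time.monotonic, sleep=time.sleep):
        self.min_interval_s = min_interval_s
        self._clock = clock    # injectable for testing
        self._sleep = sleep
        self._last = None

    def wait(self):
        """Block until at least min_interval_s has passed since the last call."""
        if self._last is not None:
            remaining = self.min_interval_s - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()


def allowed_by_robots(robots_txt, user_agent, url):
    """Parse a robots.txt body and report whether user_agent may fetch url."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

To be clear, neither rate limiting nor robots.txt compliance is a safe harbor under any of the cases discussed in this article; they simply reduce the chance of causing the kind of server burden that courts have found actionable.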
4. A History of Web-Scraping Law in the US
(if you don’t care about the history, feel free to skip ahead to Section 5)
a. An introduction to the CFAA
The main law that has been applied to web scraping is the Computer Fraud and Abuse Act (“CFAA”). This law was enacted in 1984 under the Comprehensive Crime Control Act and then expanded as the CFAA in 1986. It was designed as the first “anti-hacking” federal law, and it became a law a half-decade before the advent of the World Wide Web. Since web browsing was not yet a thing at the time, it’s safe to say that web scraping wasn’t involved in the calculus of drafting this law. But when courts look to federal laws that might apply to web scraping, the CFAA has proven to be the best fit.
At a high level, as applied to scrapers, one violates the CFAA if he or she: “intentionally accesses a computer without authorization or exceeds authorized access” in obtaining information from any protected computer (18 USC 1030(a)(2)(C)) or if he or she “knowingly and with intent to defraud, accesses a computer without authorization, or exceeds authorized access, and by means of such conduct furthers the intended fraud and obtains anything of value . . . .”
It is worth noting that the CFAA has both criminal and civil components, meaning that a violation can expose the persons involved to civil lawsuits as well as criminal prosecution.
b. A brief history of the CFAA and web scraping
The early days of the CFAA’s application to web scraping cases were not kind to web scrapers. During that period, nine legal opinions addressed the liability of web scrapers under the CFAA, and all nine of them ended badly for the defendant web scrapers.
In these cases, the end result was either a preliminary injunction against the defendant (meaning that the web scraper was legally barred from conducting further scraping activities) or the denial of a motion to dismiss, allowing a lawsuit to proceed against the defendant web scraper.
When a motion to dismiss is denied, that often results in the parties settling their legal claims out of court, and that means we don’t get any further instruction about what happened in those cases. But we can infer from the circumstances that things didn’t end well for the web scrapers who were attempting to defend their conduct in court.
Things started to improve for web scrapers in 2009, when courts began to narrow the CFAA’s application to situations where web scrapers evaded some sort of technical barrier, rather than situations where they merely accessed a website with a web scraper.
Two cases kickstarted this trend: LVRC Holdings LLC v. Brekka (2009) and US v. Nosal (Nosal I) (2012). While neither case implicated web scraping per se, both changed the legal interpretation of the CFAA in ways that greatly benefited web scrapers.
Both cases involved employees who accessed employer databases to do things their employers did not approve of. In each case, the court found that the employee did not violate the CFAA by using his employer’s database to further personal rather than professional interests. The reasoning was that the employees had lawful access to, and authorization to use, those databases at the time. And while their conduct may have violated other laws for other reasons, because they were accessing the databases legally, it was not a violation of the CFAA.
At this point, courts began to emphasize the distinction between “use” and “access” in assessing liability under the CFAA. If a defendant has the right to use the computer, website, or software in question, then their use will likely not be deemed a violation of the CFAA. If the defendant circumvents an “access” barrier of some kind, then there is a good chance their conduct will be considered a violation of the law.
With this in mind, many companies copped on to the fact that they could rescind scrapers’ access to a website or certain data, and then, if they continued to access it, invoke the CFAA.
This was the basic fact pattern of both Craigslist v. 3Taps, Inc. and Facebook, Inc. v. Power Ventures, Inc. In both cases, scrapers took data from websites that had sent cease and desist letters and put IP access barriers in place to block the offending scrapers. When the defendants continued to access the sites in contravention of these demands, Craigslist and Facebook sued. In both instances, they prevailed on their legal claims.
Under this regime, courts generally began to acknowledge that exceeding an “access” restriction was sufficient to create liability under the CFAA. But as incumbents who hosted and published large databases started to realize, this was a system that could be manipulated.
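An engineering-side takeaway from the 3Taps and Power Ventures line of cases is that liability attached when scrapers kept going after access had been explicitly revoked. That counsels treating revocation signals as terminal for a crawl. Here is a hedged sketch; interpreting these particular HTTP status codes (and a cease and desist letter) as "revocation" is an assumption of this example, not a holding from either case.

```python
# Sketch: stop a crawl permanently once the target signals revoked access.
# Mapping these HTTP statuses to legal "revocation" is this example's
# assumption, not a rule taken from 3Taps or Power Ventures.

REVOCATION_STATUSES = {401, 403, 451}  # unauthorized / forbidden / legal block


class AccessRevoked(Exception):
    """Raised when the target site appears to have revoked access."""


def check_response(status_code, cease_and_desist_received=False):
    """Return True if the crawl may continue; raise AccessRevoked otherwise."""
    if cease_and_desist_received or status_code in REVOCATION_STATUSES:
        raise AccessRevoked(f"halt crawl (status={status_code})")
    return True
```

The design choice worth noting is that revocation raises an exception rather than returning False: under the cases above, continuing after a block is precisely what created liability, so a crawler should fail loudly rather than quietly retry.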
c. hiQ Labs, Inc. v. LinkedIn Corp.
This sets the stage for hiQ Labs, Inc. v. LinkedIn Corp., likely the most important case in contemporary web-scraping jurisprudence.
Everyone knows about LinkedIn. hiQ Labs is a startup founded in 2012 that, by the time it clashed with LinkedIn, had raised about $15 million in funding. hiQ sells employers information about their workforces, which it generates by analyzing data from LinkedIn users’ publicly available profiles. At the time of the suit, it had two main products: “Keeper,” designed to help employers learn which employees are at the greatest risk of leaving, and “Skill Mapper,” which provides a summary of the skills present in an employer’s workforce.
The facts of the litigation are as follows: LinkedIn and hiQ Labs had known about each other for years, even attending conferences together and often interacting with each other’s products and services.
In 2017, LinkedIn launched a new product, “Talent Insights,” that was arguably a competitor of hiQ Labs.
According to the court, here’s what happened next:
In May 2017, LinkedIn sent hiQ a cease-and-desist letter, asserting that hiQ was in violation of LinkedIn’s User Agreement and demanding that hiQ stop accessing and copying data from LinkedIn’s server. The letter stated that if hiQ accessed LinkedIn’s data in the future, it would be violating state and federal law, including the Computer Fraud and Abuse Act (“CFAA”), the Digital Millennium Copyright Act (“DMCA”), California Penal Code § 502(c), and the California common law of trespass. The letter further stated that LinkedIn had “implemented technical measures to prevent hiQ from accessing, and assisting others to access, LinkedIn’s site, through systems that detect, monitor, and block scraping activity.”
HiQ Labs refused to comply. And, flipping the script of most web scraping cases, in which the company that gets scraped sues the web scraper for copying and accessing its data, hiQ Labs filed for a preliminary injunction against LinkedIn, “seeking injunctive relief based on California law and a declaratory judgment that LinkedIn could not lawfully invoke the CFAA, the DMCA, California Penal Code § 502(c), or the common law of trespass against it. HiQ also filed a request for a temporary restraining order, which the parties subsequently agreed to convert into a motion for a preliminary injunction.”
Much to the surprise of many, hiQ Labs was granted its preliminary injunction by the federal district court in Northern California. LinkedIn then appealed the ruling to the Ninth Circuit Court of Appeals (one of the thirteen federal appellate courts that rank just below the US Supreme Court in authority), because, at least on the surface, the ruling seemed to contradict prior law.
The Ninth Circuit again ruled in favor of hiQ Labs. According to the Ninth Circuit, the CFAA does not prohibit the scraping of publicly available data, unless such data is blocked from the public by some sort of general access-restriction barrier. Specifically, the court said:
[W]hen a computer network generally permits public access to its data, a user’s accessing that publicly available data will not constitute access without authorization under the CFAA. The data hiQ seeks to access is not owned by LinkedIn and has not been demarcated by LinkedIn as private using such an authorization system.
This is a big deal, because it seems to indicate that, at least under the CFAA, whatever a human can freely access, a scraper may lawfully access as well.
Also important in this case: the court decided that LinkedIn’s sending hiQ Labs a cease and desist letter and revoking access for hiQ Labs’ specific IP addresses was not sufficient to create liability under the CFAA. While neither the district court nor the appellate court specifically overturned the rulings in Power Ventures or 3Taps, the court made clear that revoking access to a website that is otherwise “open to all comers” is not sufficient to create liability under the CFAA. The Ninth Circuit, in rejecting that logic, expressed concern that if a company could revoke access to publicly available data and then invoke a criminal statute against someone who tried to access it anyway, website owners would be able to block access for “discriminatory, anticompetitive, or other improper reasons.”
That’s an important precedent.
This case is indeed a landmark victory for web scraping and the persons and businesses who engage in it. It significantly limits the criminal and civil liability for scraping, at least in those jurisdictions that follow its precedent. But if you’re inclined to read this case as a sort of “olly olly oxen free” for web scraping, please don’t. While the Court narrowed the scope of the CFAA—the most important federal law related to web scraping—it did not close the door for scraped web sites to pursue other legal claims. According to the Court, “[w]e note that entities that view themselves as victims of data scraping are not without resort, even if the CFAA does not apply; state law trespass to chattels claims may still be available. And other causes of action, such as copyright infringement, misappropriation, unjust enrichment, conversion, breach of contract, or breach of privacy, may also lie.”
d. Circuit Split
(Warning: This section is fairly technical)
Now that we’ve outlined this narrative of the CFAA’s history, here’s the bad news: Not all courts follow this narrative. Some cases have reached a totally different conclusion on the law.
Some courts adopt a “narrow” interpretation of the “exceeds authorized access” provision of the CFAA, and some courts adopt a “broad” approach.
The court in hiQ Labs outlined the narrow approach. According to the hiQ Labs opinion: “[t]he rule of lenity favors our narrow interpretation of the “without authorization” provision in the CFAA. The statutory prohibition on unauthorized access applies both to civil actions and to criminal prosecutions — indeed, “§ 1030 is primarily a criminal statute. Because we must interpret the statute consistently, whether we encounter its application in a criminal or noncriminal context, the rule of lenity applies.” As we explained in Nosal I, we therefore favor a narrow interpretation of the CFAA’s “without authorization” provision so as not to turn a criminal hacking statute into a “sweeping Internet-policing mandate.” (citations omitted).
There are thirteen federal circuit courts of appeals below the Supreme Court, each with authority over its own jurisdiction. When those courts disagree with one another, it’s called a “circuit split.” That’s what’s currently happening with the interpretation of the CFAA.
The Second, Fourth, Sixth, and DC Circuits have adopted the narrow approach to the “exceeds authorized access” provision of the CFAA, which is consistent with the Ninth Circuit’s approach in hiQ Labs v. LinkedIn Corp. The First, Fifth, Seventh, and Eleventh Circuits, by contrast, have adopted a broad approach to the same provision, which means they are more likely to find that a person’s conduct “exceeds authorized access” in more circumstances. In the Third and Tenth Circuits, district courts have adopted the narrow approach, but the circuit courts themselves have yet to provide definitive guidance. (Cloudpath Networks, Inc. v. SecureW2 BV 2016); (USG Insurance Services, Inc. v. Bacon 2016).
Courts that have adopted the broad approach seem more likely to reach conclusions that are not favorable to web scrapers and make decisions that would contradict the conclusions of hiQ Labs. For example, in Southwest Airlines v. Roundpipe, out of the Northern District of Texas in 2019 (the Fifth Circuit), the court decided that violating a terms-of-use agreement was sufficient to trigger the CFAA. According to the court: “Southwest’s complaint states plausible claims for breach of contract, for violations of both the Computer Fraud and Abuse Act (“CFAA”) and section 33.02 of the Texas Penal Code (“THACA”), and for unfair competition, trademark infringement, and dilution under the Lanham Act. Specifically, the court concludes that Southwest’s complaint states a plausible claim for breach of contract because Southwest’s complaint not only identifies the existence of a valid contract (Southwest’s use agreement) but it also explains how the defendants’ use of automated scraping tools breached the contract and caused damage to Southwest.” (citations omitted). District Courts in the Seventh Circuit also appear to reach conclusions that are contrary to the hiQ Labs opinion. (“CouponCabin alleges that it revoked PriceTrace’s access to the CouponCabin website, and that PriceTrace continued to access the website without authorization. This is sufficient to plausibly allege violation of § 1030(a)(2)(C), which prohibits mere unauthorized access.” CouponCabin, Inc. v. PriceTrace, LLC, 2019).
These decisions seem to flatly contradict the opinion in hiQ Labs, though it is worth noting that both cases were decided after the district court opinion in hiQ Labs but before the Ninth Circuit published its own. As such, we do not yet have a clear case of a federal court rejecting the Ninth Circuit’s logic in hiQ Labs.
What all of that means is that the Second, Third, Fourth, Sixth, Ninth, Tenth, and DC Circuits are likely to be more favorable venues for web scrapers to litigate in, while the First, Fifth, Seventh, and Eleventh Circuits are likely to be more favorable for businesses pursuing claims against web scrapers.
It is possible that the Supreme Court might act to resolve this circuit split soon. LinkedIn has appealed the decision in hiQ Labs, Inc. v. LinkedIn to the Supreme Court and one member of the court has asked hiQ Labs to provide a brief in response, which was filed only in the last couple of months. But as of yet, the Supremes have not resolved this dispute.
The key takeaway of this section is that, at least for now, where a scraping case is decided matters as much as anything else in determining its outcome.
5. Copyright Law and Web Scraping
The next-most commonly litigated issue in web scraping cases is copyright.
Copyright law is a broad enough topic to consume an entire legal career, so it is beyond the scope of this article to encapsulate the entire field in a few paragraphs. At a high level, copyright law protects certain creative works, including books, movies, music, and software. In web scraping disputes, it usually arises in connection with digital content.
Here, we’ll focus our discussion on two issues that are central to copyright law and web scraping. Namely, copyright’s treatment of 1) facts and 2) the fair use of copyrighted materials.
a. Facts

Facts, by themselves, cannot be copyrighted, at least in the United States. Certain creative arrangements of facts, however, can be, depending on exactly how creative the arrangement is.
Exactly where a given collection of facts falls on this continuum is a complicated question beyond the scope of this article.
Web scrapers are generally looking to collect large amounts of data, which are, in essence, just discrete facts. In theory, that should mean copyright law often does not apply to web scraping activities. But what people and companies do with that data affects whether courts see them as entitled to proprietary rights. The more innovative and distinctive the organization of the data, the more likely a court is to treat the taking and reproduction of that data as a violation of the law.
To provide one example from a recent influential copyright case:
Even a cursory comparison of the two segments suggests that the defendants’ work copied material from nearly every page of the copyrighted work. The defendants’ code includes nine of the eleven basic sections of Compulife’s code, arranged in almost exactly the same order. The defendants’ code even reproduces idiosyncratic elements of Compulife’s work, like treating New York as two separate jurisdictions—one for business and another for non-business—an element of the code that was obsolete by the time the defendants copied it.
Compulife Software Inc. v. Newman, Court of Appeals, 11th Circuit 2020.
In this case, the defendant scraped data that was ultimately just a collection of facts. According to some people’s intuitions, that might seem uncopyrightable and open to the public. But it was clear in this instance that the plaintiff had put time and effort into its organization of the data. In the eyes of the law, that unique organization matters.
Unique organization of data is often entitled to copyright protection. If you scrape data that is entitled to that protection, you may be engaging in copyright infringement.
b. Fair Use
The next key concept with web scraping and copyright is that of fair use.
Copyright law permits you to quote from an article you found on the internet; it does not permit you to plagiarize the entire article. Where the line falls between those two extremes is determined by the doctrine of “fair use.”
The Copyright Act provides:
[T]he fair use of a copyrighted work … for purposes such as criticism, comment, news reporting, teaching …, scholarship, or research, is not an infringement of copyright. In determining whether the use made of a work in any particular case is a fair use the factors to be considered shall include—
(1) the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes;
(2) the nature of the copyrighted work;
(3) the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and
(4) the effect of the use upon the potential market for or value of the copyrighted work.
Fox News Network, LLC v. TVEyes, Inc., 2nd Circuit 2018
Let’s look at a few cases to see how this plays out in real life.
c. Search Engine Case Law
In the relatively early days of the World Wide Web, courts had to assess how copyright law applied to search engines. Search engines are, in essence, high-powered web crawlers themselves, and they take and reproduce copyrighted works in some form.
If you Google Picasso’s Guernica or Frost’s “The Road Not Taken,” you’ll find a digital image of Picasso’s famous painting and the full text of Frost’s most famous poem. Given that both are protected works of art, courts had to decide what was and was not allowed for search engines that reproduce content covered by copyright.
In Kelly v. Arriba Soft Corp. (2003), the 9th Circuit led the way in deciding what was appropriate. The facts of that case were as follows:
The plaintiff, Leslie Kelly, is a professional photographer who has copyrighted many of his images of the American West. Some of these images are located on Kelly’s web site or other web sites with which Kelly has a license agreement. The defendant, Arriba Soft Corp., operates an internet search engine that displays its results in the form of small pictures rather than the more usual form of text. Arriba obtained its database of pictures by copying images from other web sites. By clicking on one of these small pictures, called “thumbnails,” the user can then view a large version of that same picture within the context of the Arriba web page.
Here, the court decided that Arriba Soft’s use of thumbnails in its search engine was fair use, because it was transformative and not a substitute for the original artwork. But it left open the question of whether a larger, higher-quality reproduction of his artwork would have been permissible.
More than a decade later, in Authors Guild v. Google, Inc. (2nd Circuit 2015), a court was again asked to assess whether digital copying was fair use. In that case, the plaintiffs, who were authors of published books, sued for copyright infringement. Through its “Library Project” and “Google Books” project, and acting without permission from the rights holders, Google scanned tens of millions of books, made digital copies of them, and established a publicly available search function over the results.
The Court concluded:
Google’s making of a digital copy to provide a search function is a transformative use, which augments publicly available knowledge by making available information about Plaintiffs’ books without providing the public with a substantial substitute for matter protected by the Plaintiffs’ copyright interests in the original works or derivatives of them.
Both of these courts focused heavily on the question of whether the products were competing with the original work and whether the search engine’s conduct was “transformative,” which is a question of whether the new use is substantially different from the original rights holder’s use.
In both of these cases, courts sided with the companies who copied the original work and then presented it to the public in a new and original format.
d. Competing Media Cases
Courts tend to be more skeptical of scraping activities when the scrapers create a product that can be used as a substitute for the original.
In Associated Press v. Meltwater US Holdings, Inc. in 2013, defendant Meltwater scraped news articles on the web and, among other things, provided excerpts of those stories, including many AP stories, in reports it sent to its subscribers. AP argued that this was a violation of its copyrights on those news stories.
Here, in deciding that Meltwater’s conduct was a violation of AP’s copyrights and not a fair use, the court heavily emphasized that the service Meltwater provided was a substitute for and not a complement to AP’s articles. There was a very low click-through rate to actual AP articles.
What’s more, “through its use of AP content and refusal to pay a license fee, Meltwater ha[d] obtained an unfair commercial advantage in the marketplace and directly harmed the creator of expressive content protected by the Copyright Act.”
A few years later, the 2nd Circuit reached a very similar conclusion in Fox News Network, LLC v. TVEyes, Inc., a case where the defendant copied Fox News content and converted it into a searchable product. The use was transformative, but it was also a substitute for the original, and that weighed heavily against the defendant when the court determined that it had violated Fox News’ copyrights when copying its content.
e. Compulife Software, Inc. v. Newman
The most important case for the law of copyright and web scraping is Compulife Software Inc. v. Newman. This case was decided by the 11th Circuit on May 20, 2020, so its reverberations in web scraping law are only starting to reveal themselves.
The judicial opinion in this case starts out with the following caveat:
“Warning: This gets pretty dense (and difficult) pretty quickly.”
Judges, particularly elite federal judges, aren’t known to be modest or shy about their intellectual abilities. When a federal judge starts off their judicial opinion with a comment like that, you know you’re diving into the deep end.
The facts of this case, at the highest level, are as follows:
Compulife and the defendants are direct competitors in a niche industry: generating life-insurance quotes. Compulife maintains a database of insurance-premium information—called the “Transformative Database”—to which it sells access. The Transformative Database is valuable because it contains up-to-date information on many life insurers’ premium-rate tables and thus allows for simultaneous comparison of rates from dozens of providers. Most of Compulife’s customers are insurance agents who buy access to the database so that they can more easily provide reliable cost estimates to prospective policy purchasers. Although the Transformative Database is based on publicly available information—namely, individual insurers’ rate tables—it can’t be replicated without a specialized method and formula known only within Compulife.
The defendants in this case operate a website called “BeyondQuotes” that is a competitor of Compulife’s service. At one point the owners of this site hired a hacker to take Compulife’s data for use in the defendants’ software. Compulife claimed that the defendants didn’t even bother to produce their own quotes but simply reproduced Compulife’s data. According to the court:
Natal used this scraping technique to create a partial copy of Compulife’s Transformative Database, extracting all the insurance-quote data pertaining to two zip codes—one in New York and another in Florida. That means the bot requested and saved all premium estimates for every possible combination of demographic data within those two zip codes, totaling more than 43 million quotes. Doing so naturally required hundreds of thousands of queries and would have required thousands of man-hours if performed by humans—but it took the bot only four days…. Compulife alleges that the defendants then used the scraped data as the basis for generating quotes on their own websites. The defendants don’t disagree, except to claim that they didn’t know the source of the scraped data but, rather, innocently purchased the data from a third party. Moses Newman testified, however, that he watched Natal collect the requested data in a manner consistent with a scraping attack. David Rutstein also testified that when the defendants instructed Natal to obtain insurance-quote information, they fully intended for her to “extract data” from an existing website.
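The court's numbers ("hundreds of thousands of queries," "more than 43 million quotes") follow directly from multiplying out the inputs to a quote form. Here is a back-of-the-envelope sketch in Python with entirely assumed field sizes; the opinion does not enumerate the actual fields in Compulife's interface, so every count below is hypothetical and chosen only to show how numbers of this magnitude arise.

```python
from math import prod

# Hypothetical field sizes for a life-insurance quote form. The opinion
# doesn't list the real fields, so these counts are illustrative assumptions.
fields = {
    "age": 68,           # e.g., ages 18-85
    "sex": 2,
    "smoker_status": 2,
    "health_class": 10,
    "face_amount": 20,   # coverage amounts offered
    "term_length": 8,    # term options in years
}

zip_codes = 2            # the two zip codes the bot targeted
insurers_per_query = 50  # "dozens of providers" per comparison (assumed)

# One query per combination of demographic inputs, per zip code.
queries = prod(fields.values()) * zip_codes      # 870,400 queries
quotes = queries * insurers_per_query            # 43,520,000 quote rows
print(queries, quotes)
```

With these assumed sizes, a full sweep takes roughly 870,000 queries (the court's "hundreds of thousands") and, at dozens of insurer rates returned per query, yields over 43 million quotes, which is consistent with the scale the court describes.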
In the original trial, the magistrate judge made the following conclusions.
Although the judge found, as an initial matter, that Compulife had a valid copyright in the text of its HTML source code and that its Transformative Database was a protectable trade secret, he ruled in favor of the defendants. In doing so, he held that Compulife hadn’t met its burden to prove—as it had to in order to make out a copyright-infringement claim—that the defendants’ copied code was “substantially similar” to its own and, further, that the defendants hadn’t misappropriated any trade secrets.
Compulife appealed to the Eleventh Circuit, which concluded that the magistrate judge had made several clear errors of law—and that the scrapers may indeed have engaged in illegal conduct in copying the database.
Specifically, the appeals court concluded that if a plaintiff whose content has been scraped can show that it has a valid copyright, the burden shifts to the web-scraping defendant to prove that its copying of the copyrighted material was permissible.
Although we haven’t previously done so, we now clarify that after an infringement plaintiff has demonstrated that he holds a valid copyright and that the defendant engaged in factual copying, the defendant bears the burden of proving—as part of the filtration analysis—that the elements he copied from a copyrighted work are unprotectable.
This is a landmark decision that is not good for web scrapers. In the law, burden-shifting is a big deal. Usually, the plaintiff bears the burden of demonstrating harm or a violation of the law and must provide evidence to prove its case. If it’s a 50-50 proposition, the plaintiff loses. If there is some evidence of misconduct but the evidence is inconclusive, the plaintiff loses. But with burden-shifting, all that goes out the window. With burden-shifting, in limited circumstances, courts assume that the defendant has engaged in wrongdoing even before the evidence is presented. That will prove to be a serious liability for web scrapers who scrape copyrighted materials.
Here the Eleventh Circuit is saying that as long as a plaintiff can show 1) that it has a valid copyright and 2) that the defendant copied its copyrighted material, the defendant automatically bears the burden of proving that the material it copied was unprotectable. That is not an easy thing to do.
As much as hiQ Labs, Inc. v. LinkedIn Corp. was an important decision in favor of web scrapers, limiting their liability under federal law, this case is nearly as damaging to companies that engage in web scraping, potentially expanding their liability under federal law in another context. Now, it behooves a web scraper to know and understand whether the websites it is scraping are, in fact, copyrighted.
And it behooves those who design software and create copyrightable databases that might be susceptible to scraping to register a copyright.
Because if a web scraper copies the data of a copyrighted website, the burden now belongs to the defendant to show that its conduct was permissible, and that’s a hard thing to prove. And even if the defendant can prove its case, it would likely need to go to trial to do so. That’s a very expensive endeavor.
6. Breach of contract
Most companies’ websites have a terms-of-use or a terms-of-service (“ToU”) agreement. And most of them, within their terms, prohibit the automated collection, scraping, and crawling of data without permission.
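As a practical aside, ToU terms live in prose, but many sites also publish scraping preferences in a machine-readable robots.txt file. Checking it is not a legal safe harbor and is no substitute for reading the ToU, but it costs almost nothing to do with Python's standard library (the robots.txt content below is invented for illustration):

```python
from urllib.robotparser import RobotFileParser

# An invented robots.txt, parsed from a string so this sketch runs
# offline. Against a real site you would call
# rp.set_url("https://example.com/robots.txt") and rp.read() instead.
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("my-bot", "https://example.com/public/page"))   # True
print(rp.can_fetch("my-bot", "https://example.com/private/data"))  # False
print(rp.crawl_delay("my-bot"))                                    # 10
```

Note that robots.txt reflects a site operator's stated preferences, not the law: a site can permit crawling in robots.txt and still prohibit scraping in its ToU, and vice versa.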
Those ToU agreements purport to be enforceable contracts. And they say that you can’t scrape their data. If that’s the case, then scraping their data breaches an enforceable contract. Intuitively, though, visiting a website or clicking a box feels different from signing a normal contract.
The question, then, is how courts view these ToU. To determine whether a ToU contract breach can support a legal claim, we need to assess whether ToU are, in fact, enforceable contracts.
a. Browsewrap and Clickwrap Agreements
Courts lump ToU agreements into two broad categories: browsewrap agreements and clickwrap agreements.
A browsewrap ToU is one where the agreement appears on the website somewhere, usually through a hyperlink on the bottom or the top of the page, and the user is presumed to have assented to the agreement just by using the website. There is no affirmative step where the user clicks or checks a box that they have agreed to the terms.
A clickwrap ToU is one where, at some point during the registration or use of the website, the user affirmatively clicks on a button or checks a box to acknowledge that they have agreed to the terms.
The reality is that, in most instances, few people read these agreements, regardless of how they encounter them. But because clickwrap agreements include an actual step whereby users affirmatively represent that they have read the agreement, courts, as a rule, tend to favor clickwrap agreements over browsewrap agreements when it comes to enforcing their terms.
A few courts in web-scraping cases have dismissed breach of contract claims against a web scraper because they deem browsewrap agreements to be unenforceable, absent evidence that the web scraper had actual or constructive knowledge about its terms. For example, in Cvent, Inc. v. Eventbrite, Inc., a federal court in the Eastern District of Virginia dismissed a breach of contract claim against a web scraper because the plaintiff’s ToU was a browsewrap agreement and the plaintiffs had not “pled sufficient facts to plausibly establish that defendants Eventbrite and Foley were on actual or constructive notice of the terms and conditions posted on Cvent’s website.” The court reached a similar conclusion in Alan Ross Machinery Corporation v. Machinio Corporation in Illinois in 2019.
Perhaps surprisingly, however, there are many cases in the context of web-scraping that have concluded that browsewrap agreements are, in fact, enforceable against a web scraper.
For example, in DHI Group, Inc. v. Kent, a court in the Southern District of Texas refused to dismiss a breach of contract claim against a company that scraped a website with a browsewrap agreement, because the scraper’s own website carried a similar agreement with similar terms that also prohibited scraping. The court said the defendants therefore knew or should have known that web scraping was not permitted on the plaintiff’s website. Courts reached similar conclusions in Snap-on Business Solutions, Inc. v. O’Neil & Associates, Inc. in Ohio, in 2010, and in CouponCabin v. Savings.com, Inc., in Indiana, in 2017.
Thus, while web scrapers likely have a better chance of defending a breach of contract claim with a browsewrap agreement than with a clickwrap agreement, depending on the facts and circumstances, as well as the jurisdiction in which the case is litigated, victory in such a case is by no means certain.
On the surface, it would seem that breach of contract may often be the easiest legal claim for plaintiffs to pursue against web scrapers. Most websites have ToU that prohibit web scraping. If the ToU is enforceable, then anyone who scrapes the website breaches the contract.
But relatively few web-scraping cases have gone to trial where the plaintiff relies exclusively or mostly on breach of contract as the basis for their claim.
Why is that?
A couple of reasons:
First, most ToU agreements have mandatory arbitration clauses, which force both parties to go to arbitration rather than trial. Most larger companies include these to avoid class-action lawsuits. But arbitration provisions cut both ways. They keep plaintiffs in class-action suits from bringing claims against companies, but they also limit bigger companies’ ability to pursue certain legal claims under their own agreements.
That means that if a plaintiff pursues a legal claim under a breach of contract theory for violation of a ToU, the defendant can force the parties into arbitration rather than litigation, which a Facebook or a LinkedIn might not want.
Another reason companies don’t pursue breach of contract claims against web scrapers is that, in many instances, there are no damages, or only speculative damages, associated with a breach of contract as a result of web scraping.
To recover money from a defendant in a breach of contract claim, you not only have to show that there was a breach of the contract, but that you lost money or otherwise suffered “damages” as a result of the breach.
In some breach-of-contract cases, this is easy to prove. For example, if you’re a supplier of widgets, and a company was contractually obligated to buy five million dollars’ worth of widgets from you on a certain date, if that company breached the contract to buy widgets from another company instead, without justification, and you were unable to sell the widgets elsewhere, it would likely be fairly easy to prove that you lost five million dollars as a result of their breach of contract.
But when someone scrapes data from your website, damages are likely much more difficult to prove. Perhaps, if the scraper is a competitor, you can show you lost business as a result of scraping and taking of data. This is why so many web-scraping cases (Cvent, hiQ Labs, Compulife, CouponCabin, DHI v. Kent) arise in the context of a competitor taking another competitor’s data.
But if the web scraper isn’t a competitor, often, companies that get scraped don’t lose anything as a result of the scraping. If that’s the case, then, even though the web scraper might have technically breached the ToU, the end result of litigation, if the plaintiff can’t show monetary damages, is that the plaintiff would recover nothing, or only nominal damages (such as $1), as a result of the breach. That’s almost certainly not worth the time, cost, hassle, and stress of litigation.
That said, there are a few notable counter-examples. For example, in Craigslist, Inc. v. Instamotor, Inc., in 2017, Craigslist recovered $31,052,314 in damages from a web scraper who took its data and used it in a blitz-marketing campaign. The primary basis for the claim was breach of contract.
Sometimes, when a company sends a cease and desist letter threatening breach of contract, it’s an idle threat. And sometimes, as Craigslist, Inc. v. Instamotor, Inc. shows, it’s not.
7. State Tort Laws and Web Scraping
When the Ninth Circuit in hiQ Labs concluded that the CFAA did not apply, it was careful not to foreclose other legal remedies against web scrapers. It said that “even if the CFAA does not apply, state law trespass to chattels claims may still be available. And other causes of action, such as copyright infringement, misappropriation, unjust enrichment, conversion, breach of contract, or breach of privacy, may also lie.”
Of these legal claims, trespass to chattels, misappropriation, and conversion all fall into the broad category of “state tort law” claims.
Tort law claims, broadly speaking, are civil legal claims that arise from some form of wrongdoing other than a breach of contract. Many of these laws date back centuries, sometimes going back to English law.
But these laws were drafted in abstract language that allows them to apply in a variety of circumstances beyond their original scope. And while they may not have been intended to deal with web scraping, if a lawyer can convince a judge that the elements of these archaic tort laws apply to modern web-scraping practices, a web scraper can be required to pay damages to compensate the plaintiff for that conduct.
Because there are 50 different state laws that may (or may not) apply to web scraping, I’m going to keep this section limited to a brief, high-level explanation of these laws and how they might apply to web scraping.
a. Trespass to Chattels
Trespass, as most people know, is when you enter someone’s property without permission. Trespass claims have appeared in web scraping cases, but when they do, they usually get dismissed, because there is no actual physical intrusion in web scraping.
More commonly, plaintiffs in web scraping cases allege something called “trespass to chattels.” Trespass to chattels happens when someone intentionally interferes with someone else’s personal property, and that conduct causes harm.
Although “seldom employed as a tort theory” in modern law, trespass to chattels has enjoyed a resurgence as a result of web scraping. See, e.g. eBay, Inc. v. Bidder’s Edge, Inc., 2000. To establish a claim, a plaintiff must show that a web scraper “intentionally and without authorization interfered with plaintiff’s possessory interest in the computer system” and that the conduct “proximately resulted in damage to plaintiff.” Id.
In the context of web scraping, trespass to chattels has been applied most often when a web scraper engages in queries or requests that burden the scraped web site’s servers, slow down the speed of a website, or otherwise causes disruption to a scraped website’s services.
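Because these claims turn on load and disruption, throttling is the obvious engineering mitigation. The sketch below is a generic pattern, not drawn from any case, and the half-second interval is arbitrary rather than legally safe; it simply enforces a minimum delay between requests:

```python
import time

class PoliteThrottle:
    """Enforce a minimum delay between successive requests.

    Trespass-to-chattels claims (e.g., eBay v. Bidder's Edge) have
    focused on scrapers that burden or disrupt a site's servers;
    pacing requests reduces exactly that kind of load.
    """

    def __init__(self, min_interval: float = 1.0):
        self.min_interval = min_interval
        self._last = float("-inf")  # first call never sleeps

    def wait(self) -> None:
        # Sleep just long enough that min_interval has elapsed
        # since the previous call.
        remaining = self.min_interval - (time.monotonic() - self._last)
        if remaining > 0:
            time.sleep(remaining)
        self._last = time.monotonic()

throttle = PoliteThrottle(min_interval=0.5)
start = time.monotonic()
for _ in range(3):
    throttle.wait()
    # a real scraper would fetch one page here
elapsed = time.monotonic() - start
print(f"3 throttled requests took {elapsed:.1f} seconds")
```

Rate-limiting does not immunize a scraper, but the cases in this section suggest that trespass-to-chattels claims are strongest where the scraping measurably degrades the target's service.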
b. Conversion

Conversion is similar to trespass to chattels: “[t]he difference is that conversion entails a more serious deprivation of the owner’s rights such that an award of the full value of the property is appropriate.” QVC, Inc. v. Resultly, 2016.
Because the standard to prove conversion is higher than that for trespass to chattels, these claims rarely succeed. That said, there have been examples of plaintiffs winning cases at trial on claims of conversion. (In re Roselli, Bankr. Court, WD North Carolina 2013, where the jury found in favor of plaintiffs on their UDTPA claim and common law conversion claim and ultimately awarded plaintiffs $4,155,793.00.)
c. Misappropriation of Trade Secrets
Misappropriation of trade secrets is legal jargon for stealing another company’s prized business secrets. What constitutes a “trade secret” varies by state and is often difficult to pin down. But when a competitor scrapes another company’s website, courts often deem that scraping a theft of trade secrets.
This is another situation where the court’s opinion in Compulife Software, Inc. v. Newman, 2020 is illustrative.
The magistrate judge was correct to conclude that the scraped quotes were not individually protectable trade secrets because each is readily available to the public—but that doesn’t in and of itself resolve the question whether, in effect, the database as a whole was misappropriated. Even if quotes aren’t trade secrets, taking enough of them must amount to misappropriation of the underlying secret at some point. Otherwise, there would be no substance to trade-secret protections for “compilations,” which the law clearly provides. (citations omitted)
Compulife Software, Inc. v. Newman, 2020
To determine whether scraping constitutes the theft of a trade secret, the court said that the “truly determinative questions” are 1) “whether the block of data that the defendants took was large enough to constitute appropriation” and 2) “whether the means they employed were improper.”
Thus, the applicability of this law to web scraping is largely a question of scope and conduct. Scrape a little and you’re probably OK. Scrape a whole lot of your competitor’s website and you might not be. Additionally, conduct such as using a fake email address or collecting information under false pretenses is more likely to lead to a theft-of-trade-secrets claim.
d. Tortious Interference with a Contract or Prospective Business Relationship
The exact elements of this legal claim vary from state to state, but generally, to pursue this legal claim, someone must show:
(1) [t]he existence of a valid business relationship or expectancy, (2) knowledge by the interferer of the relationship or expectancy, (3) an unjustified intentional act of interference on the part of the interferer, (4) proof that the interference caused the harm sustained, and (5) damage to the party whose relationship or expectancy was disrupted.
Infogroup, Inc. v. Database LLC, 2015.
This usually comes up in the context of one competitor scraping another’s data. Suppose a business creates a database that customers pay $100 a month to access, and a competitor scrapes that data and offers the same or a similar service for $50 a month. If the business that got scraped can show that the scraping resulted in a loss of revenue and that the scraping conduct was inappropriate or impermissible, this could support a claim for tortious interference with a contract or business relationship.
8. Unjust Enrichment
One of the hardest laws to predict is unjust enrichment. It does not stem from contract law or tort law.
Basically, when one business profits from the work of another, a court thinks it seems unfair, and no other law applies, the court will sometimes call that unjust enrichment.
Businesses put time and labor into collecting and compiling data. It usually takes much less time to scrape that data. States vary wildly in how they apply unjust enrichment, but in states that do so liberally, web scraping companies may be susceptible to unjust enrichment claims. Again, your risk of liability varies not just depending on the conduct and the context, but also based on the venue and jurisdiction where the dispute is resolved.
9. The Present and Future of Web Scraping Law
There are a few areas of law that I expect to occupy a more prominent part of web-scraping jurisprudence in the coming years. I think there’s some value in practitioners keeping these top-of-mind as courts work through these issues.
a. Privacy laws
Given how much privacy concerns are in the news and how often web scrapers collect personally identifiable information, it is perhaps surprising that privacy laws have not frequently been the subject of web-scraping litigation.
While some plaintiffs have pursued claims against web scrapers under the Electronic Communications Privacy Act (“ECPA”) and California’s Invasion of Privacy Act (see, e.g., Cooper v. Slice Technologies, Inc., 2017), this has been rare.
Privacy issues were discussed in the legal opinions of hiQ Labs and Facebook, Inc. v. Power Ventures, Inc., but in both cases, it was a side issue.
That’s not really something that’s been dealt with fully by the judicial system yet. But I’d be very surprised if it wasn’t something that was litigated repeatedly over the next decade.
While beyond the scope of this article, it is worth noting that the EU has much more stringent laws governing the scraping of personally identifying information, and it has imposed fines on web scrapers who violate those laws.
b. Antitrust Issues
This is another issue that was addressed, but not fully resolved, in the hiQ Labs v. LinkedIn Corp. case. As anyone who works in the world of data knows well, a few select, prominent, and highly profitable companies own and/or control nearly all the data. Facebook, LinkedIn, Netflix, Alphabet, Inc. (Google), Microsoft, Apple—they have access to treasure troves of data that many other companies could use to profitable and useful ends—if they had access to it. The same is often true at a lesser scale in niche industries. (See, e.g., In re Dealer Mgmt. Systems Antitrust Litigation, 2018.) Five years ago, Facebook’s data was Facebook’s data, and anyone arguing to the contrary would have had no success in court. But I think the tide is turning on these issues.
According to the Court in hiQ Labs v. LinkedIn Corp.:
Although there are significant public interests on both sides, the district court properly determined that, on balance, the public interest favors hiQ’s position. We agree with the district court that giving companies like LinkedIn free rein to decide, on any basis, who can collect and use data—data that the companies do not own, that they otherwise make publicly available to viewers, and that the companies themselves collect and use—risks the possible creation of information monopolies that would disserve the public interest.
Now, the Court never says the word “antitrust” in its opinion. But it certainly seems like antitrust issues with data control lie at the center of its reasoning when it decided this case.
This is something that, again, I expect to see much more in the coming years.
The Ninth Circuit revisited antitrust issues with hiQ Labs in 2020. You can read more about those developments in an article I wrote at Professor Eric Goldman’s blog.
Traditionally, courts have been very reluctant to consider novel antitrust arguments in resolving commercial disputes. But now, any attorney defending a web scraping case against an industry leader in the next decade would be remiss if they didn’t consider an antitrust argument in their defense.
c. First Amendment Issues
Another legal issue that’s becoming more prominent in the context of web scraping cases is First Amendment law.
Most notably, in 2020, in Sandvig v. Barr, a group of academic researchers brought a “pre-enforcement challenge” to argue that the CFAA chills their right to free speech.
The basic facts are as follows:
Plaintiffs are academic researchers who intend to test whether employment websites discriminate based on race and gender. In order to do so, they plan to provide false information to target websites, in violation of these websites’ terms of service. Plaintiffs bring a pre-enforcement challenge, alleging that the Computer Fraud and Abuse Act (“CFAA”), 18 U.S.C. § 1030, as applied to their intended conduct of violating websites’ terms of service, chills their First Amendment right to free speech. Without reaching this constitutional question, the Court concludes that the CFAA does not criminalize mere terms-of-service violations on consumer websites and, thus, that plaintiffs’ proposed research plans are not criminal under the CFAA. The Court will therefore deny the parties’ cross-motions for summary judgment and dismiss the case as moot.
Sandvig v. Barr, D.D.C. 2020
The court tiptoed around the issue of whether the language of the CFAA, namely, language that arguably prohibited web scraping, limited the plaintiffs’ First Amendment rights. Ultimately, it decided that the CFAA didn’t apply. But if it had applied, there is a chance the court would have struck down part of the law on that basis. Of course, we’ll never know, but it is worth noting that Sandvig v. Barr was not the first case where this argument was presented. (See, e.g., EF Cultural Travel BV v. Zefer Corp. 2003; hiQ Labs, Inc. v. LinkedIn Corp.).
This is an issue that academics, scholars, and journalists are trying to probe in the judicial system. Laws that prohibit the most effective way to collect and process large amounts of data do potentially inhibit free speech. As such, we can expect these attempts to explore the boundaries of free speech, First Amendment law, and web scraping to continue.
d. Biometric Data and Web Scraping
This is an issue where we do not yet have clear guidance.
It is currently working its way through the legal system because the company Clearview AI is embroiled in a variety of class-action lawsuits where proceedings are still ongoing. (See Calderon v. Clearview AI, Inc., 2020; Mutnick v. Clearview AI, Inc., 2020.)
These class actions allege violations of the Illinois Biometric Information Privacy Act (“BIPA”), as well as federal constitutional claims under 42 U.S.C. § 1983 (on the theory that Clearview AI qualifies as a state actor).
These cases stem from:
Clearview’s conduct in: (a) allegedly scraping billions of facial images from the Internet; (b) performing facial scans of those images; and (c) creating a biometric database that allows users of the database to immediately identify a member of the public merely by uploading a person’s image to the database.
Again, there is no federal law dealing with the collection of biometric data. That is why these cases are coming forward mostly in Illinois, which does have such a law, or elsewhere alleging violations of Illinois law.
I will not predict how these cases will unfold. But it is worth flagging that class actions have been filed in federal court raising issues with scraping biometric data, and the legality of such conduct is very much in doubt.
10. Summary and Conclusion
Yuval Noah Harari, in his book “21 Lessons for the 21st Century,” wrote:
So we had better call upon our lawyers, politicians, philosophers, and even poets to turn their attention to this conundrum: how do you regulate the ownership of data? This may well be the most important political question of our era. If we cannot answer this question soon, our sociopolitical system might collapse. People are already sensing the coming cataclysm. Perhaps this is why citizens all over the world are losing faith in the liberal story, which just a decade ago seemed irresistible.
Web scraping is just one piece of the puzzle when it comes to determining how we regulate the ownership of data. But it is an important piece of that puzzle. If courts adopt a liberal perspective in allowing companies to access and use publicly available data on privately owned websites, that levels the playing field for startups and up-and-comers to mine that data for innovation. But unfettered openness can also lead to fraud, theft, and privacy violations. On the other hand, if courts broadly restrict access to publicly available data on privately owned websites, that’s a huge victory for incumbents, who already have outsized influence in the American and world economies.
I believe that courts have already begun to develop an increasingly reasonable framework to deal with these issues. I have tried my best to summarize and explain that framework here. And, with time and with the dedication of well-intentioned attorneys and judges, my hope is that courts can provide practitioners with better, clearer, and simpler guidance in the years to come.