Setup Firecrawl with Docker and Traefik

Tim Kleyersburg on December 8, 2024 · 4 minutes to read

#What is Firecrawl?

“Turn websites into LLM-ready data. Power your AI apps with clean data crawled from any website. It’s also open-source.”

Source: Firecrawl

Firecrawl is a project that lets you scrape and crawl websites and get back well-formatted data. It also handles JavaScript rendering, which is a common obstacle when scraping websites. And the best part: it’s open source, so you can host it yourself.

#Installing Firecrawl with Docker

Firecrawl provides everything you need to get started quickly. You may use the docker-compose.yaml file provided in their GitHub repository.

Make sure to clone the whole repository as the docker-compose.yaml file references other files in the repository.

Copy the .env.example file from the apps/api directory into the directory that contains the docker-compose.yaml file and rename it to .env. If you have no reason to change anything, the default values work fine.
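The copy step above can be scripted. This is a minimal sketch, assuming the repository has already been cloned into a local directory; the function name copy_env_example is hypothetical and not part of Firecrawl. It refuses to overwrite an existing .env so that later edits are not lost.

```shell
# Hypothetical helper (not part of Firecrawl): copy apps/api/.env.example
# to .env next to docker-compose.yaml, without overwriting an existing .env.
copy_env_example() {
  repo_dir="${1:-.}"   # path to the cloned Firecrawl repository
  if [ -e "$repo_dir/.env" ]; then
    echo ".env already exists in $repo_dir, leaving it untouched"
    return 0
  fi
  cp "$repo_dir/apps/api/.env.example" "$repo_dir/.env"
}
```

Call it with the path to your checkout, for example copy_env_example ./firecrawl, or without an argument from inside the repository.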

#Integrating with Traefik

If, like me, you are using Traefik as a reverse proxy, use the following example configuration to configure Traefik with labels.

This assumes you are using an external network called web to route all your traffic through Traefik.
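If the web network does not exist yet, it has to be created before the stack can start. A minimal sketch, assuming the docker CLI is installed and can reach the daemon; the function name ensure_network is made up for this example.

```shell
# Hypothetical helper: create the external network used by Traefik
# only if it does not exist yet. Defaults to the name "web".
ensure_network() {
  name="${1:-web}"
  if docker network inspect "$name" >/dev/null 2>&1; then
    echo "network $name already exists"
  else
    docker network create "$name"
  fi
}
```

Running ensure_network web once on the server is enough; the check makes it safe to run again.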

I’ve used the headerauthentication plugin by omar-shrbajy-arive. This plugin allows you to define a specific header key and value that must be present in the request to access Firecrawl. You can also use it without authentication, but I prefer and recommend not to leave a web scraper open to the internet for anyone to use.

To use this plugin, add the following configuration to your Traefik static configuration file (traefik.toml).

[experimental.plugins.headerauthentication]
  moduleName = "github.com/omar-shrbajy-arive/headerauthentication"
  version = "v1.0.3"

Now you can use the following docker-compose.yaml file to set up Firecrawl with Traefik. It is the same as the one provided by Firecrawl, but with the Traefik labels and the web network added.

name: firecrawl

x-common-service: &common-service
  build: apps/api
  networks:
    - backend
  extra_hosts:
    - "host.docker.internal:host-gateway"

services:
  playwright-service:
    build: apps/playwright-service
    environment:
      - PORT=3000
      - PROXY_SERVER=${PROXY_SERVER}
      - PROXY_USERNAME=${PROXY_USERNAME}
      - PROXY_PASSWORD=${PROXY_PASSWORD}
      - BLOCK_MEDIA=${BLOCK_MEDIA}
    networks:
      - backend

  api:
    <<: *common-service
    labels:
      # Require an "Authorization: Bearer <token>" header on every request
      - "traefik.http.middlewares.firecrawl.plugin.headerauthentication.header.name=Authorization"
      - "traefik.http.middlewares.firecrawl.plugin.headerauthentication.header.key=Bearer ${BEARER_TOKEN}"
      - "traefik.http.routers.firecrawl.rule=Host(`firecrawl.your-domain.com`)"
      - "traefik.http.routers.firecrawl.tls=true"
      - "traefik.http.routers.firecrawl.tls.certresolver=lets-encrypt"
      - "traefik.http.routers.firecrawl.middlewares=firecrawl"
      # The container is attached to two networks; tell Traefik to use "web"
      - "traefik.docker.network=web"
    environment:
      REDIS_URL: ${REDIS_URL:-redis://redis:6379}
      REDIS_RATE_LIMIT_URL: ${REDIS_URL:-redis://redis:6379}
      PLAYWRIGHT_MICROSERVICE_URL: ${PLAYWRIGHT_MICROSERVICE_URL:-http://playwright-service:3000}
      USE_DB_AUTHENTICATION: ${USE_DB_AUTHENTICATION}
      PORT: ${PORT:-3002}
      NUM_WORKERS_PER_QUEUE: ${NUM_WORKERS_PER_QUEUE}
      OPENAI_API_KEY: ${OPENAI_API_KEY}
      OPENAI_BASE_URL: ${OPENAI_BASE_URL}
      MODEL_NAME: ${MODEL_NAME:-gpt-4o}
      SLACK_WEBHOOK_URL: ${SLACK_WEBHOOK_URL}
      LLAMAPARSE_API_KEY: ${LLAMAPARSE_API_KEY}
      LOGTAIL_KEY: ${LOGTAIL_KEY}
      BULL_AUTH_KEY: ${BULL_AUTH_KEY}
      TEST_API_KEY: ${TEST_API_KEY}
      POSTHOG_API_KEY: ${POSTHOG_API_KEY}
      POSTHOG_HOST: ${POSTHOG_HOST}
      SUPABASE_ANON_TOKEN: ${SUPABASE_ANON_TOKEN}
      SUPABASE_URL: ${SUPABASE_URL}
      SUPABASE_SERVICE_TOKEN: ${SUPABASE_SERVICE_TOKEN}
      SCRAPING_BEE_API_KEY: ${SCRAPING_BEE_API_KEY}
      HOST: ${HOST:-0.0.0.0}
      SELF_HOSTED_WEBHOOK_URL: ${SELF_HOSTED_WEBHOOK_URL}
      LOGGING_LEVEL: ${LOGGING_LEVEL}
      FLY_PROCESS_GROUP: app
    depends_on:
      - redis
      - playwright-service
    # This also publishes the API directly on the host, bypassing Traefik.
    # Remove the mapping if all traffic should go through the proxy.
    ports:
      - "3002:3002"
    command: [ "pnpm", "run", "start:production" ]
    # This list replaces the one merged in from common-service, so "backend"
    # must be repeated here for the API to reach redis and playwright-service.
    networks:
      - backend
      - web

  worker:
    <<: *common-service
    environment:
      REDIS_URL: ${REDIS_URL:-redis://redis:6379}
      REDIS_RATE_LIMIT_URL: ${REDIS_URL:-redis://redis:6379}
      PLAYWRIGHT_MICROSERVICE_URL: ${PLAYWRIGHT_MICROSERVICE_URL:-http://playwright-service:3000}
      USE_DB_AUTHENTICATION: ${USE_DB_AUTHENTICATION}
      PORT: ${PORT:-3002}
      NUM_WORKERS_PER_QUEUE: ${NUM_WORKERS_PER_QUEUE}
      OPENAI_API_KEY: ${OPENAI_API_KEY}
      OPENAI_BASE_URL: ${OPENAI_BASE_URL}
      MODEL_NAME: ${MODEL_NAME:-gpt-4o}
      SLACK_WEBHOOK_URL: ${SLACK_WEBHOOK_URL}
      LLAMAPARSE_API_KEY: ${LLAMAPARSE_API_KEY}
      LOGTAIL_KEY: ${LOGTAIL_KEY}
      BULL_AUTH_KEY: ${BULL_AUTH_KEY}
      TEST_API_KEY: ${TEST_API_KEY}
      POSTHOG_API_KEY: ${POSTHOG_API_KEY}
      POSTHOG_HOST: ${POSTHOG_HOST}
      SUPABASE_ANON_TOKEN: ${SUPABASE_ANON_TOKEN}
      SUPABASE_URL: ${SUPABASE_URL}
      SUPABASE_SERVICE_TOKEN: ${SUPABASE_SERVICE_TOKEN}
      SCRAPING_BEE_API_KEY: ${SCRAPING_BEE_API_KEY}
      HOST: ${HOST:-0.0.0.0}
      SELF_HOSTED_WEBHOOK_URL: ${SELF_HOSTED_WEBHOOK_URL}
      LOGGING_LEVEL: ${LOGGING_LEVEL}
      FLY_PROCESS_GROUP: worker
    depends_on:
      - redis
      - playwright-service
      - api
    command: [ "pnpm", "run", "workers" ]

  redis:
    image: redis:alpine
    networks:
      - backend
    command: redis-server --bind 0.0.0.0

networks:
  backend:
    driver: bridge
  web:
    external: true

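Add the bearer token to the .env file next to your docker-compose.yaml file (create the file if it does not exist yet). Docker Compose reads this file to fill in ${BEARER_TOKEN} in the labels. For example: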

BEARER_TOKEN=your-token
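Since the token is the only thing protecting the instance, it is worth using a long random value instead of a hand-typed one. A minimal sketch, assuming a Unix-like system with /dev/urandom; the function name generate_token is made up for this example.

```shell
# Hypothetical helper: print a random 64-character hex token
# built from 32 bytes of /dev/urandom.
generate_token() {
  head -c 32 /dev/urandom | od -An -tx1 | tr -d ' \n'
}

# Usage (run in the directory that contains docker-compose.yaml):
#   echo "BEARER_TOKEN=$(generate_token)" >> .env
```

The usage line appends to .env rather than overwriting it, so the values copied from .env.example stay in place.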

#Conclusion

You should now have a fully functional Firecrawl instance running on your server. Make sure to point your domain to the server.
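To check that both the routing and the header authentication work, you can send a request through Traefik. This is a sketch under a few assumptions: the host name is the placeholder from the router rule, the token is the value of BEARER_TOKEN from your .env file, and the instance exposes Firecrawl's /v1/scrape endpoint. The function name scrape is made up for this example.

```shell
# Host name from the Traefik router rule; override it via the environment.
FIRECRAWL_HOST="${FIRECRAWL_HOST:-firecrawl.your-domain.com}"

# Hypothetical helper: ask Firecrawl to scrape a single page.
#   $1: bearer token (the BEARER_TOKEN value from .env)
#   $2: URL of the page to scrape
scrape() {
  curl --silent --show-error --fail \
    --request POST "https://$FIRECRAWL_HOST/v1/scrape" \
    --header "Authorization: Bearer $1" \
    --header "Content-Type: application/json" \
    --data "{\"url\": \"$2\"}"
}
```

After starting the stack with docker compose up -d, a call like scrape your-token https://example.com should return a JSON response. The same request without the Authorization header should be rejected by the middleware.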


I hope you found this article useful! 😊.