Turn websites into LLM-ready data
Power your AI apps with clean data crawled from any website. It’s also open-source.
Firecrawl is a project that lets you scrape and crawl websites and get well-formatted data back. It also handles JavaScript rendering, which is a common obstacle when scraping websites. And the best part: it’s open source and you can host it yourself.
Firecrawl provides everything you need to get started quickly. You may use the docker-compose.yaml file provided in their GitHub repository.
Make sure to clone the whole repository, as the docker-compose.yaml file references other files in it.
Copy the .env.example file from the directory apps/api into the directory that contains the docker-compose.yaml file and rename it to .env. Unless you have a reason to change something, the default values work fine.
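The copy step can also be scripted. The following is a minimal Python sketch, assuming it is pointed at the root of the cloned repository. The helper name `prepare_env` is made up for illustration, and the choice to leave an existing .env untouched is mine, so that custom values are not overwritten.

```python
import shutil
from pathlib import Path


def prepare_env(repo_root: str) -> Path:
    """Copy apps/api/.env.example to <repo_root>/.env unless .env already exists."""
    root = Path(repo_root)
    source = root / "apps" / "api" / ".env.example"
    target = root / ".env"  # same directory as docker-compose.yaml
    if not target.exists():
        # Only create the file on the first run; never overwrite edited values.
        shutil.copyfile(source, target)
    return target
```

Calling `prepare_env(".")` from inside the cloned repository would create the .env file next to docker-compose.yaml.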
If, like me, you are using Traefik as a reverse proxy, use the following example configuration to configure Traefik with labels.
This assumes you are using an external network called web to route all your traffic through Traefik.
I’ve used the headerauthentication plugin by omar-shrbajy-arive. This plugin allows you to define a specific header key and value that must be present in the request to access Firecrawl. You can also use it without authentication, but I prefer and recommend not to leave a web scraper open to the internet for anyone to use.
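To make the idea concrete, here is a conceptual Python sketch of what header-based authentication boils down to. It is not the plugin's actual code (Traefik plugins are written in Go); the function name `is_authorized` and the constant-time comparison are my own assumptions for illustration.

```python
import hmac


def is_authorized(headers: dict[str, str], name: str, expected: str) -> bool:
    """Return True only if header `name` is present and equals `expected`."""
    # HTTP header names are case-insensitive, so normalize before the lookup.
    normalized = {key.lower(): value for key, value in headers.items()}
    supplied = normalized.get(name.lower())
    if supplied is None:
        return False
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(supplied.encode(), expected.encode())
```

A request carrying `Authorization: Bearer your-token` would pass this check when the expected value is `Bearer your-token`; a request with a missing or different header would be rejected before it reaches Firecrawl.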
To use this plugin, add the following configuration to your Traefik static configuration file (traefik.toml).
```toml
[experimental.plugins.headerauthentication]
  moduleName = "github.com/omar-shrbajy-arive/headerauthentication"
  version = "v1.0.3"
```
Now you can use the following docker-compose.yaml file to set up Firecrawl with Traefik. It is the same as the one provided by Firecrawl, but with the Traefik labels and the web network added.
```yaml
name: firecrawl

x-common-service: &common-service
  build: apps/api
  networks:
    - backend
  extra_hosts:
    - "host.docker.internal:host-gateway"

services:
  playwright-service:
    build: apps/playwright-service
    environment:
      - PORT=3000
      - PROXY_SERVER=${PROXY_SERVER}
      - PROXY_USERNAME=${PROXY_USERNAME}
      - PROXY_PASSWORD=${PROXY_PASSWORD}
      - BLOCK_MEDIA=${BLOCK_MEDIA}
    networks:
      - backend

  api:
    <<: *common-service
    labels:
      # Added: tell Traefik which of the two networks to use to reach this container.
      - "traefik.docker.network=web"
      - "traefik.http.middlewares.firecrawl.plugin.headerauthentication.header.name=Authorization"
      - "traefik.http.middlewares.firecrawl.plugin.headerauthentication.header.key=Bearer ${BEARER_TOKEN}"
      - "traefik.http.routers.firecrawl.rule=Host(`firecrawl.your-domain.com`)"
      - "traefik.http.routers.firecrawl.tls=true"
      - "traefik.http.routers.firecrawl.tls.certresolver=lets-encrypt"
      - "traefik.http.routers.firecrawl.middlewares=firecrawl"
    environment:
      REDIS_URL: ${REDIS_URL:-redis://redis:6379}
      REDIS_RATE_LIMIT_URL: ${REDIS_URL:-redis://redis:6379}
      PLAYWRIGHT_MICROSERVICE_URL: ${PLAYWRIGHT_MICROSERVICE_URL:-http://playwright-service:3000}
      USE_DB_AUTHENTICATION: ${USE_DB_AUTHENTICATION}
      PORT: ${PORT:-3002}
      NUM_WORKERS_PER_QUEUE: ${NUM_WORKERS_PER_QUEUE}
      OPENAI_API_KEY: ${OPENAI_API_KEY}
      OPENAI_BASE_URL: ${OPENAI_BASE_URL}
      MODEL_NAME: ${MODEL_NAME:-gpt-4o}
      SLACK_WEBHOOK_URL: ${SLACK_WEBHOOK_URL}
      LLAMAPARSE_API_KEY: ${LLAMAPARSE_API_KEY}
      LOGTAIL_KEY: ${LOGTAIL_KEY}
      BULL_AUTH_KEY: ${BULL_AUTH_KEY}
      TEST_API_KEY: ${TEST_API_KEY}
      POSTHOG_API_KEY: ${POSTHOG_API_KEY}
      POSTHOG_HOST: ${POSTHOG_HOST}
      SUPABASE_ANON_TOKEN: ${SUPABASE_ANON_TOKEN}
      SUPABASE_URL: ${SUPABASE_URL}
      SUPABASE_SERVICE_TOKEN: ${SUPABASE_SERVICE_TOKEN}
      SCRAPING_BEE_API_KEY: ${SCRAPING_BEE_API_KEY}
      HOST: ${HOST:-0.0.0.0}
      SELF_HOSTED_WEBHOOK_URL: ${SELF_HOSTED_WEBHOOK_URL}
      LOGGING_LEVEL: ${LOGGING_LEVEL}
      FLY_PROCESS_GROUP: app
    depends_on:
      - redis
      - playwright-service
    # Note: this also makes the API reachable on host port 3002 without going
    # through Traefik; remove it if all traffic should pass the middleware.
    ports:
      - "3002:3002"
    command: [ "pnpm", "run", "start:production" ]
    # This key replaces the networks list merged from common-service, so
    # backend must be listed again for the API to reach redis and playwright-service.
    networks:
      - backend
      - web

  worker:
    <<: *common-service
    environment:
      REDIS_URL: ${REDIS_URL:-redis://redis:6379}
      REDIS_RATE_LIMIT_URL: ${REDIS_URL:-redis://redis:6379}
      PLAYWRIGHT_MICROSERVICE_URL: ${PLAYWRIGHT_MICROSERVICE_URL:-http://playwright-service:3000}
      USE_DB_AUTHENTICATION: ${USE_DB_AUTHENTICATION}
      PORT: ${PORT:-3002}
      NUM_WORKERS_PER_QUEUE: ${NUM_WORKERS_PER_QUEUE}
      OPENAI_API_KEY: ${OPENAI_API_KEY}
      OPENAI_BASE_URL: ${OPENAI_BASE_URL}
      MODEL_NAME: ${MODEL_NAME:-gpt-4o}
      SLACK_WEBHOOK_URL: ${SLACK_WEBHOOK_URL}
      LLAMAPARSE_API_KEY: ${LLAMAPARSE_API_KEY}
      LOGTAIL_KEY: ${LOGTAIL_KEY}
      BULL_AUTH_KEY: ${BULL_AUTH_KEY}
      TEST_API_KEY: ${TEST_API_KEY}
      POSTHOG_API_KEY: ${POSTHOG_API_KEY}
      POSTHOG_HOST: ${POSTHOG_HOST}
      SUPABASE_ANON_TOKEN: ${SUPABASE_ANON_TOKEN}
      SUPABASE_URL: ${SUPABASE_URL}
      SUPABASE_SERVICE_TOKEN: ${SUPABASE_SERVICE_TOKEN}
      SCRAPING_BEE_API_KEY: ${SCRAPING_BEE_API_KEY}
      HOST: ${HOST:-0.0.0.0}
      SELF_HOSTED_WEBHOOK_URL: ${SELF_HOSTED_WEBHOOK_URL}
      LOGGING_LEVEL: ${LOGGING_LEVEL}
      FLY_PROCESS_GROUP: worker
    depends_on:
      - redis
      - playwright-service
      - api
    command: [ "pnpm", "run", "workers" ]

  redis:
    image: redis:alpine
    networks:
      - backend
    command: redis-server --bind 0.0.0.0

networks:
  backend:
    driver: bridge
  web:
    external: true
```
Add the bearer token to your .env file so that Docker Compose can substitute ${BEARER_TOKEN} in the labels. For example:
```
BEARER_TOKEN=your-token
```
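Rather than picking the token by hand, you could generate a random one. This is a minimal sketch using Python's standard `secrets` module; the function name `generate_bearer_token` and the 32-byte default are my own choices, not something Firecrawl or the plugin prescribes.

```python
import secrets


def generate_bearer_token(num_bytes: int = 32) -> str:
    """Return a random, URL-safe token suitable for an .env value."""
    # token_urlsafe only uses letters, digits, "-" and "_", so the value
    # needs no quoting or escaping in the .env file.
    return secrets.token_urlsafe(num_bytes)


# Print a line that can be pasted into the .env file.
print(f"BEARER_TOKEN={generate_bearer_token()}")
```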
You should now have a fully functional Firecrawl instance running on your server. Make sure to point your domain to the server.
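To check that the instance answers behind Traefik, you can send it a scrape request that carries the bearer token. The sketch below uses only the Python standard library. It assumes the API exposes a `/v1/scrape` endpoint that accepts a JSON body with a `url` field; depending on the Firecrawl version you run, the path may differ. The helper names, the domain, and the token are placeholders.

```python
import json
import urllib.request


def build_scrape_request(base_url: str, token: str, target_url: str) -> urllib.request.Request:
    """Build a POST request to the scrape endpoint with the bearer header set."""
    body = json.dumps({"url": target_url}).encode("utf-8")
    return urllib.request.Request(
        # Assumption: the endpoint path is /v1/scrape on your Firecrawl version.
        url=f"{base_url.rstrip('/')}/v1/scrape",
        data=body,
        headers={
            # Must match the value configured in the Traefik middleware label.
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 60.0) -> dict:
    """Send the request and return the parsed JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling `send(build_scrape_request("https://firecrawl.your-domain.com", "your-token", "https://example.com"))` should return the parsed JSON response, while a request without the header should be rejected by Traefik before it reaches Firecrawl.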
I hope you found this article useful! 😊