
Scraping

All API requests must be authenticated with an API key sent in the x-api-key HTTP header. You can obtain an API key by signing up for an account at rocketscraper.com/signup.

POST /scrape

Extracts structured data from a webpage according to your specified schema.

Authentication

Include your API key in the x-api-key HTTP header:

x-api-key: YOUR_API_KEY

Request Parameters

| Parameter | Type | Required | Description |
|---|---|---|---|
| url | string | Yes | The URL of the webpage to scrape |
| schema | object | Yes | The structure definition for the data to be extracted |
| task_description | string | No | Additional instructions for the AI system (recommended for complex tasks such as summarization, sentiment analysis, translation, etc.) |
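
For readers calling the endpoint without the client library, the parameters above might be assembled into a request roughly as sketched below. The base URL `https://api.example.com`, the JSON body encoding, and the helper name `build_scrape_request` are assumptions made for illustration; only the `POST /scrape` path, the `x-api-key` header, and the three parameter names come from this page. The request is built but not sent.

```python
import json
import urllib.request

# Placeholder host: the real API base URL is not stated on this page.
ASSUMED_BASE_URL = "https://api.example.com"


def build_scrape_request(api_key, url, schema, task_description=None):
    """Build (but do not send) a POST /scrape request. Hypothetical helper."""
    payload = {"url": url, "schema": schema}
    # task_description is optional, so include it only when provided.
    if task_description is not None:
        payload["task_description"] = task_description
    return urllib.request.Request(
        ASSUMED_BASE_URL + "/scrape",
        data=json.dumps(payload).encode("utf-8"),  # assumes a JSON request body
        headers={"x-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


request = build_scrape_request(
    "YOUR_API_KEY",
    "https://example.com/product",
    {"title": "string", "price": "number"},
)
print(request.get_method(), request.full_url)
```

Keeping the optional parameter out of the payload unless it is set avoids sending an explicit null for `task_description`, whose handling this page does not describe.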

Schema Types

The schema defines the structure of the data you want to extract. Supported data types include:

| Type | Description |
|---|---|
| boolean | Represents a true or false value |
| integer | Represents an integer value |
| number | Represents any numeric value, including integers and floating-point numbers |
| string | Represents a sequence of characters |
| array | Represents an ordered list of items |
| object | Represents a JSON object, which is a collection of key-value pairs |
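
A schema can be checked locally for misspelled or unsupported type names before it is sent. The sketch below is a client-side convenience, not part of the API: the helper name `find_unsupported_types` is hypothetical, and it assumes that nested objects and arrays are written as Python dicts and lists, as in the examples on this page.

```python
# The six type names listed in the table above.
SUPPORTED_TYPES = {"boolean", "integer", "number", "string", "array", "object"}


def find_unsupported_types(schema, path=""):
    """Return (path, value) pairs for leaves that are not supported type names."""
    problems = []
    if isinstance(schema, dict):
        # Nested object: check each field, extending the dotted path.
        for key, value in schema.items():
            child = f"{path}.{key}" if path else key
            problems += find_unsupported_types(value, child)
    elif isinstance(schema, list):
        # Array: check the item definition(s) inside it.
        for index, item in enumerate(schema):
            problems += find_unsupported_types(item, f"{path}[{index}]")
    elif schema not in SUPPORTED_TYPES:
        problems.append((path, schema))
    return problems


schema = {"title": "string", "price": "float", "tags": [{"name": "string"}]}
print(find_unsupported_types(schema))  # [('price', 'float')]
```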

Schema Best Practices

When defining your schema, use descriptive field names that clearly communicate your extraction requirements to the AI. The more specific and descriptive your field names are, the better the AI can understand and fulfill your requirements.

Examples of Good vs Basic Field Names

| Basic Field | Better Field Name | Description |
|---|---|---|
| price | currentSalePriceUSD | Specifies currency and price type |
| date | publicationDateISO | Indicates expected date format |
| description | productShortDescription | Clarifies the type and length of description |
| rating | averageUserRatingOutOf5 | Specifies the rating scale |
| features | technicalSpecifications | More precise about the expected content |

Example with Descriptive Fields

{
  "productName": "string",
  "manufacturerBrandName": "string",
  "currentSalePriceUSD": "number",
  "originalRetailPriceUSD": "number",
  "productShortDescription": "string",
  "technicalSpecifications": [{
    "specificationName": "string",
    "specificationValue": "string"
  }],
  "averageUserRatingOutOf5": "number",
  "totalUserReviews": "integer",
  "inStockStatus": "boolean",
  "estimatedShippingDaysRange": {
    "minimum": "integer",
    "maximum": "integer"
  }
}

Basic Example

Here's a basic example of scraping product information. The AI model extracts the requested data points by analyzing the webpage's content, layout, and context, so no CSS selectors or XPath expressions are required:

from rocketscraper import RocketClient

try:
    client = RocketClient('YOUR_API_KEY')

    schema = {
        "title": "string",
        "price": "number",
        "inStock": "boolean"
    }

    result = client.scrape('https://example.com/product', schema)
    print(result)
except Exception as e:
    print(f"Error: {e}")

Example Response

{
  "title": "Wireless Bluetooth Headphones",
  "price": 79.99,
  "inStock": true
}

Advanced Example with Task Description

The task_description parameter allows you to provide detailed instructions to guide the AI system in complex extraction and analysis tasks. While simple data extraction might work well with just a schema, adding a task description becomes invaluable when dealing with nuanced requirements or when the desired output requires multiple processing steps.

Task descriptions are particularly effective for:

Text Summarization and Analysis: When extracting article content, you can guide the AI to focus on specific aspects like key findings, methodology, and implications. For example, you might instruct the system to "Create a three-paragraph summary where the first paragraph covers the main announcement, the second details the methodology, and the third discusses potential industry impact."

Sentiment Analysis with Custom Parameters: Rather than getting a simple positive/negative classification, you can specify exactly how sentiment should be evaluated. For instance: "Analyze sentiment by considering technical specifications, user reviews, and price-to-feature ratio, with extra weight given to professional reviewer opinions."

Language Translation with Context: When dealing with multilingual content, you can provide context-specific translation instructions like "Translate product descriptions while maintaining technical terminology in English" or "Adapt idiomatic expressions to target culture while preserving the original meaning."

Complex Data Relationships: For websites where related information isn't directly connected, you can guide the AI to make logical connections. For example: "Cross-reference product specifications with compatibility information listed in different sections of the page, and create a consolidated compatibility matrix."

Custom Formatting and Validation Rules: You can specify exact formatting requirements: "Extract prices across different currencies, normalize them to USD using current exchange rates, and format them with exactly two decimal places."
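
Multi-step instructions like these can be easier to maintain when assembled from structured pieces rather than typed as one long string. The sketch below shows one possible way to do that; `build_task_description` is a hypothetical helper, and the numbered-step layout is only a convention, since this page does not say that the API requires any particular format for `task_description`.

```python
def build_task_description(intro, steps):
    """Format an intro line plus (instruction, details) steps as numbered text."""
    lines = [intro, ""]
    for number, (instruction, details) in enumerate(steps, start=1):
        lines.append(f"{number}. {instruction}")
        # Each detail becomes an indented bullet under its step.
        lines.extend(f"   - {detail}" for detail in details)
        lines.append("")
    return "\n".join(lines).strip()


task_description = build_task_description(
    "Extract and normalize the product prices:",
    [
        ("Extract prices across different currencies",
         ["Normalize them to USD using current exchange rates"]),
        ("Format each price",
         ["Use exactly two decimal places"]),
    ],
)
print(task_description)
```

The resulting string can then be passed as the `task_description` argument alongside the schema.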

Here's an example showing how to use task descriptions for complex scraping tasks like news summarization:

from rocketscraper import RocketClient

try:
    client = RocketClient('YOUR_API_KEY')

    schema = {
        "title": "string",
        "content": "string",
        "summary": "string",
        "sentiment": "string",
        "key_points": [
            {
                "description": "string"
            }
        ]
    }

    task_description = """
    Extract and analyze the article content following these steps:

    1. Create a concise 3-sentence summary that covers:
       - Main announcement or finding
       - Key technical details or methodology
       - Potential impact or implications

    2. Analyze the overall sentiment considering:
       - Language tone and word choice
       - Reported outcomes and implications
       - Expert opinions and quotes
       Return either 'positive', 'negative', or 'neutral'

    3. Extract 3-5 key points that:
       - Highlight major findings
       - Include relevant statistics or data
       - Capture significant implications
    """

    result = client.scrape(
        'https://example.com/news-article',
        schema,
        task_description=task_description
    )
    print(result)
except Exception as e:
    print(f"Error: {e}")

Example Response

{
  "title": "Breaking News: Tech Innovation",
  "content": "Silicon Valley startup TechCorp unveiled a groundbreaking quantum computing breakthrough today. The new technology promises to solve complex calculations in seconds that would take traditional computers years to process. Early testing shows the system operating at unprecedented efficiency levels, with potential applications ranging from drug discovery to climate modeling.",
  "summary": "A groundbreaking quantum computing breakthrough was announced by TechCorp. The new system can perform complex calculations exponentially faster than traditional computers. Early testing demonstrates exceptional efficiency with wide-ranging potential applications.",
  "sentiment": "positive",
  "key_points": [
    {
      "description": "TechCorp announced a groundbreaking quantum computing breakthrough"
    },
    {
      "description": "The new technology promises to solve complex calculations exponentially faster than traditional computers"
    },
    {
      "description": "Early testing demonstrates exceptional efficiency with wide-ranging potential applications"
    }
  ]
}

Response Format

The API response always matches the structure defined in your schema, returning the extracted data in the format and types you specified. Any field that cannot be extracted is returned as null.
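
Because fields that could not be extracted come back as null, it can be useful to list them before using the data. The sketch below assumes the response has already been parsed into Python dicts and lists (so JSON null appears as `None`); the helper name `find_missing_fields` is hypothetical, and the sample `result` is made up for illustration.

```python
def find_missing_fields(data, path=""):
    """Return the paths of every field whose value is None (JSON null)."""
    missing = []
    if isinstance(data, dict):
        for key, value in data.items():
            child = f"{path}.{key}" if path else key
            missing += find_missing_fields(value, child)
    elif isinstance(data, list):
        for index, item in enumerate(data):
            missing += find_missing_fields(item, f"{path}[{index}]")
    elif data is None:
        missing.append(path)
    return missing


# Illustrative parsed response in which the price could not be extracted.
result = {"title": "Wireless Bluetooth Headphones", "price": None, "inStock": True}
print(find_missing_fields(result))  # ['price']
```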