How to Parse Proxy List Formats (TXT, CSV, JSON)

Anatomy of Proxy List Formats

With the nimbleness of a digital flâneur, let’s wander through the thickets of proxy list formats—TXT, CSV, JSON. Each format, a subtle dialect, whispers its secrets in the syntax. Understanding their anatomy is the first step to parsing their essence.

Format | Structure    | Common Delimiters | Typical Fields
TXT    | Line-based   | Colon, space      | IP, Port, Username, Password
CSV    | Row-based    | Comma, semicolon  | IP, Port, Username, Password
JSON   | Object/Array | None (structured) | IP, Port, Username, Password

Parsing TXT Proxy Lists

Structure

The TXT format, spare and utilitarian, often arrives as a procession of lines. Each line, a vignette:

192.168.1.1:8080
203.0.113.42:3128:username:password

Parsing Logic

  1. Line-by-Line Reading: Each line is a proxy entry.
  2. Delimiter Detection: Colon (:) is the prevailing delimiter. Occasionally, whitespace or tab breathes between fields.
  3. Splitting Fields: The number of components per line determines available data—IP, Port, and optionally credentials.

Python Example

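proxies = []
with open('proxies.txt', 'r', encoding='utf-8') as f:
    for line in f:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split at most three times so a colon inside the password survives.
        parts = line.split(':', 3)
        if len(parts) == 2:
            # ip:port
            ip, port = parts
            proxies.append({'ip': ip, 'port': port})
        elif len(parts) == 4:
            # ip:port:username:password
            ip, port, user, pwd = parts
            proxies.append({'ip': ip, 'port': port, 'username': user, 'password': pwd})
        # lines with any other number of fields are skipped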

Common Pitfalls

  • Mixed Delimiters: Some lists may mix colons and spaces. A gentle regex, like a Parisian boulevard, can accommodate both.
  • Trailing Whitespace: Strip with devotion, lest your parsing stumbles.
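
The mixed-delimiter pitfall can be sketched with a single regular-expression split. This is a minimal sketch rather than a definitive parser: `parse_txt_line` is a hypothetical helper name, it assumes IPv4 addresses or hostnames (IPv6 literals contain colons and would need separate handling), it assumes credentials contain neither colons nor whitespace, and it treats lines starting with `#` as comments, which is a convention not stated in the lists above.

```python
import re

# Split on a colon (with optional surrounding whitespace) or on a run of spaces/tabs.
_DELIMS = re.compile(r'\s*:\s*|\s+')

def parse_txt_line(line):
    """Return a proxy dict for one line, or None if the line is unusable."""
    line = line.strip()
    if not line or line.startswith('#'):  # assumption: '#' marks a comment line
        return None
    parts = _DELIMS.split(line)
    if len(parts) == 2:
        keys = ('ip', 'port')
    elif len(parts) == 4:
        keys = ('ip', 'port', 'username', 'password')
    else:
        return None  # unexpected field count
    return dict(zip(keys, parts))
```

Calling `parse_txt_line` on each line of the file and discarding the `None` results gives the same list of dicts as the example above, while tolerating lines such as `203.0.113.42 3128 username:password`.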

Parsing CSV Proxy Lists

Structure

CSV, the bourgeoisie of data, insists on order. Fields are separated by commas or, in francophone circles, semicolons:

ip,port,username,password
192.168.1.1,8080,,
203.0.113.42,3128,myuser,mypassword

Parsing Logic

  1. Header Recognition: The first row often names the fields.
  2. Delimiter Declaration: Specify the delimiter; CSVs are capricious.
  3. Row Iteration: Each row is a proxy; empty fields are to be expected.

Python Example

import csv

proxies = []
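with open('proxies.csv', newline='', encoding='utf-8') as csvfile:
    # State the delimiter explicitly; use ';' for semicolon-separated files.
    reader = csv.DictReader(csvfile, delimiter=',')
    for row in reader:
        # Short rows yield None for missing columns, so normalise to ''.
        proxies.append({
            'ip': row.get('ip') or '',
            'port': row.get('port') or '',
            'username': row.get('username') or '',
            'password': row.get('password') or '',
        })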

Common Pitfalls

  • Quoted Fields: CSVs sometimes wrap fields in quotes, especially if passwords contain commas.
  • Missing Headers: When headers are absent, enumerate columns with care.
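
Both pitfalls can be handled with the standard `csv` module. The sketch below assumes a headerless file whose columns arrive in the order ip, port, username, password; that order, the semicolon default, and the helper name `parse_headerless_csv` are illustrative assumptions, not something the format guarantees.

```python
import csv
import io

FIELDS = ['ip', 'port', 'username', 'password']  # assumed column order

def parse_headerless_csv(text, delimiter=';'):
    """Parse CSV text that has no header row, mapping columns by position."""
    reader = csv.DictReader(
        io.StringIO(text, newline=''),
        fieldnames=FIELDS,    # supply the names the file omits
        delimiter=delimiter,
        restval='',           # short rows get '' for the missing columns
    )
    proxies = []
    for row in reader:
        if not row['ip']:
            continue  # skip rows without a host
        proxies.append({key: row[key] for key in FIELDS})
    return proxies
```

Because the `csv` reader honours double quotes by default, a password such as `"pa;ss"` comes through intact even though it contains the delimiter.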

Parsing JSON Proxy Lists

Structure

JSON, the modernist manifesto. Structured, self-describing, draped in curly braces:

[
    {"ip": "192.168.1.1", "port": 8080},
    {"ip": "203.0.113.42", "port": 3128, "username": "myuser", "password": "mypassword"}
]

Parsing Logic

  1. Load as Native Structure: JSON deserializes into dictionaries or lists.
  2. Field Extraction: Access fields directly, their presence or absence a matter of elegant optionality.

Python Example

import json

with open('proxies.json') as f:
    proxies = json.load(f)
    # proxies is now a list of dicts, each with ip, port, and optional credentials

Common Pitfalls

  • Malformed JSON: A missing comma, a stray bracket, and the whole edifice collapses.
  • Data Types: Ports may come as integers or strings; harmonize types post-parsing.
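
Both pitfalls can be met at load time. The following is a minimal sketch, assuming the top-level value is an array of objects with `ip` and `port` keys as in the sample above; `load_json_proxies` is a hypothetical helper, and dropping entries whose port is not a number between 1 and 65535 is one possible policy among several.

```python
import json

def load_json_proxies(text):
    """Parse a JSON array of proxy objects and coerce each port to int."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        # Report where the document broke instead of a bare traceback.
        raise ValueError(
            f"malformed proxy JSON at line {exc.lineno}, column {exc.colno}: {exc.msg}"
        ) from exc
    if not isinstance(data, list):
        raise ValueError("expected a top-level JSON array")

    proxies = []
    for entry in data:
        if not isinstance(entry, dict) or 'ip' not in entry or 'port' not in entry:
            continue  # skip entries missing the required fields
        try:
            port = int(entry['port'])  # accepts 8080 and "8080" alike
        except (TypeError, ValueError):
            continue
        if not 1 <= port <= 65535:
            continue
        proxies.append({**entry, 'port': port})
    return proxies
```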

Comparative Table: TXT vs CSV vs JSON

Feature              | TXT          | CSV                  | JSON
Human readability    | High         | Medium               | High
Parsing complexity   | Low–Medium   | Medium               | Low
Support for metadata | None         | Possible via headers | Extensive
Common delimiters    | Colon, space | Comma, semicolon     | N/A (structured)
Handles credentials  | Sometimes    | Yes                  | Yes
Suitability for bulk | High         | High                 | High

Handling Inconsistencies and Edge Cases

Mixed Formats

Sometimes, the world rebels against neatness—a TXT file with comma delimiters, a CSV without headers, a JSON array of arrays. To parse through chaos:

  • Auto-detect Delimiters: Use Python’s csv.Sniffer or test for delimiters with regexes.
  • Flexible Field Mapping: When headers are absent, map fields by position, but allow for optional ones (e.g., username/password).
  • Graceful Fallbacks: Wrap parsing in try/except; log and skip corrupted entries with the sangfroid of a boulevardier.
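
The three tactics above can be combined in one pass. This is a sketch under stated assumptions: `detect_delimiter` and `parse_any` are hypothetical names, the candidate delimiter set and the colon default are guesses about typical lists, and fields are assumed to arrive in the order ip, port, username, password. `csv.Sniffer` is a heuristic that can fail on irregular samples, hence the fallback.

```python
import csv
import io
import logging

logger = logging.getLogger(__name__)

def detect_delimiter(sample, candidates=':;,\t ', default=':'):
    """Guess the delimiter with csv.Sniffer, falling back to a default."""
    try:
        return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter
    except csv.Error:
        return default

def parse_any(text):
    """Parse delimited proxy text by position, logging and skipping bad rows."""
    delimiter = detect_delimiter(text[:2048])
    reader = csv.reader(io.StringIO(text, newline=''), delimiter=delimiter)
    proxies = []
    for lineno, row in enumerate(reader, start=1):
        if not row:
            continue  # blank line
        try:
            # Too few fields raises ValueError on unpacking.
            ip, port, *creds = (field.strip() for field in row)
            if not ip:
                raise ValueError("empty host")
            entry = {'ip': ip, 'port': int(port)}  # non-numeric port raises ValueError
            if len(creds) >= 2 and creds[0]:
                entry['username'], entry['password'] = creds[0], creds[1]
            proxies.append(entry)
        except ValueError as exc:
            logger.warning("skipping line %d: %s", lineno, exc)
    return proxies
```

A side effect of converting the port to an integer is that a header row such as `ip,port,username,password` is rejected and logged like any other malformed line, so the same function copes with files that have headers and files that do not.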

Unicode and Encoding

The proxy list, a cosmopolitan artefact, may arrive in UTF-8, Latin-1, or worse. Always specify encoding:

with open('proxies.txt', encoding='utf-8') as f:
    lines = f.read().splitlines()  # then parse each line as usual
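
When the encoding is not known in advance, one option is to read the file as bytes and try candidates in order. The sketch below is an assumption-laden fallback, not a detection algorithm: `decode_proxy_bytes` is a hypothetical helper, and the order (UTF-8 with optional byte-order mark first, Latin-1 last) is a choice made here because Latin-1 can decode any byte sequence and therefore always succeeds, though not always with the intended characters.

```python
def decode_proxy_bytes(raw, encodings=('utf-8-sig', 'latin-1')):
    """Decode raw bytes by trying each encoding in order."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError("none of the candidate encodings could decode the data")
```

The bytes would come from `open('proxies.txt', 'rb').read()`; the decoded string can then be split into lines and parsed as before.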

Transforming Proxy Data for Use

Once parsed, proxies usually need to be rendered as URLs for HTTP clients or libraries, with credentials included when present:

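from urllib.parse import quote

def format_proxy(proxy):
    if proxy.get('username') and proxy.get('password'):
        # Percent-encode credentials so characters such as '@' or ':' cannot break the URL.
        user = quote(proxy['username'], safe='')
        pwd = quote(proxy['password'], safe='')
        return f"http://{user}:{pwd}@{proxy['ip']}:{proxy['port']}"
    else:
        return f"http://{proxy['ip']}:{proxy['port']}"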
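
As one illustration of handing such a URL to a client, the standard library's `urllib.request` accepts a mapping from scheme to proxy URL. This is a sketch only: `build_opener_for` is a hypothetical helper, and routing both HTTP and HTTPS traffic through the same proxy is an assumption that will not suit every proxy.

```python
import urllib.request

def build_opener_for(proxy_url):
    """Return a urllib opener that routes HTTP and HTTPS through proxy_url."""
    handler = urllib.request.ProxyHandler({'http': proxy_url, 'https': proxy_url})
    return urllib.request.build_opener(handler)
```

An opener built this way exposes an `open(url)` method; nothing is sent over the network until that method is called.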

Aesthetic Touches: The Syntax of Parsing

Parsing is not mere automation; it’s the art of listening to data’s accent. Be attentive to the subtle inflections—delimiters, missing fields, the occasional out-of-place character—and let your code adapt with the elegance of a well-spoken phrase.

With these techniques, your parser becomes a cosmopolite—at home in any proxy list salon, ready to converse fluently with TXT, CSV, or JSON, and to extract from each the vital, beating heart of connection.

Théophile Beauvais

Proxy Analyst

Théophile Beauvais is a 21-year-old Proxy Analyst at ProxyMist, where he specializes in curating and updating comprehensive lists of proxy servers from across the globe. With an innate aptitude for technology and cybersecurity, Théophile has become a pivotal member of the team, ensuring the delivery of reliable SOCKS, HTTP, elite, and anonymous proxy servers for free to users worldwide. Born and raised in the picturesque city of Lyon, Théophile's passion for digital privacy and innovation was sparked at a young age.
