LongWriterZero AI Documentation
Complete technical documentation for implementing and using LongWriterZero AI in your projects, from basic installation to advanced usage patterns.
Getting Started
LongWriterZero AI is a 32-billion parameter language model specifically designed for generating coherent long-form content. This guide will help you get started with the model quickly and efficiently.
Prerequisites
- Python 3.8 or higher
- PyTorch 1.13.0 or higher
- Transformers library 4.21.0 or higher
- Minimum 64 GB RAM (recommended for optimal performance)
- CUDA-compatible GPU (optional but recommended); a quick environment check is sketched below
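Before downloading the 32B checkpoint, you can confirm these prerequisites with a minimal check script (a sketch, assuming PyTorch and Transformers are already installed):
import sys
import torch
import transformers
# Check interpreter and library versions against the prerequisites above
assert sys.version_info >= (3, 8), "Python 3.8 or higher is required"
print(f"Python:        {sys.version.split()[0]}")
print(f"PyTorch:       {torch.__version__}")
print(f"Transformers:  {transformers.__version__}")
# A GPU is optional but strongly recommended for a 32B model
if torch.cuda.is_available():
    print(f"CUDA device:   {torch.cuda.get_device_name(0)}")
else:
    print("No CUDA device detected; inference will fall back to CPU")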
Quick Start
# Install required packages
pip install transformers torch accelerate
# Basic usage example
from transformers import AutoTokenizer, AutoModelForCausalLM
# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("THU-KEG/LongWriter-Zero-32B")
model = AutoModelForCausalLM.from_pretrained("THU-KEG/LongWriter-Zero-32B")
# Generate text
prompt = "Write a comprehensive analysis of artificial intelligence in modern society:"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_length=2048, do_sample=True, temperature=0.7)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
Installation Methods
Method 1: Hugging Face Transformers
The recommended approach for most users:
# Install transformers with PyTorch
pip install transformers[torch] accelerate
# For GPU support (CUDA 11.8)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
# Load the model
python -c "
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained('THU-KEG/LongWriter-Zero-32B')
model = AutoModelForCausalLM.from_pretrained('THU-KEG/LongWriter-Zero-32B')
print('Model loaded successfully!')
"Method 2: Ollama (Local Deployment)
For local deployment and offline usage:
# Install Ollama (macOS/Linux)
curl -fsSL https://ollama.ai/install.sh | sh
# Pull the LongWriterZero model
ollama pull gurubot/longwriter-zero-32b
# Run the model
ollama run gurubot/longwriter-zero-32b
# Example usage
>>> Write a detailed research paper introduction about machine learning applications...
Note: Ollama requires approximately 20GB of disk space and 64GB RAM for optimal performance.
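Ollama also exposes a local HTTP API (default port 11434), so generations can be scripted from Python. The sketch below assumes the model was pulled under the gurubot/longwriter-zero-32b tag shown above and that the requests library is installed:
import requests
# Ollama's local REST endpoint for single-shot generation
OLLAMA_URL = "http://localhost:11434/api/generate"
payload = {
    "model": "gurubot/longwriter-zero-32b",  # tag used in `ollama pull` above
    "prompt": "Write a detailed research paper introduction about machine learning applications.",
    "stream": False,  # return the whole completion in one JSON object
}
response = requests.post(OLLAMA_URL, json=payload, timeout=600)
response.raise_for_status()
print(response.json()["response"])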
Method 3: Docker Container
For containerized deployment:
# Create Dockerfile
FROM python:3.9-slim
RUN pip install transformers torch accelerate
WORKDIR /app
COPY your_script.py .
CMD ["python", "your_script.py"]

# Build and run
docker build -t longwriter-zero .
docker run --gpus all -it longwriter-zero
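The Dockerfile above copies a your_script.py into the image. A minimal sketch of what that script could contain, reusing the loading pattern from the Quick Start (the prompt and generation settings are illustrative):
# your_script.py - minimal container entrypoint
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("THU-KEG/LongWriter-Zero-32B")
model = AutoModelForCausalLM.from_pretrained(
    "THU-KEG/LongWriter-Zero-32B",
    device_map="auto",  # spread the 32B model across available GPUs
)
prompt = "Write a comprehensive analysis of artificial intelligence in modern society:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=1024, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))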
Basic Usage Examples
Simple Text Generation
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
# Initialize model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("THU-KEG/LongWriter-Zero-32B")
model = AutoModelForCausalLM.from_pretrained("THU-KEG/LongWriter-Zero-32B")
# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
def generate_long_text(prompt, max_length=8192):
    # Tokenize input
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    # Generate text
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_length=max_length,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
            repetition_penalty=1.1,
            pad_token_id=tokenizer.eos_token_id
        )
    # Decode and return
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return generated_text[len(prompt):]  # Remove prompt from output
# Example usage
prompt = "Write a comprehensive guide to sustainable energy:"
result = generate_long_text(prompt, max_length=4096)
print(result)
Batch Processing
def batch_generate(prompts, max_length=2048, batch_size=4):
    # Some causal LM tokenizers ship without a pad token; fall back to EOS so padding=True works
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    results = []
    for i in range(0, len(prompts), batch_size):
        batch_prompts = prompts[i:i+batch_size]
        # Tokenize batch
        inputs = tokenizer(
            batch_prompts,
            return_tensors="pt",
            padding=True,
            truncation=True
        ).to(device)
        # Generate for batch
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_length=max_length,
                do_sample=True,
                temperature=0.7,
                pad_token_id=tokenizer.eos_token_id
            )
        # Decode batch results
        batch_results = [
            tokenizer.decode(output, skip_special_tokens=True)
            for output in outputs
        ]
        results.extend(batch_results)
    return results
# Example usage
prompts = [
    "Explain quantum computing:",
    "Describe the history of artificial intelligence:",
    "Write about renewable energy technologies:"
]
results = batch_generate(prompts)
for i, result in enumerate(results):
    print(f"Result {i+1}: {result[:200]}...")
API Reference
Generation Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| max_length | int | 2048 | Maximum total sequence length (prompt plus generated tokens) |
| temperature | float | 1.0 | Controls randomness (0.1-2.0) |
| top_p | float | 1.0 | Nucleus sampling parameter (0.0-1.0) |
| top_k | int | 50 | Top-k sampling parameter |
| repetition_penalty | float | 1.0 | Penalty for repeated tokens (1.0-1.2) |
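These parameters can be passed to model.generate() directly, as in the examples above, or bundled in a Transformers GenerationConfig object for reuse. The values below are illustrative, not tuned recommendations:
from transformers import GenerationConfig
# Group the table's parameters into a reusable configuration
generation_config = GenerationConfig(
    max_length=2048,
    temperature=0.7,
    top_p=0.9,
    top_k=50,
    repetition_penalty=1.1,
    do_sample=True,
)
# Reuse it across calls instead of repeating keyword arguments:
# outputs = model.generate(**inputs, generation_config=generation_config)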
Memory Optimization
# Enable gradient checkpointing to reduce memory usage during fine-tuning (it has no effect on pure inference)
model.gradient_checkpointing_enable()
# Use 8-bit quantization for lower memory requirements
from transformers import BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,
    llm_int8_has_fp16_weight=False
)
model = AutoModelForCausalLM.from_pretrained(
    "THU-KEG/LongWriter-Zero-32B",
    quantization_config=quantization_config,
    device_map="auto"
)
Configuration Options
Environment Variables
# Set cache directory for model downloads
export TRANSFORMERS_CACHE=/path/to/cache
# Set device for inference
export CUDA_VISIBLE_DEVICES=0,1
# Enable mixed precision
export TORCH_DTYPE=float16
# Set logging level
export TRANSFORMERS_VERBOSITY=info
Model Configuration
from transformers import AutoConfig, AutoModelForCausalLM
import torch
# Load and modify configuration
config = AutoConfig.from_pretrained("THU-KEG/LongWriter-Zero-32B")
# Modify attention settings
config.max_position_embeddings = 32768 # Maximum context length
config.use_cache = True # Enable KV cache for faster generation
# Load model with custom config
model = AutoModelForCausalLM.from_pretrained(
    "THU-KEG/LongWriter-Zero-32B",
    config=config,
    torch_dtype=torch.float16,
    device_map="auto"
)
Best Practices
Prompt Engineering
- Use clear, specific instructions at the beginning of your prompt
- Provide context and examples when possible
- Specify the desired length and format of the output
- Use consistent formatting and structure in your prompts (an example prompt follows below)
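As an illustration of these guidelines, the hypothetical prompt below states the task, an explicit length target, and the required structure up front, then reuses the generate_long_text helper defined earlier:
# A hypothetical prompt following the guidelines above:
# clear instruction, explicit length target, and a required structure
prompt = (
    "Write a 3,000-word technical report on sustainable energy storage.\n"
    "Structure the report as follows:\n"
    "1. Executive summary\n"
    "2. Current storage technologies (batteries, pumped hydro, hydrogen)\n"
    "3. Cost and scalability comparison\n"
    "4. Outlook for the next decade\n"
    "Use a formal tone and include concrete figures where relevant."
)
result = generate_long_text(prompt, max_length=8192)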
Performance Optimization
- Use GPU acceleration when available (CUDA/ROCm)
- Load the model in half precision (torch.float16) to reduce memory use and speed up inference
- Implement batch processing for multiple requests
- Use model parallelism (e.g., device_map="auto") to split the model across multiple GPUs
- Consider using quantization for memory-constrained environments
Memory Management
- Monitor GPU memory usage during inference
- Use gradient checkpointing to reduce the memory footprint of fine-tuning
- Clear the CUDA cache between generation calls if needed (see the sketch below)
- Consider using CPU offloading for very large sequences
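A minimal sketch of the monitoring and cache-clearing steps mentioned above, using PyTorch's built-in CUDA memory utilities:
import gc
import torch

def report_gpu_memory(tag=""):
    # Print currently allocated and peak GPU memory in GiB
    if torch.cuda.is_available():
        allocated = torch.cuda.memory_allocated() / 1024**3
        peak = torch.cuda.max_memory_allocated() / 1024**3
        print(f"[{tag}] allocated: {allocated:.2f} GiB, peak: {peak:.2f} GiB")

report_gpu_memory("before generation")
# ... run model.generate(...) here ...
report_gpu_memory("after generation")
# Release cached blocks between generation calls if memory pressure is high
gc.collect()
torch.cuda.empty_cache()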
Troubleshooting
Common Issues
Out of Memory (OOM) Errors
If you encounter CUDA out of memory errors, try these solutions:
- Reduce batch size or max_length
- Enable gradient checkpointing
- Use 8-bit quantization
- Enable CPU offloading (see the sketch below)
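For the last option, the device_map handling in from_pretrained (backed by Accelerate) can spill weights to CPU RAM and optionally to disk. A sketch, where "./offload" is an illustrative placeholder path:
from transformers import AutoModelForCausalLM
import torch
# Place layers on GPU first, overflowing to CPU (and disk) as needed
model = AutoModelForCausalLM.from_pretrained(
    "THU-KEG/LongWriter-Zero-32B",
    torch_dtype=torch.float16,
    device_map="auto",
    offload_folder="./offload",  # illustrative path for weights offloaded to disk
)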
Slow Generation Speed
To improve generation speed:
- Use GPU acceleration
- Enable use_cache=True
- Use torch.compile() on PyTorch 2.0+ (see the sketch below)
- Consider using smaller context windows
Installation Issues
For installation problems:
- Ensure Python version compatibility (3.8+)
- Update pip and setuptools
- Check CUDA version compatibility
- Try installing in a fresh virtual environment
Getting Help
If you need additional support: