LongWriterZero AI Documentation
Complete technical documentation for implementing and using LongWriterZero AI in your projects. From basic installation to advanced usage patterns.
Getting Started
LongWriterZero AI is a 32-billion parameter language model specifically designed for generating coherent long-form content. This guide will help you get started with the model quickly and efficiently.
Prerequisites
- Python 3.8 or higher
- PyTorch 1.13.0 or higher
- Transformers library 4.21.0 or higher
- At least 64GB of RAM (recommended for optimal performance)
- CUDA-compatible GPU (optional but recommended); a quick version check follows this list
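A quick way to confirm these requirements is a short check script like the sketch below (the file name is arbitrary); it only prints the installed library versions and whether a CUDA GPU is visible.

# check_env.py - sanity-check the prerequisites listed above
import sys

import torch
import transformers

print(f"Python:       {sys.version.split()[0]}")
print(f"PyTorch:      {torch.__version__}")
print(f"Transformers: {transformers.__version__}")

assert sys.version_info >= (3, 8), "Python 3.8 or higher is required"

if torch.cuda.is_available():
    print(f"CUDA GPU:     {torch.cuda.get_device_name(0)}")
else:
    print("CUDA GPU:     not detected (CPU inference will be very slow)")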
Quick Start
# Install required packages
pip install transformers torch accelerate

# Basic usage example
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("THU-KEG/LongWriter-Zero-32B")
model = AutoModelForCausalLM.from_pretrained("THU-KEG/LongWriter-Zero-32B")

# Generate text
prompt = "Write a comprehensive analysis of artificial intelligence in modern society:"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_length=2048, do_sample=True, temperature=0.7)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
Installation Methods
Method 1: Hugging Face Transformers
The recommended approach for most users:
# Install transformers with PyTorch
pip install transformers[torch] accelerate

# For GPU support (CUDA 11.8)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# Load the model
python -c "
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained('THU-KEG/LongWriter-Zero-32B')
model = AutoModelForCausalLM.from_pretrained('THU-KEG/LongWriter-Zero-32B')
print('Model loaded successfully!')
"
Method 2: Ollama (Local Deployment)
For local deployment and offline usage:
# Install Ollama (macOS/Linux)
curl -fsSL https://ollama.ai/install.sh | sh

# Pull the LongWriterZero model
ollama pull gurubot/longwriter-zero-32b

# Run the model
ollama run gurubot/longwriter-zero-32b

# Example usage
>>> Write a detailed research paper introduction about machine learning applications...
Note: Ollama requires approximately 20GB of disk space and 64GB RAM for optimal performance.
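Once ollama run is working, the model can also be queried from Python through Ollama's local REST API. The sketch below assumes the default endpoint at http://localhost:11434/api/generate and the model tag pulled above; adjust both if your setup differs.

# Sketch: query a locally running Ollama server from Python.
# Assumes the default API endpoint and the model tag pulled above.
import json
import urllib.request

payload = {
    "model": "gurubot/longwriter-zero-32b",
    "prompt": "Write a detailed research paper introduction about machine learning applications.",
    "stream": False,  # return a single JSON object instead of a token stream
}

request = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(request) as response:
    result = json.loads(response.read().decode("utf-8"))

print(result["response"])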
Method 3: Docker Container
For containerized deployment:
# Create Dockerfile
FROM python:3.9-slim

RUN pip install transformers torch accelerate

WORKDIR /app
COPY your_script.py .

CMD ["python", "your_script.py"]

# Build and run
docker build -t longwriter-zero .
docker run --gpus all -it longwriter-zero
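The Dockerfile copies a your_script.py that is not shown above. A minimal sketch of such a script, reusing the Quick Start example, might look like the following; note that using the GPU inside the container also requires the NVIDIA Container Toolkit on the host.

# your_script.py - minimal container entry point (illustrative sketch)
from transformers import AutoTokenizer, AutoModelForCausalLM


def main():
    tokenizer = AutoTokenizer.from_pretrained("THU-KEG/LongWriter-Zero-32B")
    model = AutoModelForCausalLM.from_pretrained(
        "THU-KEG/LongWriter-Zero-32B",
        device_map="auto",  # uses the GPU if visible, otherwise falls back to CPU
    )

    prompt = "Write a comprehensive analysis of artificial intelligence in modern society:"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=1024, do_sample=True, temperature=0.7)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))


if __name__ == "__main__":
    main()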
Basic Usage Examples
Simple Text Generation
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Initialize model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("THU-KEG/LongWriter-Zero-32B")
model = AutoModelForCausalLM.from_pretrained("THU-KEG/LongWriter-Zero-32B")

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

def generate_long_text(prompt, max_length=8192):
    # Tokenize input
    inputs = tokenizer(prompt, return_tensors="pt").to(device)

    # Generate text
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_length=max_length,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
            repetition_penalty=1.1,
            pad_token_id=tokenizer.eos_token_id
        )

    # Decode and return
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return generated_text[len(prompt):]  # Remove prompt from output

# Example usage
prompt = "Write a comprehensive guide to sustainable energy:"
result = generate_long_text(prompt, max_length=4096)
print(result)
Batch Processing
# Batched generation with a decoder-only model needs a pad token and left padding
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

def batch_generate(prompts, max_length=2048, batch_size=4):
    results = []

    for i in range(0, len(prompts), batch_size):
        batch_prompts = prompts[i:i+batch_size]

        # Tokenize batch
        inputs = tokenizer(
            batch_prompts,
            return_tensors="pt",
            padding=True,
            truncation=True
        ).to(device)

        # Generate for batch
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_length=max_length,
                do_sample=True,
                temperature=0.7,
                pad_token_id=tokenizer.eos_token_id
            )

        # Decode batch results
        batch_results = [
            tokenizer.decode(output, skip_special_tokens=True)
            for output in outputs
        ]
        results.extend(batch_results)

    return results

# Example usage
prompts = [
    "Explain quantum computing:",
    "Describe the history of artificial intelligence:",
    "Write about renewable energy technologies:"
]

results = batch_generate(prompts)
for i, result in enumerate(results):
    print(f"Result {i+1}: {result[:200]}...")
API Reference
Generation Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| max_length | int | 2048 | Maximum number of tokens to generate |
| temperature | float | 1.0 | Controls randomness (0.1-2.0) |
| top_p | float | 1.0 | Nucleus sampling parameter (0.0-1.0) |
| top_k | int | 50 | Top-k sampling parameter |
| repetition_penalty | float | 1.0 | Penalty for repeated tokens (1.0-1.2) |
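In recent versions of transformers these parameters can also be grouped into a GenerationConfig object and reused across calls, rather than being passed to generate() every time. The values below are only an example combination; model, tokenizer, and inputs are assumed to come from the earlier examples.

from transformers import GenerationConfig

# Collect the sampling parameters from the table above in one reusable object
generation_config = GenerationConfig(
    max_length=2048,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    top_k=50,
    repetition_penalty=1.1,
    pad_token_id=tokenizer.eos_token_id,
)

# Reuse the same configuration for every call
outputs = model.generate(**inputs, generation_config=generation_config)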
Memory Optimization
# Enable gradient checkpointing to reduce memory usage
model.gradient_checkpointing_enable()

# Use 8-bit quantization for lower memory requirements
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,
    llm_int8_has_fp16_weight=False
)

model = AutoModelForCausalLM.from_pretrained(
    "THU-KEG/LongWriter-Zero-32B",
    quantization_config=quantization_config,
    device_map="auto"
)
Configuration Options
Environment Variables
# Set cache directory for model downloads
export TRANSFORMERS_CACHE=/path/to/cache

# Set device for inference
export CUDA_VISIBLE_DEVICES=0,1

# Enable mixed precision
export TORCH_DTYPE=float16

# Set logging level
export TRANSFORMERS_VERBOSITY=info
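The same settings can also be applied from Python by writing to os.environ before torch and transformers are imported; set afterwards, they may not take effect. Only the two most commonly needed variables are shown in this sketch.

# Sketch: configure the environment from Python before importing the libraries
import os

os.environ["TRANSFORMERS_CACHE"] = "/path/to/cache"  # model download cache
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"           # GPUs visible to this process

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer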
Model Configuration
from transformers import AutoConfig, AutoModelForCausalLM
import torch

# Load and modify configuration
config = AutoConfig.from_pretrained("THU-KEG/LongWriter-Zero-32B")

# Modify attention settings
config.max_position_embeddings = 32768  # Maximum context length
config.use_cache = True                 # Enable KV cache for faster generation

# Load model with custom config
model = AutoModelForCausalLM.from_pretrained(
    "THU-KEG/LongWriter-Zero-32B",
    config=config,
    torch_dtype=torch.float16,
    device_map="auto"
)
Best Practices
Prompt Engineering
- Use clear, specific instructions at the beginning of your prompt
- Provide context and examples when possible
- Specify the desired length and format of the output
- Use consistent formatting and structure in your prompts (an example template follows this list)
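As a concrete illustration of these guidelines, one possible prompt template is sketched below; the structure and wording are an example only, not a format the model requires. It reuses the generate_long_text helper defined earlier.

# Example prompt following the guidelines above: explicit instruction,
# context, and an explicit target format and length (illustrative only).
prompt = (
    "Task: Write a comprehensive guide to sustainable energy.\n"
    "Context: Focus on solar, wind, and storage technologies; assume a general audience.\n"
    "Format: Use clear section headings and short paragraphs.\n"
    "Length: Approximately 3000 words.\n"
)

result = generate_long_text(prompt, max_length=8192)
print(result)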
Performance Optimization
- Use GPU acceleration when available (CUDA/ROCm)
- Enable mixed precision inference with torch.float16
- Implement batch processing for multiple requests
- Use model parallelism for very large models
- Consider using quantization for memory-constrained environments
Memory Management
- Monitor GPU memory usage during inference
- Use gradient checkpointing to reduce memory footprint
- Clear cache between generation calls if needed (see the sketch after this list)
- Consider using CPU offloading for very large sequences
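A short sketch of the monitoring and cache-clearing steps above, using the generate_long_text helper from the earlier examples:

import gc

import torch


def report_gpu_memory(tag):
    # Print currently allocated and reserved GPU memory in GiB
    if torch.cuda.is_available():
        allocated = torch.cuda.memory_allocated() / 1024**3
        reserved = torch.cuda.memory_reserved() / 1024**3
        print(f"[{tag}] allocated: {allocated:.2f} GiB, reserved: {reserved:.2f} GiB")


report_gpu_memory("before generation")
result = generate_long_text("Explain quantum computing:", max_length=2048)
report_gpu_memory("after generation")

# Release cached GPU memory between calls if memory pressure is an issue
gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()
report_gpu_memory("after empty_cache")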
Troubleshooting
Common Issues
Out of Memory (OOM) Errors
If you encounter CUDA out of memory errors, try these solutions:
- Reduce batch size or max_length
- Enable gradient checkpointing
- Use 8-bit quantization
- Enable CPU offloading
Slow Generation Speed
To improve generation speed:
- Use GPU acceleration
- Enable use_cache=True
- Use torch.compile() (PyTorch 2.0+), as sketched below
- Consider using smaller context windows
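A minimal sketch of the torch.compile() and use_cache suggestions, assuming PyTorch 2.0+ and the model and tokenizer loaded in the earlier examples. One common pattern is to compile the model's forward pass, which is what generate() calls repeatedly during decoding.

import torch

# Compile the forward pass used during generation (PyTorch 2.0+ only).
# The first call is slower while compilation runs; later calls should speed up.
model.forward = torch.compile(model.forward)

inputs = tokenizer("Explain quantum computing:", return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,   # a smaller generation window finishes faster
        use_cache=True,       # reuse the KV cache between decoding steps
        do_sample=True,
        temperature=0.7,
        pad_token_id=tokenizer.eos_token_id,
    )
print(tokenizer.decode(outputs[0], skip_special_tokens=True))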
Installation Issues
For installation problems:
- Ensure Python version compatibility (3.8+)
- Update pip and setuptools
- Check CUDA version compatibility
- Try installing in a fresh virtual environment
Getting Help
If you need additional support: