seaweedfs

Author	SHA1	Message	Date
Chris Lu	503b6f2744	fix(seaweed-volume): ceil EC shard slots in maybe_adjust_volume_max (#9232) Mirrors the volume-server side of seaweedfs/seaweedfs#9196: compute the EC-shard contribution to maxVolumeCount with proper ceiling division ((N + D - 1) / D) instead of (N + D) / D, which over-counts by one slot whenever the per-location EC-shard count is zero or an exact multiple of DataShardsCount (10). The most common case -- a location with no EC shards -- silently inflated maxVolumeCount by 1 on every recalculation. The matching low-disk effective_max_count path in heartbeat.rs already uses the correct ceiling form, and the master-side topology changes from that PR have no Rust counterpart.	2026-04-26 22:31:56 -07:00
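The off-by-one described in 503b6f2744 can be reproduced with plain integer arithmetic. The following is a minimal Go sketch, not the Rust code from the commit: `dataShards`, `slotsBuggy`, and `slotsCeil` are hypothetical names, and only the two formulas and the value 10 come from the commit message.

```go
package main

import "fmt"

// dataShards stands in for DataShardsCount (10 per the commit message).
const dataShards = 10

// slotsBuggy is the pre-fix form (N + D) / D. It over-counts by one
// whenever N is zero or an exact multiple of D.
func slotsBuggy(ecShards int) int {
	return (ecShards + dataShards) / dataShards
}

// slotsCeil is the corrected ceiling division (N + D - 1) / D.
func slotsCeil(ecShards int) int {
	return (ecShards + dataShards - 1) / dataShards
}

func main() {
	// Compare both forms at zero, around one multiple, and at a second multiple.
	for _, n := range []int{0, 1, 9, 10, 11, 20} {
		fmt.Printf("shards=%2d buggy=%d ceil=%d\n", n, slotsBuggy(n), slotsCeil(n))
	}
}
```

The two forms agree for every shard count that is not a multiple of 10, so the bug shows up only at 0, 10, 20, and so on.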
qzh	21fadf5582	fix(shell): correct volume.list -writable filter unit and comparison (#9231) * fix(shell): correct volume.list -writable filter unit and comparison	2026-04-26 22:20:46 -07:00
Chris Lu	0b3cc8d121	4.22	2026-04-26 21:06:39 -07:00
Chris Lu	6cbcdf488c	chore(mount,fuse-test): diagnostics for FUSE ConcurrentReadWrite ENOENT flake PR #9230 attempt 1 hit an intermittent TestConcurrentFileOperations/ConcurrentReadWrite failure where stat returned ENOENT for a path all writers had just succeeded against, and the captured mount.log carried no signal about which layer dropped the entry because the relevant lookup logged at V(4). Two diagnostic-only changes (no behavior change on the happy path): - weed/mount/weedfs.go: in lookupEntry, when filer GetEntry returns ErrNotFound for a path whose inode is still tracked locally with no in-flight create or flush, log Warningf with inode + dirtyHandle + pendingFlush + localCache + dirCached. This surfaces layer-by-layer state at the moment of the suspicious ENOENT. - test/fuse_integration/framework_test.go: on AssertFileExists failure, dump five 100ms-spaced stat retries, a parent ReadDir, and a direct O_RDONLY open before failing. Triangulates kernel dentry caching vs mount lookup vs filer state.	2026-04-26 16:57:37 -07:00
Chris Lu	c934b5dab6	fix(credential/postgres,s3api/iam): rename safety + pgxutil follow-ups to #9226 (#9230 ) * refactor(util): extract pgx OpenDB + DSN builder into shared pgxutil The postgres filer store had OpenPGXDB plus duplicated key=value DSN assembly across postgres/ and postgres2/. Move the connection helper to weed/util/pgxutil and add BuildDSN so the credential postgres store can land on the same code path. filer/postgres/pgx_conn.go keeps OpenPGXDB as a thin alias so postgres2 keeps building unchanged. * refactor(credential/postgres): use shared pgxutil for connection setup Replace the bespoke fmt.Sprintf DSN + sql.Open("pgx", ...) path with pgxutil.BuildDSN + pgxutil.OpenDB so the credential store mirrors the postgres filer store. This also drops the leaky RegisterConnConfig-style init in favor of stdlib.OpenDB(config), which doesn't accumulate entries in the global pgx config map. Adds parity knobs the filer store already exposes: sslcrl, and configurable connection_max_idle / connection_max_open / connection_max_lifetime_seconds (with the previous hardcoded 25/5/5min as defaults). Also moves the jsonbParam helper here so other store files can reuse it. (Helper is also referenced by postgres_identity.go, which is migrated to it in the next commit.) refactor(credential/postgres): use jsonbParam helper across all writers Consolidate JSONB write handling on the new pgxutil-adjacent helper jsonbParam(b []byte) interface{}, which returns nil (driver writes SQL NULL) when the marshaled JSON is empty and string(b) otherwise. postgres_identity.go: replace the inline 'var fooParam any' / 'fooParam = string(b)' pattern with the helper. Same in CreateUser and UpdateUser. postgres_inline_policy.go, postgres_policy.go, postgres_service_account.go, postgres_group.go: every JSONB writer was still passing []byte. 
Under pgx simple_protocol (pgbouncer_compatible=true), []byte is encoded as bytea and Postgres rejects that against a JSONB column with "invalid input syntax for type json". Route them through jsonbParam too. * fix(credential/postgres): rework SaveConfiguration to handle rename + UNIQUE access keys The IAM rename path (s3api UpdateUser) renames an identity in place and keeps its access keys. With the previous flow — upsert each user, then per-user delete-and-insert credentials, then prune absent users — the renamed user's access keys were still owned by the old row when the INSERT for the new name ran, tripping credentials.access_key's global UNIQUE constraint and failing every rename of a user with credentials. Reorder the SaveConfiguration body so the prune step runs BEFORE the credential replace. CASCADE on the old user releases its access keys in the same transaction, and the new name can then claim them. While here: - Replace the per-user loop DELETE FROM users WHERE username = $1 with a single DELETE ... WHERE username = ANY($1), one round trip instead of N inside the transaction. - Surface inline-policy CASCADE losses: count user_inline_policies for the prune set and emit a Warningf when the count is non-zero so rename-driven drops are visible in operator logs (the structural fix for renames lives at the IAM layer in a follow-up commit). - Two-pass credential replace: clear credentials for every user we are about to rewrite first, then insert, so an access key can be moved between two users in the same SaveConfiguration call. - credErr := credRows.Err() before credRows.Close() in LoadConfiguration — Err() is documented as safe after Close, but the leading-capture pattern matches the rest of the file. * fix(s3api/iam): preserve inline policies when renaming a user EmbeddedIamApi.UpdateUser renames an identity in place and the caller persists via SaveConfiguration, which prunes the old username and CASCADE-drops its rows from user_inline_policies. 
GetUserPolicy and ListUserPolicies then return nothing under the new name even though the API reported success — silent data loss. Before flipping sourceIdent.Name, list the user's stored inline policies and re-attach each one under the new name. The subsequent SaveConfiguration prune still CASCADE-removes the old-name rows; only the duplicates we just wrote under the new name survive. Adds a regression test that puts a policy on the old name, renames, and asserts the policy is readable under the new name. * perf(credential/postgres): batch the credential clear in SaveConfiguration The two-pass credential replace was clearing each incoming user's credentials with its own DELETE statement — N round-trips inside the transaction. Match the pattern already used for the user prune and issue a single DELETE FROM credentials WHERE username = ANY($1) instead. * refactor(s3api/iam): plumb context through UpdateUser UpdateUser was synthesizing a fresh context.Background() inside the inline-policy migration block, which discards the request deadline, cancellation, and tracing carried by the caller. Add ctx as the first parameter and pass r.Context() in via the ExecuteAction dispatcher, mirroring the signature already used by CreatePolicy / AttachUserPolicy / DetachUserPolicy. * fix(util/pgxutil): quote DSN values per libpq rules BuildDSN was concatenating values directly, so any password / cert path / database name with a space, single quote, or backslash produced a malformed connection string and pgx.ParseConfig either errored or mis-parsed the remainder. Critical now that the helper is shared with the credential store: mTLS deployments routinely sourcing passwords or secret-mounted cert paths from a vault are exactly the case where spaces and quotes show up. Add quoteDSNValue: empty values and values containing whitespace, `'`, or `\` are wrapped in single quotes with `'` and `\` escaped per PostgreSQL libpq rules; plain alphanumeric values pass through unchanged. 
Apply it to every variable field in BuildDSN. Adds a test that round-trips a password containing spaces, quotes and backslashes through pgx.ParseConfig and confirms the parsed Config matches the input. * fix(credential,s3api/iam): atomic UserRenamer to avoid FK violation on rename The previous IAM rename path called PutUserInlinePolicy(newName, ...) before SaveConfiguration created the new users row. user_inline_policies has a non-deferrable FOREIGN KEY (username) REFERENCES users(username), which Postgres validates at statement time, so every rename of a user that owned at least one inline policy failed with an FK violation. The existing memory-store regression test missed it because the memory backend has no FK enforcement. Add an optional credential.UserRenamer interface plus a CredentialManager.RenameUser thin shim that returns (supported, err). Implement it on PostgresStore as an atomic in-transaction migration: INSERT the new users row by SELECT-copying from the old, UPDATE credentials.username and user_inline_policies.username to the new name (FK satisfied because both rows now exist), then DELETE the old row. ErrUserNotFound / ErrUserAlreadyExists are surfaced cleanly. Implement it on MemoryStore by re-binding store.users / store.accessKeys / store.inlinePolicies under the new name. Also fixes a small leak in DeleteUser, which was forgetting to drop the user's inline-policy bucket. EmbeddedIamApi.UpdateUser now calls RenameUser first; if the store implements the interface, that's the whole migration. If it doesn't (stores without FK enforcement), fall back to the previous list / get / put copy. Adds a focused test for MemoryStore.RenameUser that asserts the identity, the access-key index, and the inline policies all land under the new name.	2026-04-26 16:31:53 -07:00
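The quoting rule described for `quoteDSNValue` in c934b5dab6 can be sketched in a few lines of Go. This is an illustrative re-implementation written from the commit message alone, not the code in weed/util/pgxutil. The function body and the sample values are assumptions. It follows the libpq key=value convention, where a quoted value is wrapped in single quotes and `'` and `\` are escaped with a backslash.

```go
package main

import (
	"fmt"
	"strings"
)

// quoteDSNValue is a hypothetical stand-in for the helper named in the commit.
// Plain values pass through unchanged. Empty values, and values containing
// whitespace, a single quote, or a backslash, are wrapped in single quotes
// with ' and \ escaped by a leading backslash.
func quoteDSNValue(v string) string {
	if v != "" && !strings.ContainsAny(v, " \t\r\n'\\") {
		return v
	}
	var b strings.Builder
	b.WriteByte('\'')
	for _, r := range v {
		if r == '\'' || r == '\\' {
			b.WriteByte('\\')
		}
		b.WriteRune(r)
	}
	b.WriteByte('\'')
	return b.String()
}

func main() {
	// Sample values are made up for illustration.
	fmt.Println("dbname=" + quoteDSNValue("seaweedfs"))
	fmt.Println("password=" + quoteDSNValue(`p@ss w'ord\x`))
	fmt.Println("sslcert=" + quoteDSNValue(""))
}
```

The empty string is quoted as well, because a bare `key=` would otherwise run into the next pair in the connection string.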
Chris Lu	4f628ff4e5	fix(s3api): stream multipart-SSE chunks lazily to avoid truncated GETs (#8908 ) (#9228 ) * fix(s3api): stream multipart SSE-S3 chunks lazily to avoid truncated GETs (#8908) buildMultipartSSES3Reader opened a volume-server HTTP response for EVERY chunk upfront, then walked them with io.MultiReader. For a multipart SSE-S3 object with N internal chunks (e.g. a 200MB Docker Registry blob with 25+ chunks), N volume-server bodies sat live at once; chunks 1..N-1 were idle while io.MultiReader drained chunk 0. Under concurrent load the volume server's keep-alive logic closed those idle responses mid-flight, and the S3 client saw `unexpected EOF` partway through the GET. Truncated bytes hash to the wrong SHA-256, which is exactly the "Digest did not match" symptom Docker Registry reports in #8908 (and which persisted even after the per-chunk metadata fix in #9211 and the completion backfill in #9224). Introduce lazyMultipartChunkReader + preparedMultipartChunk{chunk, wrap}: a generic lazy chunk streamer with a per-chunk wrap closure for the SSE-specific decryption setup. Per-chunk metadata is still validated UPFRONT so a malformed chunk fails fast without opening any HTTP connection -- the eager validation contract callers and tests rely on is preserved. The volume-server GET and the SSE-specific decrypt wrap, however, fire LAZILY: at most one chunk body is live at any time, regardless of object size. This commit applies the new pattern to buildMultipartSSES3Reader only; the SSE-KMS and SSE-C multipart readers retain their eager form for now and will be migrated in follow-up commits, since the same shape exists there too. Tests: - TestBuildMultipartSSES3Reader_LazyChunkFetch pins the new contract: zero chunks opened at construction, peak liveness == 1, all closed after drain. 
- TestBuildMultipartSSES3Reader_RejectsBadChunkBeforeAnyFetch (replaces ClosesAppendedOnError) asserts a malformed chunk in position N causes zero fetches for chunks 0..N -- the previous test pinned a weaker contract (cleanup after eager open). - TestBuildMultipartSSES3Reader_InvalidIVLength updated for the same reason: the fetch callback must NOT be invoked at all on a bad-IV chunk. - TestMultipartSSES3RealisticEndToEnd round-trips multiple parts encrypted the way putToFiler writes them (shared DEK + baseIV, partOffset=0, post-completion global offsets) and walks them through buildMultipartSSES3Reader. * fix(s3api): stream multipart SSE-KMS chunks lazily Apply the same fix as the previous commit to createMultipartSSEKMSDecryptedReaderDirect: per-chunk SSE-KMS metadata is validated upfront, but volume-server GETs fire lazily through lazyMultipartChunkReader. At most one chunk body is live at any time. This is the same eager-open-all-chunks shape that produced #8908's truncated GETs for SSE-S3; SSE-KMS multipart objects with many chunks were exposed to the same idle-keepalive failure mode under concurrent load. The wire format on disk is unchanged (same per-chunk metadata, same encrypted bytes, same object Extended attributes). Existing SSE-KMS multipart objects read back identically -- only when the volume-server GETs fire changes. * fix(s3api): stream multipart SSE-C chunks lazily Apply the same fix as the previous two commits to createMultipartSSECDecryptedReaderDirect: per-chunk SSE-C metadata is validated upfront (IV decode, IV length check, non-negative PartOffset), but the volume-server GET and CreateSSECDecryptedReaderWithOffset wrap fire lazily through lazyMultipartChunkReader. At most one chunk body is live at any time. This is the same eager-open-all-chunks shape that produced #8908's truncated GETs for SSE-S3; SSE-C multipart objects with many chunks were exposed to the same idle-keepalive failure mode under concurrent load. 
The pre-existing TODO note about CopyObject SSE-C PartOffset handling is preserved verbatim. The wire format on disk is unchanged (same per-chunk metadata, same encrypted bytes); existing SSE-C multipart objects read back identically. After this commit all three multipart SSE read paths (SSE-S3, SSE-KMS, SSE-C) share lazyMultipartChunkReader as their streaming engine. * test(s3): add Docker Registry-shape multipart SSE-S3 GET regression Pin the end-to-end fix for #8908 with a test that mirrors what Docker Registry actually does on pull: a 25-part * 5MB upload with bucket-default SSE-S3, then a full GET, then SHA-256 over the streamed body must match SHA-256 over the uploaded bytes. The eager-multipart-reader bug was specifically a streaming truncation under load: the response status was 200 with a Content-Length matching the object size, but the body short-circuited mid-stream because later chunks' volume-server connections had already been closed by keepalive. The hash check is the symptom Docker Registry surfaces ("Digest did not match"), so this is the most faithful regression we can pin without spinning up a registry. uploadAndVerifyMultipartSSEObject already byte-compares the GET body, but hashing on top is intentionally explicit -- it documents WHY the test exists, and matches the failure mode reported in the issue. * test(s3): add range-read coverage matrix across SSE modes and sizes Existing range-read coverage in test/s3/sse was scoped to small (<= 1MB) single-chunk objects, with one ad-hoc range case per SSE mode and one 129-byte boundary-crossing case in TestSSEMultipartUploadIntegration. Nothing exercised: - Range reads on single-PUT objects whose content crosses the 8MB internal chunk boundary (medium size class). 
- Range reads on multipart objects whose parts each span multiple internal chunks (large size class) -- the shape #8908 originally surfaced for full-object GETs and the most likely site of any future regression in per-chunk IV / PartOffset plumbing for partial reads. - A consistent range-pattern set applied uniformly across SSE modes, so any divergence between modes (SSE-C uses random IV + PartOffset; SSE-S3/KMS use base IV + offset) is comparable at a glance. TestSSERangeReadCoverageMatrix introduces a parameterized matrix: modes: no_sse, sse_c, sse_kms, sse_s3 sizes: small (256KB single chunk), medium (12MB single PUT crossing one internal boundary), large (5x9MB multipart, ~10 internal chunks, every part itself spans an 8MB boundary) ranges: single byte at 0, prefix 512B, single byte at last, suffix bytes=-100, open-ended bytes=N-, whole object, AES-block boundary 15-31, mid straddling one internal boundary (medium+large), mid spanning many internal boundaries (large only) Per case it asserts: body bytes equal the expected slice, Content-Length matches the range length, Content-Range matches start-end/total, and the SSE response headers match the mode. The sse_kms branch probes once with a 1-byte SSE-KMS PUT and t.Skip's the remaining sse_kms subtests with a clear reason if the local server has no KMS provider configured -- the default `weed mini` setup lacks one; the Makefile target `test-with-kms` provides one via OpenBao. Other modes always run. Verified locally: 75 subtests pass under no_sse / sse_c / sse_s3 against weed mini, sse_kms cleanly skipped. 
* test(s3): conform new test names to TestSSEIntegration so CI runs them The two tests added in the previous commits had names that did NOT match the patterns the test/s3/sse Makefile and .github/workflows/s3-sse-tests.yml use to discover SSE integration tests: - test/s3/sse/Makefile `test` target: TestSSE.Integration - test/s3/sse/Makefile `test-multipart`: TestSSEMultipartUploadIntegration - .github/workflows/s3-sse-tests.yml: ...\|.Multipart.Integration\|.RangeRequestsServerBehavior Result: SSE-KMS coverage I added to TestSSERangeReadCoverageMatrix and the Docker-Registry-shape multipart regression in TestSSES3MultipartManyChunks_DockerRegistryShape were silently invisible to CI even though the underlying test setup (start-seaweedfs-ci using s3-config-template.json with the embedded `local` KMS provider) already has SSE-KMS configured. Renames: TestSSERangeReadCoverageMatrix -> TestSSERangeReadIntegration TestSSES3MultipartManyChunks_... -> TestSSEMultipartManyChunksIntegration Both names now match `TestSSE.Integration` (Makefile `test` target) and TestSSEMultipartManyChunksIntegration additionally matches `.Multipart.Integration` (CI's comprehensive subset). No behavior change; only the function names move. Verified locally against `weed mini` with s3-config-template.json: TestSSERangeReadIntegration runs 96 leaf subtests across 4 SSE modes (none, SSE-C, SSE-KMS, SSE-S3) x 3 size classes x 7-9 range patterns, all passing, 0 skipped. The probe-and-skip in the SSE-KMS arm now only fires for ad-hoc local setups that don't load any KMS provider; the project's standard test setup loads the local provider, so CI has full SSE-KMS range coverage. 
* fix(s3api): validate SSE-KMS chunk IV during prep, before any fetch Addresses CodeRabbit review on PR #9228: in createMultipartSSEKMSDecryptedReaderDirect the per-chunk SSE-KMS metadata was deserialized in the prep loop but the IV length was only validated later, inside CreateSSEKMSDecryptedReader, which runs from the wrap closure -- AFTER the chunk's volume-server fetch has already started. That weakens the new "reject malformed chunks before any fetch" contract for SSE-KMS specifically: a chunk with a missing/short/long IV would fire its HTTP GET, then fail mid-stream during decrypt. The fix moves the existing ValidateIV check into the prep loop, matching the SSE-S3 and SSE-C paths. Drive-by: extract the SSE-KMS prep loop into a free buildMultipartSSEKMSReader helper that mirrors buildMultipartSSES3Reader, so the new contract is unit-testable without an S3ApiServer. The exported method (createMultipartSSEKMSDecryptedReaderDirect) stays a thin caller, so behavior for production callers is unchanged. New tests in weed/s3api/s3api_multipart_ssekms_test.go pin the contract: - TestBuildMultipartSSEKMSReader_RejectsBadIVBeforeAnyFetch covers missing IV, empty IV, short IV, long IV. Each case asserts both that an error is returned AND that the fetch callback is never invoked. - TestBuildMultipartSSEKMSReader_RejectsMissingMetadataBeforeAnyFetch pins the analogous behavior when SseMetadata is nil on a chunk in position N: chunks 0..N-1 must not be fetched (the earlier eager implementation depended on a closeAppendedReaders cleanup path; the new contract is stronger -- nothing is opened in the first place). - TestBuildMultipartSSEKMSReader_RejectsUnparseableMetadataBeforeAnyFetch covers the JSON-unmarshal failure branch. - TestBuildMultipartSSEKMSReader_SortsByOffset smoke-tests the documented sort-by-offset contract by recording the order in which fetch is invoked. All four pass under `go test ./weed/s3api/`. 
Existing weed/s3api unit suite + the SSE integration suite (with the local KMS provider enabled via s3-config-template.json) continue to pass. * test(s3): address CodeRabbit nitpicks on range coverage matrix Three small follow-ups on the range-read coverage matrix from the previous commit, per CodeRabbit nitpicks on PR #9228: 1. Promote the body-length check from `assert.Equal` to `require.Equal` so a truncation regression -- the canonical #8908 failure mode -- aborts the subtest immediately. Previously the assertion logged a length mismatch and then `assertDataEqual` ran on differently-sized slices, producing a noisy byte-diff on top of the actual symptom. The redundant trailing `t.Fatalf` block becomes dead and is removed. 2. Broaden the SSE-KMS probe-skip heuristic. The probe previously produced the friendly "KMS provider not configured" message only for 5xx responses; KMS-misconfig surfaces also include 501 NotImplemented, 4xx KMS.NotConfigured, and error messages containing "KMS.NotConfigured" / "NotImplemented" / "not configured". The behaviour change is purely cosmetic (the caller t.Skip's on any non-empty reason either way) but the new diagnostic is more useful in CI logs. 3. Add `t.Parallel()` at the mode and size-class levels of the matrix. Each (mode, size) writes an independent object key under the shared bucket, with no cross-talk, so parallel execution is safe. Local wall time on the full matrix dropped from ~2.0s to ~1.1s (~45%); the savings scale with chunk count and CI machine concurrency. Verified locally against `weed mini` with s3-config-template.json: - go test ./weed/s3api/ -count=1 PASS - TestSSERangeReadIntegration -v 112 PASS, 0 SKIP - TestSSEMultipartUploadIntegration etc. PASS * fix(s3api): tighten lazy reader error path; unify SSE IV validation Three CodeRabbit nitpicks on PR #9228: 1. 
lazyMultipartChunkReader: mark finished on non-EOF Read errors The Read loop's three earlier failure paths (chunk index past end, fetch error, wrap error) all set l.finished = true before returning. The non-EOF Read path -- where l.current.Read itself errors mid-chunk -- did not, leaving l.current/l.closer set and l.finished = false. A caller that retried Read after an error would re-enter the same broken stream instead of advancing or giving up. Set l.finished = true on non-EOF Read error so post-error state is consistent across all four failure sites; Close() (which the GetObjectHandler defers) still releases the chunk body. 2. Unify IV-length validation across SSE-S3, SSE-KMS, SSE-C prep paths The previous commit moved SSE-KMS to the shared ValidateIV helper but left SSE-S3 and SSE-C with bespoke inline `len(...) != AESBlockSize` checks. All three are enforcing the same invariant; inconsistency obscures the symmetry. Move SSE-S3 and SSE-C to ValidateIV too, with the same `<algo> chunk <fileId> IV` name convention. Error message wording shifts from "<algo> chunk X has invalid IV length N (expected 16)" to ValidateIV's "invalid <algo> chunk X IV length: expected 16 bytes, got N". The substring "IV length" is preserved across both, so the existing TestBuildMultipartSSES3Reader_InvalidIVLength substring assertion is loosened to match either form. 3. TestBuildMultipartSSEKMSReader_SortsByOffset: verify full ordering The test previously drove Read() to observe fetch-call order, but CreateSSEKMSDecryptedReader requires a live KMS provider to unwrap the encrypted DEK -- unavailable in unit tests -- so the wrap closure failed on the first chunk and only one fetch was ever recorded. The test asserted only fetchOrder[0] == "c0", which is weaker than the comment promised. Switch to a static check: type-assert the returned reader to lazyMultipartChunkReader (same package so unexported fields are accessible) and inspect the prepared chunks slice directly. 
This pins the entire [c0, c1, c2] sort order in one place, doesn't depend on KMS, and runs in zero fetch calls. The fetch closure now asserts it is never invoked during preparation. All weed/s3api unit tests pass; integration suite (with KMS provider configured via s3-config-template.json) passes. test(s3): switch range coverage cleanup to t.Cleanup; tighten KMS probe Two CodeRabbit comments on PR #9228, both about test/s3/sse/s3_sse_range_coverage_test.go: 1. CRITICAL: defer + t.Parallel() race in TestSSERangeReadIntegration The test creates one bucket up front, then runs subtests that call t.Parallel() at the mode and size levels (added in 058cbf27 to cut wall time). t.Parallel() pauses each subtest and yields back to the parent. The parent's for loop finishes scheduling, the function returns, and the deferred cleanupTestBucket fires -- BEFORE any parallel subtest body has executed. The bucket gets deleted out from under the parallel subtests, which then race the cleanup and either fail with NoSuchBucket or, depending on lazy-deletion behaviour on the server side, mask other regressions because chunks happen to still be readable for a brief window. The local matrix passing prior to this commit was a server-side coincidence; the t.Cleanup contract is the right one for parent tests with parallel children, and switching to it is a one-line change. t.Cleanup runs after the test AND all its (parallel) subtests complete, so the bucket survives until every leaf subtest is done. 2. MINOR: tighten the SSE-KMS probe-skip heuristic The previous broadening (058cbf27) treated `code == 400` as "KMS provider not configured", on the theory that some servers return 4xx for KMS misconfig. That is too aggressive: a real misconfiguration in the SSE-KMS test request itself (bad keyID format, missing header) ALSO surfaces as a 400, and would silently t.Skip the SSE-KMS subtree in CI -- which is exactly the integration coverage the new TestSSERangeReadIntegration is supposed to add. 
Drop the 400 branch (and the redundant 501 match, since 501 >= 500 already covers it). Genuine "KMS.NotConfigured" / "NotImplemented" responses are still recognised via the string-match block immediately below, regardless of status code, so the friendly skip message survives for the cases where it actually applies. Verified locally against `weed mini` with s3-config-template.json: - go test ./weed/s3api/ PASS - TestSSERangeReadIntegration -v 113 PASS lines, 0 SKIP - TestSSEMultipartUploadIntegration etc. PASS	2026-04-26 16:31:42 -07:00
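The tightened SSE-KMS probe-skip heuristic described in the commit message above could look roughly like the sketch below. It is an illustration only, not the test suite's actual code: `kmsSkipReason`, its parameters, and the reason strings are hypothetical names chosen here.

```go
package main

import (
	"fmt"
	"strings"
)

// kmsSkipReason is a hypothetical helper. It returns a non-empty reason when a
// probe failure looks like "KMS provider not configured", and "" otherwise.
func kmsSkipReason(status int, message string) string {
	// Any 5xx, which already includes 501 NotImplemented, counts as a
	// server-side KMS problem. A plain 400 is deliberately not matched here:
	// a malformed test request also yields 400 and should fail loudly.
	if status >= 500 {
		return fmt.Sprintf("KMS provider not configured (HTTP %d)", status)
	}
	// Explicit misconfiguration markers are recognised regardless of status.
	for _, marker := range []string{"KMS.NotConfigured", "NotImplemented", "not configured"} {
		if strings.Contains(message, marker) {
			return "KMS provider not configured: " + message
		}
	}
	return ""
}

func main() {
	fmt.Println(kmsSkipReason(501, "") != "")                     // skip
	fmt.Println(kmsSkipReason(400, "InvalidArgument: bad keyID")) // no skip: empty reason
	fmt.Println(kmsSkipReason(400, "KMS.NotConfigured") != "")    // skip via marker
}
```

With this shape, a bad request in the test itself surfaces as a failure instead of silently skipping the SSE-KMS subtree.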
Jon E Nesvold	dc462a80d7	feat(credential/postgres): inline policies, mTLS and pgbouncer connection support (#9226 ) * feat(credential/postgres): mTLS + pgbouncer support, InlinePolicyStore implementation, upsert SaveConfiguration * fix(credential/postgres): add rows.Err() checks, inline policy tests, memory store LoadInlinePolicies * fix(credential/postgres): cast JSONB params to string for pgbouncer simple protocol * fix(credential/postgres): wrap tx.Commit errors with context * fix(credential/postgres): use any type for JSONB params to preserve SQL NULL for nil fields	2026-04-26 14:54:53 -07:00
Parviz Miriyev	f407bdaa36	fix(admin): use TLS-aware HTTP client for /dir/status fetch (#9227 ) fetchPublicUrlMap() in weed/admin/dash/cluster_topology.go uses a dedicated &http.Client{} that doesn't honor security.toml client TLS configuration, and hardcodes "http://" in the URL. When master is configured HTTPS-only ([https.master] set), every cluster topology cache refresh logs: NOTICE: http: TLS handshake error from <admin-ip>:<port>: client sent an HTTP request to an HTTPS server The function falls through to glog.V(1).Infof and returns nil, so the admin UI loses PublicUrl enrichment for data nodes. Cosmetic but noisy. Switch to util_http.GetGlobalHttpClient() whose Do() calls NormalizeHttpScheme(), which automatically rewrites http:// to https:// when [https.client] is enabled and presents the configured client cert. Preserve the 5-second timeout via context.WithTimeout(). Same pattern as weed/admin/handlers/file_browser_handlers.go, weed/server/master_server.go, weed/shell/command_volume_fsck.go.	2026-04-26 12:25:53 -07:00
Chris Lu	0716577ec8	fix(upload): rewind request body when retrying on connection reset (#9139 ) (#9222 ) * fix(upload): rewind request body when retrying on connection reset (#9139) When httpClient.Do() returned "connection reset by peer" or "use of closed network connection", upload_content retried with the same http.Request. But the body is a bytes.Reader the first attempt already consumed, so the retry sent 0 bytes and Go's transport surfaced "http: ContentLength=N with Body length 0". http.NewRequestWithContext populates req.GetBody for bytes.Reader bodies; use it to attach a fresh body before retrying. Reproduces the issue with a unit test (asserts both attempts see the same payload bytes); the test fails without the fix. upload: skip inner retry when body cannot be rewound Per review feedback: if req.GetBody is nil or returns an error, the inner retry would call Do(req) with an already-consumed body and the "connection reset" error would be replaced by the misleading "ContentLength=N with Body length 0" — the very symptom this PR set out to fix. Skip the inner retry on rewind failure and let the outer retriedUploadData loop reissue with a fresh request, and log when GetBody is unavailable for observability. * upload: log the actual transport error in the inner retry log line Per review feedback: the diagnostic glog at the top of the inner retry branch was logging postErr — the request-construction error from http.NewRequestWithContext, which is necessarily nil there because the function returns early at line 423 if it isn't. Operators were seeing "<nil>" instead of the transient transport error that triggered the rewind. Reference post_err so the connection-reset / closed-connection cause is actually visible.	2026-04-26 02:17:55 -07:00
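The body-rewind step described above can be sketched with the standard library alone. This is a simplified illustration under stated assumptions, not the code in upload_content: `rewindBody`, `newUploadRequest`, `drain`, and the example URL are hypothetical.

```go
package main

import (
	"bytes"
	"context"
	"errors"
	"fmt"
	"io"
	"net/http"
)

// newUploadRequest builds a POST whose body is a *bytes.Reader. For that body
// type, http.NewRequestWithContext populates req.GetBody automatically.
func newUploadRequest(payload []byte) (*http.Request, error) {
	return http.NewRequestWithContext(context.Background(), http.MethodPost,
		"http://volume.example/upload", bytes.NewReader(payload)) // hypothetical URL
}

// rewindBody attaches a fresh body so the request can be sent again. It
// returns an error when the body cannot be rewound; the caller should then
// skip the in-place retry and build a new request instead.
func rewindBody(req *http.Request) error {
	if req.GetBody == nil {
		return errors.New("request body cannot be rewound: GetBody is nil")
	}
	body, err := req.GetBody()
	if err != nil {
		return fmt.Errorf("rewind request body: %w", err)
	}
	req.Body = body
	return nil
}

// drain simulates an attempt that consumed the whole request body.
func drain(req *http.Request) string {
	data, _ := io.ReadAll(req.Body)
	return string(data)
}

func main() {
	req, err := newUploadRequest([]byte("needle data"))
	if err != nil {
		panic(err)
	}
	first := drain(req)    // first attempt reads the payload
	consumed := drain(req) // a naive retry would now send nothing
	if err := rewindBody(req); err != nil {
		panic(err)
	}
	second := drain(req) // after the rewind the payload is available again
	fmt.Printf("first=%q consumed=%q second=%q\n", first, consumed, second)
}
```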
Chris Lu	654292b57d	fix(volume): cap leveldb OpenFilesCacheCapacity per index DB (#9139 ) (#9223 ) * fix(volume): cap leveldb OpenFilesCacheCapacity per index DB (#9139) The leveldb opt.Options for NeedleMapLevelDb / Medium / Large never set OpenFilesCacheCapacity, so each leveldb instance defaulted to goleveldb's 500. On servers with thousands of volumes, that ceiling stacks across DBs and exhausts even high ulimits, starving WAL rotation: failed to write leveldb: open .../000006.log: too many open files CompactionTableSizeMultiplier=10 already keeps the SST count low, so a small per-DB cache is sufficient. Cap at 16 / 32 / 64 for the small / medium / large variants so per-DB FD usage is bounded. * storage: hoist leveldb FD-cap values into named constants Per review feedback: replace the inline 16/32/64 literals with LevelDb{,Medium,Large}OpenFilesCacheCapacity, and move the rationale (why 500 is too high per-DB on busy servers, what the tradeoff is) into a package-level comment so future readers see the memory vs. performance picture at the constant declaration instead of inline.	2026-04-26 02:15:15 -07:00
Chris Lu	525900dfe4	fix(s3api): backfill multipart SSE-S3 metadata at completion (#9224 ) * fix(s3api): backfill missing per-chunk SSE-S3 metadata at completion When a part of an SSE-S3 multipart upload lands with SseType=NONE on its chunks (e.g. a transient failure to apply SSE-S3 setup in PutObjectPart), the completed object inherits NONE-tagged chunks and detectPrimarySSEType then misses the chunked SSE-S3 encryption. The read path falls through to the unencrypted serve and GET returns ciphertext, producing the SHA mismatch reported in #8908. Recover at completion using the base IV and key data the upload directory recorded at CreateMultipartUpload: - extractMultipartSSES3Info validates upload-entry metadata up front and hard-fails completion if the base IV or key data are malformed; serializing chunk metadata we then could not decrypt is worse than rejecting the upload. - completedMultipartChunk re-derives a per-chunk IV from baseIV + chunk.Offset (matching what putToFiler would have written) and serializes per-chunk SSE-S3 metadata when the chunk has no tag. Existing per-chunk metadata is left alone; we cannot recover an already-derived IV from the upload-entry alone. The IV formula intentionally has no partNumber term: putToFiler hardcodes partOffset=0 when it calls handleSSES3MultipartEncryption for every part, so each chunk's encryption IV is calculateIVWithOffset(baseIV, chunk.Offset_part_local). PartOffsetMultiplier is defined in s3_constants but is not consumed by the encryption path. Adopting (partNumber-1)*PartOffsetMultiplier + chunk.Offset would produce IVs that fail to decrypt the bytes on disk - a stronger failure mode than the bug being fixed. 
Tests pin this: - TestCompletedMultipartChunkBackfilledIVDecryptsActualCiphertext runs the round trip across the encryption boundary: encrypt parts with CreateSSES3EncryptedReaderWithBaseIV (the call putToFiler uses), drop chunk metadata to reproduce #8908, backfill, decrypt with backfilled IV, assert plaintext intact. - TestCompletedMultipartChunkRejectsPartNumberMultiplierFormula constructs the IV the partNumber formula would produce and shows it does not decrypt the actual ciphertext. This commit covers the chunk-level recovery only. The companion fix for the object-level Extended attributes (SeaweedFSSSES3Key / X-Amz-Server-Side-Encryption) follows separately. fix(s3api): backfill canonical SSE-S3 attributes onto multipart object The previous commit ensures every chunk of an SSE-S3 multipart upload carries SseType=SSE_S3 with a per-chunk IV, so the multipart-direct read path can decrypt. The completed object's Extended map can still miss the canonical pair detectPrimarySSEType and IsSSES3EncryptedInternal look at: - X-Amz-Server-Side-Encryption (the AmzServerSideEncryption header detectPrimarySSEType reads on inline / small-object reads) - x-seaweedfs-sse-s3-key (SeaweedFSSSES3Key, required by IsSSES3EncryptedInternal and by the read-path key lookup) When a part of the upload was written by a path that did not set those (the same #8908 race that produced the NONE chunks), copySSEHeadersFromFirstPart finds nothing to copy and the final entry ends up with only the multipart-init keys (SeaweedFSSSES3Encryption / BaseIV / KeyData). The read path then mis-detects the object as unencrypted. applyMultipartSSES3HeadersFromUploadEntry writes the canonical pair from the multipart-init metadata in all three completion paths (versioned, suspended, non-versioned), only when the keys are missing so a healthy first part still wins. extractMultipartSSES3Info already ran in prepareMultipartCompletionState, so the data is reused without re-decoding. 
Tests: TestApplyMultipartSSES3HeadersFromUploadEntry covers backfill, do-not-clobber, and nil-info no-op cases. * fix(s3api): drop double IV adjustment in SSE-KMS chunk view decrypt decryptSSEKMSChunkView was pre-adjusting the SSE-KMS chunk IV (calculateIVWithOffset(baseIV, ChunkOffset)) and then handing the adjusted IV to CreateSSEKMSDecryptedReader, which itself runs calculateIVWithOffset(IV, ChunkOffset) on whatever it receives. The offset was being applied twice for any chunk with a non-zero ChunkOffset, corrupting the keystream for range reads that cross multipart chunk boundaries. Pass the raw SSE-KMS key (with base IV and the original ChunkOffset field) into CreateSSEKMSDecryptedReader so the offset is applied exactly once, and remove the now-dead intra-block skip that was compensating for the double adjustment. Add an anti-test inside TestSSEKMSDecryptChunkView_RequiresOffsetAdjustment that decrypts the same ciphertext with a deliberately double-adjusted IV and asserts the output is corrupted, so any regression that re-introduces the double application fails the unit test. * test(s3): cover multipart SSE across chunk-spanning parts and ranges Adds an integration subtest "Multipart Parts Larger Than Internal Chunks Across SSE Types" to TestSSEMultipartUploadIntegration that exercises the end-to-end S3 path for the bugs fixed in this branch: - Two-part multipart upload with each part larger than the 8MB internal SeaweedFS chunk, so each part itself spans multiple underlying chunks. - Subtests for SSE-C, SSE-KMS, explicit SSE-S3, and bucket-default SSE-S3 - the four paths multipart parts can take through the SSE pipeline. - Each subtest does a full GET (verifying every byte and the response Content-Length / SSE response headers) plus a 129-byte range read straddling the 8MB internal chunk boundary, which is the path that produced the SSE-KMS double-IV corruption (fix in the previous commit) and the SSE-S3 chunk-tag loss (fix in the earlier commits). 
Factored the request shape behind multipartSSEOptions / uploadAndVerifyMultipartSSEObject so all four SSE flavors share the same upload+verify code; only the SSE-specific input/output configuration differs per subtest. * test(s3): abort orphan multipart uploads on test failure Address coderabbit nitpick on uploadAndVerifyMultipartSSEObject. The helper used require.NoError after CreateMultipartUpload, UploadPart and CompleteMultipartUpload, so a failure in any of those (or in the later GET / range read on a still-incomplete upload) called t.Fatal without aborting the in-flight MPU, leaving an orphan upload in the bucket. Harmless in CI where the data dir is wiped on shutdown, but a real annoyance when iterating locally and a textbook AWS S3 caveat in production. Register a t.Cleanup that calls AbortMultipartUpload unless a "completed" flag was set right after a successful CompleteMultipartUpload. Use context.Background for the abort call since the parent ctx may already be cancelled at cleanup time, and t.Logf the abort error rather than failing the test so the original failure remains visible in the run output.	2026-04-25 23:06:37 -07:00
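The offset-adjusted IV and the double-adjustment bug described above can be illustrated with plain AES-CTR from the standard library. This is a minimal sketch under assumptions: `ivWithOffset` is a simplified stand-in for calculateIVWithOffset (whose real signature is not shown in the log), and the key, IV, and offsets are made-up test values.

```go
package main

import (
	"bytes"
	"crypto/aes"
	"crypto/cipher"
	"fmt"
)

// ivWithOffset advances the big-endian counter in baseIV by offset/16 blocks
// and returns the number of bytes to skip inside the first block.
func ivWithOffset(baseIV []byte, offset int64) ([]byte, int) {
	iv := append([]byte(nil), baseIV...)
	blocks := uint64(offset / aes.BlockSize)
	for i := len(iv) - 1; i >= 0 && blocks > 0; i-- {
		sum := uint64(iv[i]) + (blocks & 0xff)
		iv[i] = byte(sum)
		blocks = (blocks >> 8) + (sum >> 8) // carry into the next byte
	}
	return iv, int(offset % aes.BlockSize)
}

// ctrXOR encrypts or decrypts data with AES-CTR (the operation is symmetric).
func ctrXOR(key, iv, data []byte) []byte {
	block, err := aes.NewCipher(key)
	if err != nil {
		panic(err)
	}
	out := make([]byte, len(data))
	cipher.NewCTR(block, iv).XORKeyStream(out, data)
	return out
}

// decryptFrom decrypts ciphertext[offset:], applying the offset exactly once.
func decryptFrom(key, baseIV, ciphertext []byte, offset int64) []byte {
	iv, skip := ivWithOffset(baseIV, offset)
	start := offset - int64(skip) // enclosing block boundary
	return ctrXOR(key, iv, ciphertext[start:])[skip:]
}

// demo reports whether a single adjustment round-trips and whether a double
// adjustment corrupts the output.
func demo() (rangeOK, doubleCorrupts bool) {
	key := bytes.Repeat([]byte{0x42}, 32)
	baseIV := make([]byte, aes.BlockSize)
	baseIV[15] = 0xff // forces a carry when the counter advances
	plaintext := bytes.Repeat([]byte("0123456789abcdef"), 4)
	ciphertext := ctrXOR(key, baseIV, plaintext)

	rangeOK = bytes.Equal(decryptFrom(key, baseIV, ciphertext, 37), plaintext[37:])

	once, _ := ivWithOffset(baseIV, 32)
	twice, _ := ivWithOffset(once, 32) // the bug: offset applied a second time
	doubleCorrupts = !bytes.Equal(ctrXOR(key, twice, ciphertext[32:]), plaintext[32:])
	return rangeOK, doubleCorrupts
}

func main() {
	fmt.Println(demo())
}
```

The same reasoning explains why a backfilled IV has to use the formula the write path actually used: any other counter value selects a different keystream and the bytes on disk no longer decrypt.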
Chris Lu	5eead9409a	fix(admin): S3 Tables CSRF token + non-empty 409 status (#9221 ) * fix(admin): attach CSRF token to S3 Tables write requests Several POST/PUT/DELETE calls in s3tables.js were sent without an X-CSRF-Token header while the corresponding handlers in weed/admin/dash/s3tables_management.go enforce CSRF via requireSessionCSRFToken, so authenticated users hit "invalid CSRF token" on actions like creating a table bucket (#9220), updating policies, and managing tags. Add an s3tWriteHeaders helper that pulls the token from the existing csrf-token meta tag and use it on every write to /api/s3tables/buckets, /bucket-policy, /tables, /table-policy, and /tags. The Iceberg-page write paths already attached the token and are unchanged. Fixes #9220 * fix(admin): map BucketNotEmpty/NamespaceNotEmpty to 409 for S3 Tables DELETE on a non-empty table bucket or namespace returned HTTP 500 because s3TablesErrorStatus didn't list ErrCodeBucketNotEmpty or ErrCodeNamespaceNotEmpty in its conflict case, even though the backend handler emits them with 409 Conflict (matching AWS S3 Tables). Add both codes to the existing conflict mapping. * refactor(admin): route Iceberg S3 Tables writes through s3tWriteHeaders Iceberg namespace/table create and Iceberg table delete were still hand-rolling CSRF headers. Replace those blocks with the existing s3tWriteHeaders() helper so every S3 Tables write uses the same code path. Drop the now-unused csrfTokenInput.value population in initIcebergNamespaces and initIcebergTables (the templ hidden inputs have no server-rendered value, and nothing reads the input now that the JS reads the token from the meta tag via getCSRFToken()).	2026-04-24 22:48:41 -07:00
Chris Lu	a14cbc176b	debug(kafka): add restart flake diagnostics	2026-04-24 15:02:07 -07:00
Chris Lu	f1f720f5da	fix(master): register EC shards per physical disk on full heartbeat sync (#9212 ) (#9219 ) * refactor(types): add DiskId type for physical-disk identifiers Names the uint32 physical-disk index that volume servers carry in VolumeEcShardInformationMessage / VolumeInformationMessage, so EC shard tracking that needs to distinguish disks within a DataNode can use a dedicated type instead of an untyped uint32. No behaviour change. * fix(master): register EC shards per physical disk on full heartbeat sync (#9212) When a volume's EC shards are spread across multiple physical disks on the same volume server (common after ec.balance / ec.rebuild on multi-disk nodes), the volume server emits one VolumeEcShardInformationMessage per (disk, volume) in its heartbeat. The master's DataNode.UpdateEcShards was building a `map[VolumeId]EcVolumeInfo` with last-write-wins, and doUpdateEcShards then overwrote `disk.ecShards[vid]` once per message, so all but the final disk's shards were silently dropped. Only the topology-global ecShardMap (built via RegisterEcShards in a per-message loop) stayed correct, which hid the problem from `topo.LookupEcShards` but broke everything that reads the DataNode/Disk view — volume.list, admin UI, ec.rebuild dry-run ("only 6 shards, skipping"), and `DiskInfo.EcShardInfos` which the shell's ec.balance / ec.rebuild planners group by `eci.DiskId`. Change the shape of `Disk.ecShards` from map[VolumeId]EcVolumeInfo to map[VolumeId]map[types.DiskId]EcVolumeInfo so every physical disk keeps its own entry. UpdateEcShards aggregates incoming messages by (vid, diskId) rather than vid alone; Add/Delete/ HasVolumesById and HasEcShards consult the nested map; doUpdateEcShards rewrites the nested structure from the aggregated map. Per-physical-disk attribution survives through DataNode.ToDataNodeInfo -> DiskInfo.EcShardInfos, matching the wire format the volume server produces and what downstream admin tooling expects. 
Delta sync (AddOrUpdateEcShard / DeleteEcShard) already merged via ShardsInfo.Add, so this only affects the full-sync path that runs on heartbeat reconnect. Adds data_node_ec_multi_disk_test.go with two regression tests that fail on pre-fix master: - TestEcShardsAcrossMultipleDisksOnSameNode: volume 15 spread over 3 disks (matches the bug report's volume-2 row); asserts every shard visible via LookupEcShards, DataNode.GetEcShards, and ToDataNodeInfo's per-disk EcShardInfos entries. - TestEcShardsAfterRestartHeartbeat: minimal 2-disk full sync case. fix(topology): tighten locking around EC shard map access Addresses review comments on #9219: * DataNode.UpdateEcShards now holds dn.Lock for the full read-diff-write cycle, matching UpdateVolumes' model, so concurrent heartbeats can no longer interleave their getOrCreateDisk / UpAdjustDiskUsageDelta updates with each other. Introduces a private getEcShardsLocked helper for reads under the held lock; renames doUpdateEcShards to doUpdateEcShardsLocked for the same reason. * DataNode.HasEcShards now takes each disk's ecShardsLock while reading disk.ecShards, closing a pre-existing map race with concurrent Add/Delete/Update writers. * doUpdateEcShardsLocked takes each disk's ecShardsLock around the reset-and-rewrite so readers (GetEcShards, HasEcShards) see a consistent map state rather than a partially-rebuilt one. * Disk.GetEcShards' slice-capacity hint now accounts for the nested per-physical-disk entries (sum of inner lengths) instead of underestimating by the unique-volume count.	2026-04-24 14:01:09 -07:00
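The difference between the last-write-wins map and the per-disk nested map can be shown with simplified types. This is a sketch only: `ecShardMessage`, `flatIndex`, `nestedIndex`, and the shard distribution are hypothetical stand-ins, not the master's real structures.

```go
package main

import "fmt"

type VolumeId uint32
type DiskId uint32

// ecShardMessage is a simplified stand-in for one per-(disk, volume)
// heartbeat entry.
type ecShardMessage struct {
	Vid      VolumeId
	Disk     DiskId
	ShardIds []int
}

// flatIndex reproduces the pre-fix shape: keyed by volume only, so each
// message for the same volume overwrites the previous one.
func flatIndex(msgs []ecShardMessage) map[VolumeId][]int {
	out := make(map[VolumeId][]int)
	for _, m := range msgs {
		out[m.Vid] = m.ShardIds
	}
	return out
}

// nestedIndex aggregates by (volume, disk), so every physical disk keeps its
// own entry.
func nestedIndex(msgs []ecShardMessage) map[VolumeId]map[DiskId][]int {
	out := make(map[VolumeId]map[DiskId][]int)
	for _, m := range msgs {
		if out[m.Vid] == nil {
			out[m.Vid] = make(map[DiskId][]int)
		}
		out[m.Vid][m.Disk] = append(out[m.Vid][m.Disk], m.ShardIds...)
	}
	return out
}

// countShards sums the shards recorded across all disks of one volume.
func countShards(byDisk map[DiskId][]int) int {
	total := 0
	for _, shards := range byDisk {
		total += len(shards)
	}
	return total
}

// sampleHeartbeat spreads volume 15 over three disks (made-up distribution).
func sampleHeartbeat() []ecShardMessage {
	return []ecShardMessage{
		{Vid: 15, Disk: 0, ShardIds: []int{0, 1, 2, 3, 4}},
		{Vid: 15, Disk: 1, ShardIds: []int{5, 6, 7, 8, 9}},
		{Vid: 15, Disk: 2, ShardIds: []int{10, 11, 12, 13}},
	}
}

func main() {
	msgs := sampleHeartbeat()
	fmt.Println("flat:", len(flatIndex(msgs)[15]), "shards visible")
	fmt.Println("nested:", countShards(nestedIndex(msgs)[15]), "shards visible")
}
```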
Chris Lu	d65c568cbb	fix(s3api): validate SSE-S3 chunk IV length; add multipart direct reader tests (#9218 ) * fix(s3api): validate SSE-S3 chunk IV length; add multipart direct reader tests DeserializeSSES3Metadata does not require an IV, and a corrupted or legacy chunk without one would have flowed into cipher.NewCTR and panicked. Validate that each per-chunk IV is exactly AESBlockSize bytes before decryption, closing the current and any already-appended chunk readers on error. Factor the per-chunk decryption loop out of createMultipartSSES3DecryptedReaderDirect into buildMultipartSSES3Reader so it can be driven with a mock chunk fetcher, and add tests covering: the happy path with two parts (distinct per-chunk DEKs/IVs, out-of-order chunks) to lock in the fix from #9211; missing-IV and short-IV metadata rejection without panic; and reader cleanup when a later chunk fails. * address review: sort chunks copy; close encryptedStream on error - buildMultipartSSES3Reader now sorts a copy of the chunks slice so callers do not observe entry.Chunks reordered (other code paths, e.g. ETag computation, can rely on the original order). - createMultipartSSES3DecryptedReaderDirect now closes encryptedStream on the error path from buildMultipartSSES3Reader. All current callers pass nil, but this keeps cleanup symmetric with the success path. - Extend TestBuildMultipartSSES3Reader_PerChunkKeys to assert the input slice is not mutated. * address review: defer single close; extend chunk-copy + IV-guard pattern - createMultipartSSES3DecryptedReaderDirect: collapse the duplicated encryptedStream.Close() calls into a single nil-guarded defer so the error and success paths share cleanup. - createMultipartSSECDecryptedReaderDirect, createMultipartSSEKMSDecryptedReaderDirect: sort a copy of entry.Chunks instead of mutating the caller's slice, matching the SSE-S3 helper. 
- createMultipartSSECDecryptedReaderDirect: validate per-chunk IV length before handing it to cipher.NewCTR; a base64-decoded empty or short IV from malformed/corrupt metadata would otherwise panic. - SSE-KMS needs no IV guard: CreateSSEKMSDecryptedReader already calls ValidateIV before cipher.NewCTR. Note recorded in the sort comment. * address review: close appended readers on SSE-C/SSE-KMS error paths createMultipartSSECDecryptedReaderDirect and createMultipartSSEKMSDecryptedReaderDirect only closed the current chunk reader on error and leaked any chunk readers already appended to the local readers slice, mirroring the leak previously fixed in the SSE-S3 helper. Add the same closeAppendedReaders() closure pattern to both functions and invoke it on every error return inside the loop so failed requests do not leak volume-server HTTP connections. * address review: defer encryptedStream close in SSE-C/SSE-KMS; drop chunks reassignment - Move encryptedStream.Close() to a nil-guarded defer at the top of createMultipartSSECDecryptedReaderDirect and createMultipartSSEKMSDecryptedReaderDirect so the stream is closed on every return path (including error returns from inside the per-chunk loop), mirroring the SSE-S3 helper. - In buildMultipartSSES3Reader, iterate sortedChunks directly instead of reassigning chunks = sortedChunks.	2026-04-24 13:59:23 -07:00
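Two of the patterns above, sorting a copy of the chunk slice and checking IV length before building a CTR stream, are sketched below. The `chunk` type, `sortedByOffset`, and `prepareChunks` are hypothetical; `validateIV` only mimics the error wording quoted elsewhere in this log and is not the project's ValidateIV.

```go
package main

import (
	"crypto/aes"
	"fmt"
	"sort"
)

// chunk is a simplified stand-in for a file chunk with per-chunk SSE metadata.
type chunk struct {
	FileId string
	Offset int64
	IV     []byte
}

// validateIV rejects IVs that cipher.NewCTR would panic on.
func validateIV(iv []byte, name string) error {
	if len(iv) != aes.BlockSize {
		return fmt.Errorf("invalid %s length: expected %d bytes, got %d",
			name, aes.BlockSize, len(iv))
	}
	return nil
}

// sortedByOffset returns a sorted copy and leaves the caller's slice untouched.
func sortedByOffset(chunks []chunk) []chunk {
	sorted := make([]chunk, len(chunks))
	copy(sorted, chunks)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Offset < sorted[j].Offset })
	return sorted
}

// prepareChunks orders the chunks and validates every IV before any
// decryption stream is created.
func prepareChunks(chunks []chunk) ([]chunk, error) {
	sorted := sortedByOffset(chunks)
	for _, c := range sorted {
		if err := validateIV(c.IV, "SSE-S3 chunk "+c.FileId+" IV"); err != nil {
			return nil, err
		}
	}
	return sorted, nil
}

func main() {
	input := []chunk{
		{FileId: "c1", Offset: 8 << 20, IV: make([]byte, 16)},
		{FileId: "c0", Offset: 0, IV: make([]byte, 16)},
	}
	prepared, err := prepareChunks(input)
	fmt.Println(prepared[0].FileId, input[0].FileId, err)

	_, err = prepareChunks([]chunk{{FileId: "bad", IV: make([]byte, 4)}})
	fmt.Println(err)
}
```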
Chris Lu	fe1d7a404d	fix(iam): substitute dynamic jwt:/saml:/oidc: claim variables in policies (#9217 ) * fix(iam): expand arbitrary jwt:/saml:/oidc: claim variables in policies The policy engine gated variable substitution on a fixed allowlist (jwt:sub, jwt:iss, jwt:aud, jwt:preferred_username), so patterns like arn:aws:s3:::softs/${jwt:project_path}/* were passed through as literals and never matched the requested resource. Dynamic claims from OIDC providers (e.g. GitLab CI's project_path / namespace_path) could not be used to scope policies. Allow any jwt:/saml:/oidc: prefixed variable to be substituted when the claim is present in RequestContext. These values originate from a cryptographically verified identity token (the STS session JWT or federated assertion), and the claim names are controlled by the trusted identity provider, so the dynamic prefix is safe. Missing claims keep the placeholder intact so the statement still fails to match. Numeric JWT claims (JSON-decoded as float64) are now stringified so patterns like ${jwt:project_id} work the same as string claims. Fixes #9214 * fix(iam): cover all integer widths in claim stringification Address PR review: stringifyClaimValue only handled int/int32/int64 on the signed side and nothing on the unsigned side, so int8, int16, uint, uint8, uint16, uint32, and uint64 claim values fell through to the default branch and the placeholder was left unsubstituted. JSON's generic decoder produces float64/json.Number for numbers, but RequestContext can also be populated from typed sources (custom providers or internal code), so cover all common integer widths - signed and unsigned - explicitly. Extend TestStringifyClaimValue to assert each supported type.	2026-04-24 13:08:24 -07:00
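The prefix-based variable substitution described above might be sketched as follows. Everything here is an assumption for illustration: `expandClaims`, `stringifyClaim`, the regular expression, and the idea that claims are keyed by their prefixed name (for example `jwt:project_path`) are not taken from the policy engine's code.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// claimVar matches ${jwt:...}, ${saml:...} and ${oidc:...} placeholders.
var claimVar = regexp.MustCompile(`\$\{((?:jwt|saml|oidc):[^}]+)\}`)

// stringifyClaim converts a few common claim value types to strings.
// JSON-decoded numbers arrive as float64.
func stringifyClaim(v any) (string, bool) {
	switch t := v.(type) {
	case string:
		return t, true
	case float64:
		return strconv.FormatFloat(t, 'f', -1, 64), true
	case int:
		return strconv.Itoa(t), true
	case int64:
		return strconv.FormatInt(t, 10), true
	case uint64:
		return strconv.FormatUint(t, 10), true
	default:
		return "", false
	}
}

// expandClaims substitutes every placeholder whose claim is present. Missing
// or unsupported claims keep the placeholder intact, so the pattern cannot
// match a real resource.
func expandClaims(pattern string, claims map[string]any) string {
	return claimVar.ReplaceAllStringFunc(pattern, func(m string) string {
		name := claimVar.FindStringSubmatch(m)[1]
		if v, ok := claims[name]; ok {
			if s, ok := stringifyClaim(v); ok {
				return s
			}
		}
		return m
	})
}

func main() {
	claims := map[string]any{
		"jwt:project_path": "group/app",
		"jwt:project_id":   float64(42),
	}
	fmt.Println(expandClaims("arn:aws:s3:::softs/${jwt:project_path}/*", claims))
	fmt.Println(expandClaims("arn:aws:s3:::ci/${jwt:project_id}/*", claims))
	fmt.Println(expandClaims("arn:aws:s3:::x/${jwt:absent}/*", claims))
}
```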
os-pradipbabar	8815844278	fix(s3api): correct SSE-S3 decryption key handling in multipart uploads (#9211 ) * fix(s3api): correct SSE-S3 decryption key handling in multipart uploads * fix(s3api): preallocate readers and close on error in SSE-S3 direct path Address review feedback on createMultipartSSES3DecryptedReaderDirect: preallocate the readers slice with the known chunk count, and close any already-appended chunk readers on error returns so failed requests do not leak volume-server HTTP connections. --------- Co-authored-by: Chris Lu <chris.lu@gmail.com>	2026-04-24 12:00:29 -07:00
Lisandro Pin	93247d6de4	Export REST file_{read,write}_failures metrics on volume servers (#9215 ) * Export gRPC `file_{read,write}_failures` metrics on volume servers. Allows tracking overall R/W errors in real time through Prometheus. Will follow up with a PR for Seaweed's REST API. * Export REST `file_{read,write}_failures` metrics on volume servers.	2026-04-24 11:45:21 -07:00
dependabot[bot]	352ffdffe1	build(deps): bump rustls-webpki from 0.103.10 to 0.103.13 in /seaweed-volume (#9216 ) build(deps): bump rustls-webpki in /seaweed-volume Bumps [rustls-webpki](https://github.com/rustls/webpki) from 0.103.10 to 0.103.13. - [Release notes](https://github.com/rustls/webpki/releases) - [Commits](https://github.com/rustls/webpki/compare/v/0.103.10...v/0.103.13) --- updated-dependencies: - dependency-name: rustls-webpki dependency-version: 0.103.13 dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-04-24 11:44:20 -07:00
Lars Lehtonen	29e14f89f1	fix(weed/command) address unhandled errors (#9208 ) * fix(weed/command) address unhandled errors * fix(command): don't log graceful-shutdown sentinels; plug response-body leak - s3: Serve on unix socket treated http.ErrServerClosed as fatal; now excluded like the other Serve/ServeTLS paths in this file. - mq_agent, mq_broker: filter grpc.ErrServerStopped so clean shutdown doesn't log as an error. - worker_runtime: the added decodeErr early-continue skipped resp.Body.Close(); drop it since the existing check below already surfaces the decode error. - mount_std: the pre-mount Unmount commonly fails when nothing is mounted; demote to V(1) Infof. - fuse_std: tidy panic message to match sibling cases. * fix(mq_broker): filter grpc.ErrServerStopped on localhost listener The localhost listener goroutine logged any Serve error unconditionally, which includes grpc.ErrServerStopped on graceful shutdown. Match the main listener's check so clean stops don't surface as errors. --------- Co-authored-by: Chris Lu <chris.lu@gmail.com>	2026-04-23 22:15:05 -07:00
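Filtering a graceful-shutdown sentinel before logging can be sketched with the standard library as below. This covers only the `http.ErrServerClosed` case; `isFatalServeError` and the sample errors are hypothetical, and the gRPC sentinel mentioned above would be handled the same way with its own package.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// Sample errors for the demonstration (made-up messages).
var (
	wrappedShutdown = fmt.Errorf("s3 unix socket serve: %w", http.ErrServerClosed)
	bindFailure     = errors.New("listen: bind: address already in use")
)

// isFatalServeError reports whether a Serve error deserves an error log.
// http.ErrServerClosed is what Serve returns after Shutdown or Close, so it
// signals a clean stop. errors.Is also sees through %w wrapping.
func isFatalServeError(err error) bool {
	return err != nil && !errors.Is(err, http.ErrServerClosed)
}

func main() {
	fmt.Println(isFatalServeError(nil))
	fmt.Println(isFatalServeError(wrappedShutdown))
	fmt.Println(isFatalServeError(bindFailure))
}
```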
Chris Lu	88c2f3c34d	fix(iam): accept bare "*" resource in PutUserPolicy (#9209 ) (#9210 ) AWS IAM treats a bare "*" in a statement's Resource as "any resource", but the embedded IAM resource parser required a 6-segment S3 ARN and silently skipped anything else. With a policy like {Action: "s3:*", Resource: "*"}, every resource was dropped and the statement produced no actions, so PutUserPolicy rejected the document with "no valid actions found in policy document". Short-circuit Resource == "*" to the same full-wildcard path that "arn:aws:s3:::*" already takes.	2026-04-23 22:14:41 -07:00
Chris Lu	da2e90aefd	fix(mount): sanitize non-UTF-8 filenames; keep marshal errors per-request (#9207 ) * fix(mount): sanitize non-UTF-8 filenames; keep marshal errors per-request (#9139) A single file with invalid-UTF-8 bytes in its name (e.g. a GNOME Trash "partial" like \x10\x98=\\\x8a\x7f.trashinfo.9a51454f.partial) made every FUSE-initiated filer RPC fail with: rpc error: code = Internal desc = grpc: error while marshaling: string field contains invalid UTF-8 and then produced an avalanche of "connection is closing" errors on unrelated LookupEntry / ReadDirAll / UpdateEntry calls, causing the volume-server QPS dips reported in #9139. Root cause is twofold: 1. Proto3 `string` fields require valid UTF-8, but the FUSE kernel passes raw name bytes. Create/Mknod/Mkdir/Unlink/Rmdir/Rename/Lookup/Link/ Symlink all forwarded those bytes directly into CreateEntryRequest.Name, DeleteEntryRequest.Name, StreamRenameEntryRequest.{Old,New}Name and Entry.Name. saveDataAsChunk also copied the FullPath into AssignVolumeRequest.Path unchecked. 2. When the marshal failed, shouldInvalidateConnection treated the resulting codes.Internal as a connection problem and dropped the shared cached ClientConn — canceling every other in-flight RPC on it. Fix: - Add sanitizeFuseName (strings.ToValidUTF8 with '?' replacement, matching util.FullPath.DirAndName) and make checkName return the sanitized name. Apply at every FUSE entry point that passes a name to the filer RPC, including Unlink/Rmdir (which did not previously call checkName) and both oldName/newName in Rename. Add a backstop scrub for AssignVolumeRequest.Path so async flush paths cannot reintroduce invalid bytes from a pre-sanitization cached FullPath. - In weed/pb.shouldInvalidateConnection, detect client-side marshal errors via the gRPC library's "error while marshaling" prefix and return false: the connection is healthy, only the request is bad. 
Refs: https://github.com/seaweedfs/seaweedfs/issues/9139#issuecomment-4301184231 * fix(mount,util): use '_' for invalid-UTF-8 replacement (URL-safe) Sanitized filenames flow downstream into HTTP URLs (volume-server uploads, filer HTTP API, S3/WebDAV gateways). '?' is the URL query-string delimiter and would split the path the first time the name lands in one, so swap every invalid-UTF-8 replacement to '_'. This covers the two pre-existing sites in weed/util/fullpath.go as well, keeping all paths sanitized the same way. * refactor(pb): detect client-side marshal errors via errors.As, not substring Replace the raw `strings.Contains(err.Error(), ...)` check with a type-based carve-out: use errors.As against the `GRPCStatus() Status` interface to pull the original Status out of any fmt.Errorf("...: %w") wrapping, then match the library-owned "grpc:" prefix on that Status's Message. Why not errors.Is against a proto-level sentinel: gRPC's encode() collapses the inner proto error with "%v" (stringification) before wrapping it in a Status, so the original error type does not survive into the caller. The Status itself is the structural signal that does survive. Why not status.FromError: when the caller wraps the Status error with fmt.Errorf("...: %w", ...), status.FromError rewrites Status.Message with the full err.Error() of the outermost wrapper, which defeats a prefix check on the library-owned message. errors.As gives us the original Status whose Message is still verbatim from the gRPC library. A new test asserts that a plain errors.New("grpc: error while marshaling: …") — i.e. the same text attached to something that is NOT a gRPC status — does not short-circuit invalidation, so we never silently keep a cached connection alive based on a coincidental substring match. refactor(util): centralize UTF-8 sanitization; add FullPath.Sanitized Addresses review feedback on PR #9207. 
Nitpick: every invalid-UTF-8 replacement across the codebase (DirAndName, Name, mount.sanitizeFuseName, the weedfs_write.go backstop) now goes through a single util.SanitizeUTF8Name helper, so the replacement char ('_' — URL-safe) is chosen in one place. Outside-diff: three proto fields took raw FullPath strings that could break marshaling if an entry ever carried invalid UTF-8 (CreateEntryRequest.Directory in Mkdir, DeleteEntryRequest.Directory in Unlink, AssignVolumeRequest.Path in command_fs_merge_volumes). The reviewer's suggested fix — using DirAndName() — would have silently changed Directory from parent to grandparent, because DirAndName sanitizes only the trailing component. Added FullPath.Sanitized(), which scrubs every component, and applied it at the three sites. Exposure is narrow in practice (FUSE-boundary sanitization and the gRPC-side isClientSideMarshalError carve-out already cover the #9139 cascade), but the defense-in-depth is cheap and consistent with the existing AssignVolume backstop. New tests in weed/util/fullpath_test.go document: - SanitizeUTF8Name: valid UTF-8 passes through unchanged; invalid bytes become '_' (not '?', which is URL-special). - FullPath.Sanitized: scrubs bytes in any component, not just the last. - FullPath.DirAndName: dir remains raw on purpose — callers needing a clean full path must use Sanitized(). The test pins this behavior so it is not accidentally "fixed" in a way that changes the (dir, name) semantics callers depend on.	2026-04-23 19:17:35 -07:00
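The sanitization behaviour described above can be approximated with `strings.ToValidUTF8`. The sketch below is hypothetical: `sanitizeUTF8Name`, `dirAndName`, and `sanitizedPath` imitate the helpers named in the commit message but are not their implementations. Note that `strings.ToValidUTF8` replaces each run of invalid bytes with a single replacement string.

```go
package main

import (
	"fmt"
	"strings"
)

// sanitizeUTF8Name replaces invalid UTF-8 with '_', which is safe inside URLs
// (unlike '?', the query-string delimiter).
func sanitizeUTF8Name(name string) string {
	return strings.ToValidUTF8(name, "_")
}

// dirAndName sanitizes only the trailing component; the directory part is
// returned raw, mirroring the (dir, name) semantics described above.
func dirAndName(p string) (dir, name string) {
	i := strings.LastIndex(p, "/")
	if i < 0 {
		return "", sanitizeUTF8Name(p)
	}
	return p[:i], sanitizeUTF8Name(p[i+1:])
}

// sanitizedPath scrubs every component. Because '/' is valid ASCII, running
// the replacement over the whole path never merges two components.
func sanitizedPath(p string) string {
	return strings.ToValidUTF8(p, "_")
}

func main() {
	raw := "/dir\xff/file\xfe.partial"
	dir, name := dirAndName(raw)
	fmt.Printf("%q %q\n", dir, name)
	fmt.Printf("%q\n", sanitizedPath(raw))
}
```

Callers that put a full path into a proto3 `string` field would want the fully sanitized form, since the raw directory part can still carry invalid bytes.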
Chris Lu	a0be40e070	Merge branch 'master' of https://github.com/seaweedfs/seaweedfs	2026-04-23 16:25:12 -07:00
Chris Lu	b94ad82472	fix(test): stabilize ConcurrentLockContention; warn on coherence drift TestPosixFileLocking/ConcurrentLockContention failed in CI (run 24857323067) with ENOENT when re-opening the file after all 8 workers had successfully written and closed. The 20s openWithRetry budget was exhausted, pointing at a real but unproven metaCache/parent-cache coherence issue in the mount under bursts of concurrent Release. Test: hold the initial fd open for the whole subtest; use it for the post-workers Sync() and the verification read. Workers still exercise the concurrent-flock invariant and per-record write correctness; the re-open path is no longer load-bearing. On Eventually failure, dump ReadDir of the parent, Stat, and a fresh O_RDONLY open so a future recurrence has state to debug from. Drop the darwin-only ENOENT t.Skip branches that hid this same flake. Mount: in weedfs.lookupEntry, when returning ENOENT from the "parent cached but child missing" branch, log at Warningf instead of V(4) when the kernel is still tracking this path's inode. That combination is the smoking-gun signal for cache drift and is rare enough in normal use not to spam the log.	2026-04-23 15:57:35 -07:00
dependabot[bot]	cd5004cfbd	build(deps): bump github.com/Azure/go-ntlmssp from 0.1.0 to 0.1.1 in /test/kafka (#9204 ) build(deps): bump github.com/Azure/go-ntlmssp in /test/kafka Bumps [github.com/Azure/go-ntlmssp](https://github.com/Azure/go-ntlmssp) from 0.1.0 to 0.1.1. - [Release notes](https://github.com/Azure/go-ntlmssp/releases) - [Commits](https://github.com/Azure/go-ntlmssp/compare/v0.1.0...v0.1.1) --- updated-dependencies: - dependency-name: github.com/Azure/go-ntlmssp dependency-version: 0.1.1 dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-04-23 15:02:28 -07:00
dependabot[bot]	5cbcfd311c	build(deps): bump github.com/Azure/go-ntlmssp from 0.1.0 to 0.1.1 (#9205 ) Bumps [github.com/Azure/go-ntlmssp](https://github.com/Azure/go-ntlmssp) from 0.1.0 to 0.1.1. - [Release notes](https://github.com/Azure/go-ntlmssp/releases) - [Commits](https://github.com/Azure/go-ntlmssp/compare/v0.1.0...v0.1.1) --- updated-dependencies: - dependency-name: github.com/Azure/go-ntlmssp dependency-version: 0.1.1 dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-04-23 15:02:17 -07:00
Chris Lu	76f361fa77	fix(helm): gate S3 TLS cert args on httpsPort to stop probe failures (#9202 ) (#9206 ) * fix(helm): gate S3 TLS cert args on httpsPort to stop probe failures (#9202) With `global.seaweedfs.enableSecurity=true` and the default `s3.httpsPort=0`, the chart was unconditionally passing `-cert.file` / `-key.file` to the S3 frontend. In `weed/command/s3.go`, when `tlsPrivateKey != ""` and `portHttps == 0`, the server promotes its main `-port` (8333 by default) into an HTTPS listener. The pod's readiness / liveness probes still use `scheme: HTTP`, so every kubelet probe produces http: TLS handshake error from <node-ip>:<port>: client sent an HTTP request to an HTTPS server in the pod log, as reported in #9202. `enableSecurity=true` is supposed to activate security.toml / gRPC mTLS, not silently flip the S3 HTTP port to HTTPS. Move the `seaweedfs.s3.tlsArgs` include inside the `if httpsPort` guard in all three templates that wire up an S3 frontend (standalone S3 deployment, filer with S3 sub-server, all-in-one deployment). The TLS cert args are now emitted only when the user explicitly opts into an HTTPS port; the main `-port` stays HTTP so probes work. Also add a regression test to `.github/workflows/helm_ci.yml` that renders all three templates with and without `httpsPort` and asserts the cert/key/ `-port.https` args are emitted together or not at all. * test(helm): add bash -n parse check to the S3 TLS-gating regression test Addresses gemini-code-assist review comment on #9206 flagging a potential "dangling backslash" shell-syntax risk in the rendered all-in-one command script when httpsPort is set but most S3/SFTP args are defaulted off. In practice bash -n accepts a trailing `\<newline><EOF>` (it's line-continuation to an empty line), so no current rendering is broken. 
Locking that contract down in CI so a future helper change that leaves a dangling backslash — or any other shell-syntax regression in the rendered command — fails loudly instead of silently shipping broken pods.	2026-04-23 15:00:07 -07:00
Chris Lu	3d39324bc1	fix(nfs): make Linux `mount -t nfs` work without client workaround (#9199 ) (#9201 ) * fix(nfs): make Linux `mount -t nfs` work without client-side workaround (#9199) The upstream go-nfs library serves NFSv3 + MOUNT on a single TCP port and does not register with portmap. Linux mount.nfs queries portmap on port 111 first, so the plain `mount -t nfs host:/export /mnt` form failed with "portmap query failed" / "requested NFS version or transport protocol is not supported" against a default `weed nfs` deployment. - Add a minimal PORTMAP v2 responder (weed/server/nfs/portmap.go) with TCP+UDP listeners implementing PMAP_NULL, PMAP_GETPORT, PMAP_DUMP, and proper PROG_MISMATCH / PROG_UNAVAIL / PROC_UNAVAIL responses. Advertises NFS v3 TCP and MOUNT v3 TCP at the configured NFS port. - New CLI flag `-portmap.bind` (empty, disabled by default) to opt into the responder. Binding port 111 requires root or CAP_NET_BIND_SERVICE and must not collide with a system rpcbind. - Extended `weed nfs -h` help with the two supported ways to mount from Linux (client-side portmap bypass, or server-side `-portmap.bind`). - Startup log now prints a copy-pasteable mount command tailored to whether portmap is enabled. Unit tests cover RPC/XDR parsing, accept-stat paths, and a TCP+UDP round-trip against the real listener. Verified in a privileged Debian 12 container: with `-portmap.bind=0.0.0.0` the exact command from #9199 (`mount -t nfs -o nfsvers=3,nolock host:/export /mnt`) now succeeds and both read and write work. * fix(nfs): harden portmap responder per review feedback (#9201) Addresses three review findings on the portmap responder: - parseRPCCall: validate opaque_auth length against the record limit before applying the XDR 4-byte padding, so a near-uint32-max authLen can no longer overflow (authLen + 3) and bypass the bounds check. 
(gemini-code-assist) - serveTCP/Close: track live TCP connections and evict them on Close() so shutdown does not block on idle clients waiting for the read deadline to trip. serveTCP also no longer tears the listener down on a non-fatal Accept error (e.g. EMFILE); it logs and retries after a small back-off. Replaces the atomic.Bool closed flag with a mutex-guarded one so closed, conns, and the shutdown transition stay consistent. (coderabbit, minor) - handleTCPConn: apply per-IO read/write deadlines (30s idle, 10s in-flight) so a peer that opens the privileged port 111 and stalls cannot pin a goroutine indefinitely. (coderabbit, major) Adds TestPortmapServer_CloseEvictsIdleTCPConn, which holds a TCP connection idle and asserts Close() returns within 2s (well under the 30s idle deadline) and that the client sees the eviction. All existing tests still pass, including under -race. * fix(nfs): keep portmap UDP responder alive on transient read errors (#9201) - serveUDP: on a non-shutdown ReadFromUDP error, log, back off, and continue instead of returning. Matches how serveTCP now treats non-fatal Accept errors so a transient network blip doesn't take UDP portmap down until restart. (coderabbit) - Rename portmapAcceptBackoff -> portmapRetryBackoff now that both paths use it. - pmapProcDump: fix the pre-allocation capacity to match the actual encoding (20 bytes per entry + 4-byte terminator), replacing the old over-estimate of 24 per entry. No behavior change; just documents intent. (coderabbit nit) * docs(nfs): clarify encodeAcceptedReply body semantics (#9201) The prior comment said body is "nil when the accept_stat is itself an error", which was misleading: the PROG_MISMATCH branch already passes an 8-byte mismatch_info body. Rewrite to enumerate which error accept_stat values omit the body and call out PROG_MISMATCH as the exception, referencing RFC 5531 §9. Comment-only. 
(coderabbit nit) * fix(nfs): make portmap retry backoff interruptible by Close() (#9201) serveTCP and serveUDP both sleep portmapRetryBackoff (50ms) after a non-fatal listener error. If Close() races in during that sleep, the goroutine can't be interrupted, so Close() has to wait out the remaining backoff before wg.Wait() returns. Add a done channel that Close() closes once, and replace both time.Sleep calls with a select on ps.done + time.After. The window was tiny in practice but the select makes shutdown strictly bounded by Close()'s own work. (coderabbit nit)	2026-04-23 13:53:53 -07:00
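The interruptible backoff from the last bullet of the entry above can be sketched as a `select` on a done channel and `time.After`. The `server` type, `newServer`, and `backoff` are hypothetical names chosen for illustration; only the done-channel-plus-`time.After` shape comes from the commit message.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// server is a hypothetical stand-in for the portmap responder.
type server struct {
	done      chan struct{}
	closeOnce sync.Once
}

func newServer() *server {
	return &server{done: make(chan struct{})}
}

// backoff waits for d, or returns early if Close is called first.
// It reports true when the full backoff elapsed and the caller should retry.
func (s *server) backoff(d time.Duration) bool {
	select {
	case <-s.done:
		return false
	case <-time.After(d):
		return true
	}
}

// Close closes the done channel exactly once, waking any sleeping goroutine.
func (s *server) Close() {
	s.closeOnce.Do(func() { close(s.done) })
}

func main() {
	s := newServer()
	go func() {
		time.Sleep(10 * time.Millisecond)
		s.Close()
	}()
	start := time.Now()
	retry := s.backoff(time.Second)
	fmt.Println("retry:", retry, "returned early:", time.Since(start) < time.Second)
}
```

Compared with a bare `time.Sleep`, shutdown no longer has to wait out the remaining backoff, and `sync.Once` keeps a repeated `Close` from panicking on a double close.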
FQHSLycopene	20f4fd9985	fix(storage): use ceil division for EC shard slots in maxVolumeCount (#9196 ) * fix(storage): use ceil division for EC shard slots in maxVolumeCount * fix(topology): use ceil division for EC shard slots consistently Applies the same ceiling-division formula used in store.go to the four remaining master-side sites that computed volume-slots consumed by EC shards with off-by-one approximations: - disk.go ToDiskInfo / Disk.ToDiskInfo used (n+1)/d, which under-counts slots for non-multiples of DataShardsCount, over-reporting FreeVolumeCount. - DiskUsageCounts.FreeSpace and NodeImpl.AvailableSpaceFor subtracted n/d + 1, which over-counts slots at multiples of DataShardsCount, under-reporting free space (and suppressing volume growth on nodes that still had room). All four now use (n + DataShardsCount - 1) / DataShardsCount, matching store.go:393, store.go:810, and command_ec_decode.go:422. * refactor(topology): extract ecShardSlots helper Deduplicates the (n + DataShardsCount - 1) / DataShardsCount ceiling expression now used by ToDiskInfo, DiskUsageCounts.FreeSpace, Disk.ToDiskInfo, and AvailableSpaceFor. Addresses PR review feedback. --------- Co-authored-by: Chris Lu <chris.lu@gmail.com>	2026-04-23 13:52:58 -07:00
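The three slot formulas compared in the entry above can be checked side by side with a small sketch. `dataShardsCount` is assumed to be 10 here purely for illustration, and `ecShardSlots` is written from the expression quoted in the commit message rather than copied from the repository.

```go
package main

import "fmt"

// dataShardsCount is assumed to be 10 for this illustration.
const dataShardsCount = 10

// ecShardSlots returns the volume slots consumed by n EC shards, rounding up
// so that a partially filled group of shards still occupies a whole slot.
func ecShardSlots(n int) int {
	return (n + dataShardsCount - 1) / dataShardsCount
}

func main() {
	fmt.Println(" n  ceil  (n+1)/d  n/d+1")
	for _, n := range []int{0, 1, 9, 10, 11, 20} {
		fmt.Printf("%2d  %4d  %7d  %5d\n",
			n, ecShardSlots(n), (n+1)/dataShardsCount, n/dataShardsCount+1)
	}
}
```

With these inputs, `(n+1)/d` reports 0 slots for a single shard (an under-count), while `n/d + 1` reports 2 slots for exactly 10 shards (an over-count); the ceiling form gives 1 in both cases.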
faspix	0fcd5173be	fix(admin): use basePath for API fetches when urlPrefix is set (#9197) * fix(admin): use basePath for API fetches when urlPrefix is set * fix(admin): drop duplicate iam-utils script on Groups page * fix(admin): route topics page fetches through basePath The Topics page missed two fetch() calls that still used root-relative URLs, so create-topic and view-details still broke when -urlPrefix was set. --------- Co-authored-by: Maksim Babkou <maksim.babkou@innovatrics.com> Co-authored-by: Chris Lu <chris.lu@gmail.com>	2026-04-23 11:55:07 -07:00
Chris Lu	749430dceb	fix(filer.meta.tail): include extended metadata in Elasticsearch docs (#9200) * fix(filer.meta.tail): include extended metadata in Elasticsearch docs The -es sink flattened only the FUSE attributes, so xattrs (including S3 user metadata like X-Amz-Meta-) never reached Elasticsearch. Add an Extended field and convert map[string][]byte to map[string]string so the values index as text; non-UTF-8 values fall back to base64. Addresses #9190 follow-up. fix(filer.meta.tail): prefix base64-encoded extended values with "base64:" Addresses review feedback: a plain UTF-8 xattr and a base64 fallback are otherwise indistinguishable to a consumer reading the ES doc.	2026-04-23 11:54:08 -07:00
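The extended-metadata conversion described in the entry above can be sketched as below. `extendedToStrings` is a hypothetical name, and the choice of standard (padded) base64 is an assumption; the commit message only states that non-UTF-8 values fall back to base64 with a `base64:` prefix.

```go
package main

import (
	"encoding/base64"
	"fmt"
	"unicode/utf8"
)

// extendedToStrings converts raw xattr values into strings that index as text.
// Valid UTF-8 is kept verbatim; anything else is base64-encoded and prefixed
// with "base64:" so a consumer can tell the two cases apart.
// The function name and the use of StdEncoding are assumptions.
func extendedToStrings(ext map[string][]byte) map[string]string {
	out := make(map[string]string, len(ext))
	for k, v := range ext {
		if utf8.Valid(v) {
			out[k] = string(v)
		} else {
			out[k] = "base64:" + base64.StdEncoding.EncodeToString(v)
		}
	}
	return out
}

func main() {
	doc := extendedToStrings(map[string][]byte{
		"X-Amz-Meta-Owner": []byte("alice"),
		"user.binary":      {0xff, 0xfe},
	})
	fmt.Println(doc["X-Amz-Meta-Owner"])
	fmt.Println(doc["user.binary"])
}
```

One limitation of any prefix scheme like this: a plain UTF-8 value that itself begins with `base64:` remains ambiguous to the reader of the document.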
Chris Lu	036191c78a	Merge branch 'master' of https://github.com/seaweedfs/seaweedfs	2026-04-23 11:09:59 -07:00
Chris Lu	34b236acfa	test(s3api): look up NewUser by name in CreateAccessKey collision test The memory credential store backs LoadConfiguration with a map, so the identity order is not stable across a save/load round trip. Indexing Identities[1] intermittently pointed at the owner identity and produced a spurious credential leak.	2026-04-23 11:09:17 -07:00
steve.wei	1a7ab2ea82	fix(upload): keep Content-MD5 on 204 unchanged writes (#9198) Return Content-MD5 in the volume unchanged-write response and read it in the uploader 204 path so multipart chunk ETag metadata is preserved.	2026-04-23 10:59:59 -07:00
Chris Lu	ae93f87a46	adjust logo	2026-04-23 10:05:51 -07:00
Chris Lu	6e950e0e7e	docs(note): add production-setup slide deck Marp-based markdown deck walking through the three-layer production topology (masters, volumes, filers + DB, S3 gateways, admin + workers) plus an erasure-coding note. Uses the object-store-layout diagram on the overview slide. Makefile renders PDF/HTML/PPTX via marp-cli.	2026-04-23 02:36:58 -07:00
dependabot[bot]	ede766645a	build(deps): bump github.com/jackc/pgx/v5 from 5.9.0 to 5.9.2 (#9194 ) Bumps [github.com/jackc/pgx/v5](https://github.com/jackc/pgx) from 5.9.0 to 5.9.2. - [Changelog](https://github.com/jackc/pgx/blob/master/CHANGELOG.md) - [Commits](https://github.com/jackc/pgx/compare/v5.9.0...v5.9.2) --- updated-dependencies: - dependency-name: github.com/jackc/pgx/v5 dependency-version: 5.9.2 dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-04-22 18:33:15 -07:00
Chris Lu	592d6d6021	fix(filer/remote): keep re-cache work alive past caller cancellation (#9174 ) (#9193 ) * fix(filer/remote): keep re-cache work alive past caller cancellation (#9174) For multi-GB remote blobs, doCacheRemoteObjectToLocalCluster cannot finish before the S3 gateway's initial cache wait elapses. When it does, the gRPC ctx cancellation cascades into the filer's chunk downloads, the error path calls DeleteUncommittedChunks on every chunk already written, and the next retry starts over. boto3 splitting the GET into concurrent ranges (or any client tear-down on first failure) shortens the window between retries, so the loop never converges. Detach the caller's ctx with context.WithoutCancel before invoking the singleflight work so the download runs to completion regardless of client cancellations. Subsequent waiters — via the in-flight singleflight, or a fresh retry landing after completion — observe the cached entry and stream normally. Same detach pattern is used in filer_server_handlers_write.go:53 and volume_server_handlers_write.go:51. * simplify rationale comment * switch to DoChan so handler can return on caller cancel Do keeps the handler goroutine blocked for the full detached download even after the client is gone. DoChan lets the handler select on ctx.Done() and exit immediately; the singleflight goroutine continues on bgCtx and the next request either joins it or finds the entry cached.	2026-04-22 17:56:15 -07:00
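The detach-and-return pattern from the entry above can be sketched with the standard library alone. This sketch deliberately omits singleflight deduplication: a plain goroutine and a buffered channel stand in for `DoChan`, and `fetch` and `detachedSurvivesCancel` are hypothetical names. `context.WithoutCancel` requires Go 1.21 or later.

```go
package main

import (
	"context"
	"fmt"
)

// fetch runs work on a context detached from the caller, so the work keeps
// going after the caller cancels, while fetch itself returns as soon as
// either the result is ready or the caller gives up.
func fetch(ctx context.Context, work func(context.Context) string) (string, error) {
	bgCtx := context.WithoutCancel(ctx) // keeps values, drops cancellation
	result := make(chan string, 1)      // buffered so the worker never blocks
	go func() { result <- work(bgCtx) }()
	select {
	case v := <-result:
		return v, nil
	case <-ctx.Done():
		return "", ctx.Err()
	}
}

// detachedSurvivesCancel shows that cancelling the parent leaves the
// detached context alive.
func detachedSurvivesCancel() bool {
	parent, cancel := context.WithCancel(context.Background())
	bg := context.WithoutCancel(parent)
	cancel()
	return parent.Err() != nil && bg.Err() == nil
}

func main() {
	v, err := fetch(context.Background(), func(context.Context) string {
		return "cached"
	})
	fmt.Println(v, err)
	fmt.Println("detached context survives cancel:", detachedSurvivesCancel())
}
```

The buffered result channel matters: if the handler has already returned on `ctx.Done()`, the worker can still deliver its result and exit instead of leaking.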
Chris Lu	f438cc3544	fix(volume_server): refuse ReceiveFile overwrite of mounted EC shard (#9184 ) (#9186 ) * test(volume_server): reproduce #9184 ReceiveFile truncating a mounted shard ReceiveFile for an EC shard calls os.Create(filePath) which opens the path with O_TRUNC. When the shard is already mounted, the in-memory EcVolume holds a file descriptor against the same inode, so a second ReceiveFile call for the same (volume, shard) truncates the live shard file beneath the reader. Reproducer: generate and mount shard 0 for a populated volume, capture the on-disk size, then send a smaller payload for the same shard via ReceiveFile. The current handler accepts the overwrite and leaves the shard truncated in place; this test pins that behavior. When the fix lands the server should reject (or rename-then-swap) and this test must be inverted. * fix(volume_server): refuse ReceiveFile overwrite of mounted EC shard ReceiveFile used os.Create on EC shard paths, which opens with O_TRUNC and truncates in place. When an EC shard is already mounted, the in-memory EcVolume holds file descriptors against the same inodes, so the truncation corrupts the live shard beneath any ongoing read. On retries of an EC task this produced the "missing parts" class of errors in #9184. The fix rejects any ReceiveFile for an EC volume that currently has mounted shards. The caller must unmount before retrying — silent truncation is never an acceptable outcome. Non-EC writes and ReceiveFile for volumes that have never been mounted on this server continue to work as before. Tests: - TestReceiveFileRejectsOverwriteOfMountedEcShard: mounts a shard, attempts an overwrite, asserts the error response and that the on-disk file and live reads are undisturbed. - TestReceiveFileAllowsEcShardWhenNoMount: pins the common-case contract that a first write to a target still succeeds. 
* fix(volume-rust): refuse ReceiveFile overwrite of mounted EC shard Mirror the Go-side change: reject receive_file for any EC volume that currently has mounted shards on this server. std::fs::File::create truncates in place and the in-memory EcVolume holds fds on the same inodes, so an overwrite would corrupt live readers.	2026-04-22 16:47:01 -07:00
Chris Lu	628363c4a6	fix(erasure_coding): surface replica delete failures from EC task (#9184 ) (#9187 ) * test(erasure_coding): reproduce #9184 deleteOriginalVolume swallowing errors ErasureCodingTask.deleteOriginalVolume logs a warning when any replica VolumeDelete fails and then returns nil, so the EC task reports success to the admin even when a source replica survives. That stale replica lets a later detection scan re-propose the same volume and, once retried, drives the mounted-shard-truncation corruption that issue 9184 also describes. Reproducer: wire one reachable replica (succeeds) and one unreachable replica (fails) and assert the function currently returns nil. After the fix the function must surface the replica failure so the task is retried rather than marked done, and this test needs to be inverted. * fix(erasure_coding): surface replica delete failures from EC task ErasureCodingTask.deleteOriginalVolume previously logged a warning and returned nil when any VolumeDelete against a source replica failed. The EC task therefore reported overall success to the admin even when a source replica stayed on disk, which let a later detection scan propose a duplicate EC encoding of the same volume. The retry then walked the ReceiveFile path against servers that already had mounted EC shards for the volume, truncating the live shard files in place (the other half of #9184). This change returns an error describing the per-replica failures after the best-effort delete pass, so the task is marked failed instead of silently moving on. Successful deletes are still applied (per-replica progress is preserved); only the final return changes. When combined with the ReceiveFile mount-safety check, a stuck original replica now produces loud, actionable failures instead of silent corruption. Tests: - TestDeleteOriginalVolumeSurfacesReplicaFailures: asserts an error is returned and names the unreachable replica, while the reachable replica still gets deleted. 
- TestDeleteOriginalVolumeSucceedsWhenAllReplicasReachable: pins the happy path.	2026-04-22 16:02:51 -07:00
Lars Lehtonen	8ae07e2a3f	chore(weed/filer/redis3): prune unused test functions (#9192)	2026-04-22 15:34:05 -07:00
Jon E Nesvold	c6302fcb54	feat(iam): allow caller-supplied AccessKeyId and SecretAccessKey in CreateAccessKey (#9172 ) * feat(iam): support caller-supplied AccessKeyId and SecretAccessKey in CreateAccessKey Both IAM implementations (standalone and embedded) now check for caller-supplied AccessKeyId and SecretAccessKey form parameters before generating random credentials. If provided, the caller-supplied values are used. If empty, random keys are generated as before. This enables programmatic identity provisioning where the caller needs to control the S3 credentials. Backward-compatible: no behavior change for callers that omit these parameters. * refactor(iam): extract shared caller-supplied credential validation Move the AccessKeyId/SecretAccessKey format checks and the in-memory collision scan into weed/iam so the standalone IAM API, the embedded IAM in s3api, and the admin dashboard all enforce the same rules. - ValidateCallerSuppliedAccessKeyId: 4-128 alphanumeric (rejects SigV4-breaking characters like '/' and '='). - ValidateCallerSuppliedSecretAccessKey: 8-128 chars. - FindAccessKeyOwner: scans identities and service accounts and returns the owning entity type + name for debug logging, without exposing the owner in caller-facing error messages. The admin dashboard previously only length-checked caller-supplied keys; it now enforces the same alphanumeric rule, which matches what SigV4 actually accepts anyway. * fix(iam): reject partial caller-supplied AccessKeyId/SecretAccessKey Previously, if a caller supplied only one of AccessKeyId or SecretAccessKey, CreateAccessKey logged a warning and auto-generated the missing half. That silently returns a credential the caller did not fully choose, which is surprising and easy to miss in a response they expected to echo back their input. Return ErrCodeInvalidInputException instead: either both are supplied or neither is. Updates the mixed-supply tests in weed/iamapi and weed/s3api to assert the rejection. 
* chore(iam): centralize and broaden sensitive form redaction DoActions and ExecuteAction both had an inline loop that redacted SecretAccessKey from their debug-level request log. Replace the two copies with iam.RedactSensitiveFormValues, backed by an explicit sensitive-keys set. The set now also covers Password, OldPassword, NewPassword, PrivateKey, and SessionToken. None of those parameters are used by today's IAM actions, but naming them here makes the log-safety guarantee survive future additions such as LoginProfile / STS. * test(iam): cover the upper length bound for CreateAccessKey TestCreateAccessKeyBoundary / TestEmbeddedIamCreateAccessKeyBoundary only exercised the 3/4-char lower edge. Add cases for 128 (accepted) and 129 (rejected) for AccessKeyId, plus 7 / 128 / 129-char cases for SecretAccessKey, so both ends of the validator are locked in at the handler level (the pure validators in weed/iam already cover this). * fix(s3api/iam): verify user existence before RNG and collision scan In the embedded IAM CreateAccessKey, the user lookup ran last: a request for a non-existent user still walked the whole identity / service-account list for collisions and, if no caller-supplied keys were present, generated fresh random credentials with crypto/rand before the NoSuchEntity error finally surfaced. Reorder: validate inputs, then find the target identity, then do the collision scan, then generate keys. A missing user now fails fast and consumes no entropy, and the handler returns NoSuchEntity instead of a misleading EntityAlreadyExists when both the user is missing and the supplied AccessKeyId happens to collide with another identity's key. Add TestEmbeddedIamCreateAccessKeyRejectsMissingUser to lock in the "no mutation on unknown user" guarantee. The standalone iamapi CreateAccessKey intentionally keeps its pre-existing "create-or-attach" semantics where a missing user is implicitly provisioned — that is a behavior change beyond the scope of this PR. 
* test(iam): tighten collision leak assertion and cover 8-char secret - Rename the collision-owner identity in TestCreateAccessKeyRejectsCollision (both iamapi and the embedded s3api test) from "existing" / "ExistingUser" to "ownerAlpha". The old assert.NotContains check was effectively a no-op because the error message never contained those substrings; a distinctive name shared with no part of the expected error body makes the leak guard actually meaningful if the wording ever drifts. The embedded test also adds a NotContains assertion that was previously missing entirely. - Add an explicit 8-char SecretAccessKey pass case to both boundary tests so the lower edge of the validator is locked in at the handler level alongside the 7 / 128 / 129-char cases. * fix(iamapi): enforce both-or-none before the collision lookup In the standalone IAM CreateAccessKey, FindAccessKeyOwner ran before the partial-credential check. If a caller supplied only AccessKeyId and it happened to collide with an existing key, the response was EntityAlreadyExists instead of the more fundamental InvalidInput for omitting SecretAccessKey — wrong error class, and leaked the fact that the probed key is already in use. Swap the order: validate both-or-none first, then do the collision scan. Matches the embedded IAM path and AWS behavior. Add a case to TestCreateAccessKeyRejectsPartialSupply that combines partial supply with a collision to lock in the ordering. * fix(admin): reject partial caller-supplied AccessKey/SecretKey The admin dashboard path silently generated the missing half when a caller supplied only one of AccessKey or SecretKey, while the IAM API and embedded IAM paths now reject this. Align the three: if exactly one is provided, return ErrInvalidInput. Also simplifies the generator block — either both are provided or neither is, so there is no mixed path to handle. 
* test(s3api/iam): guard dereferences in caller-supplied-keys test TestEmbeddedIamCreateAccessKeyWithCallerSuppliedKeys dereferenced AccessKeyId/SecretAccessKey/UserName and indexed Identities[0].Credentials[0] without first verifying shape, so any future regression that returns a partial response or skips the config mutation would panic mid-assertion instead of failing with a clear message. Add require.NotNil on the response pointers and require.Len on the identities/credentials slices before the asserts. test(iamapi): exercise the service-account branch of the collision check FindAccessKeyOwner scans both Identities[].Credentials and ServiceAccounts[].Credential, but TestCreateAccessKeyRejectsCollision only covered the identity branch. Split the test into two subtests — one per branch — so a future refactor that drops the service-account scan (or mutates the existing credential) trips a failure. Also asserts the existing service-account credential is not mutated and no credential is attached to the target identity on rejection. * test(iam): isolate 129-char secret subcase from prior credential In both TestCreateAccessKeyBoundary (iamapi) and TestEmbeddedIamCreateAccessKeyBoundary (s3api), the 129-char SecretAccessKey subcase reused the "validkey" AccessKeyId that the preceding 8-char subcase had just persisted into the config. The test still asserted the right outcome because the handler validates secret length before running the collision scan — but if the two checks ever swap, the subcase would pass (or fail) for the wrong reason. Reset the in-memory credentials before the 129-char subcase, matching the pattern already used by the 3/128/129-char AccessKeyId and 7-char secret subcases. No behavior change; purely test isolation. --------- Co-authored-by: Chris Lu <chris.lu@gmail.com>	2026-04-22 12:35:55 -07:00
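The validation rules listed in the entry above (4-128 alphanumeric AccessKeyId, 8-128 character SecretAccessKey, both or neither supplied, and both-or-none checked first) can be sketched as follows. The function names are hypothetical, "alphanumeric" is assumed to mean ASCII letters and digits, and the error values are placeholders rather than the project's error codes.

```go
package main

import (
	"errors"
	"fmt"
)

// validateAccessKeyID enforces 4-128 ASCII letters or digits (assumed meaning
// of "alphanumeric"), which rejects characters such as '/' and '='.
func validateAccessKeyID(id string) error {
	if len(id) < 4 || len(id) > 128 {
		return errors.New("AccessKeyId must be 4-128 characters")
	}
	for _, r := range id {
		alnum := (r >= 'a' && r <= 'z') || (r >= 'A' && r <= 'Z') || (r >= '0' && r <= '9')
		if !alnum {
			return fmt.Errorf("AccessKeyId contains invalid character %q", r)
		}
	}
	return nil
}

// validateSecretAccessKey enforces the 8-128 length bound.
func validateSecretAccessKey(secret string) error {
	if len(secret) < 8 || len(secret) > 128 {
		return errors.New("SecretAccessKey must be 8-128 characters")
	}
	return nil
}

// validateSupplied checks both-or-none before any other rule and reports
// whether caller-supplied credentials should be used.
func validateSupplied(id, secret string) (bool, error) {
	if id == "" && secret == "" {
		return false, nil // caller supplied nothing: generate random keys
	}
	if id == "" || secret == "" {
		return false, errors.New("AccessKeyId and SecretAccessKey must be supplied together")
	}
	if err := validateAccessKeyID(id); err != nil {
		return false, err
	}
	if err := validateSecretAccessKey(secret); err != nil {
		return false, err
	}
	return true, nil
}

func main() {
	fmt.Println(validateSupplied("", ""))
	fmt.Println(validateSupplied("validkey", ""))
	fmt.Println(validateSupplied("validkey", "12345678"))
}
```

Running the both-or-none check before any collision lookup mirrors the ordering the entry describes, so a partial request is rejected without revealing whether the probed key already exists.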
Lisandro Pin	fff243d463	Export gRPC `file_{read,write}_failures` metrics on volume servers. (#9177) Allows tracking overall R/W errors in real time through Prometheus. Will follow up with a PR for Seaweed's REST API. Co-authored-by: Lisandro Pin <lisandro.pin@proton.ch>	2026-04-22 11:22:21 -07:00
Chris Lu	cb882ced46	fix(test): retry ENOENT in fcntl lock subprocess helper TestPosixFileLocking/FcntlReleaseOnClose was flaky because the subprocess spawned by startLockHolder occasionally saw ENOENT when opening a file the parent had just created on the FUSE mount. Retry on ENOENT (matching the existing openWithRetry pattern used in testConcurrentLockContention) so the subprocess waits for the mount's dentry state to propagate before reporting the lock acquire as failed.	2026-04-22 10:33:21 -07:00
Chris Lu	c4e1885053	fix(ec): honor disk_id in ReceiveFile so EC shards respect admin placement (#9184 ) (#9185 ) * test(volume_server): reproduce #9184 EC ReceiveFile disk-placement bug The plugin-worker EC task sends shards via ReceiveFile, which picks Locations[0] as the target directory regardless of the admin planner's TargetDisk assignment. ReceiveFileInfo has no disk_id field, so there is no wire channel to honor the plan. Adds StartSingleVolumeClusterWithDataDirs to the integration framework so tests can launch a volume server with N data directories. The new repro asserts the current (buggy) behavior: sending three distinct EC shards via ReceiveFile leaves all three files in dir[0] and the other dirs empty. When the fix adds disk_id to ReceiveFileInfo, this assertion must flip to verify the planned placement is respected. * fix(ec): honor disk_id in ReceiveFile so EC shards respect admin placement Before this change, VolumeServer.ReceiveFile for EC shards always selected the first HDD location (Locations[0]). The plugin-worker EC task had no way to pass the admin planner's per-shard disk assignment — ReceiveFileInfo carried no disk_id field — so every received EC shard piled onto a single disk per destination server. On multi-disk servers this caused uneven load (one disk absorbing all EC shard I/O), frequent ENOSPC retries, and a growing EC backlog under sustained ingest (see issue #9184). Changes: - proto: add disk_id to ReceiveFileInfo, mirroring VolumeEcShardsCopyRequest.disk_id. - worker: DistributeEcShards tracks the planner-assigned disk per shard; sendShardFileToDestination forwards that disk id. Metadata files (ecx/ecj/vif) inherit the disk of the first data shard targeting the same node so they land next to the shards. 
- server: ReceiveFile honors disk_id when > 0 with bounds validation; disk_id=0 (unset) falls back to the same auto-selection pattern as VolumeEcShardsCopy (prefer disk that already has shards for this volume, then any HDD with free space, then any location with free space). Tests updated: - TestReceiveFileEcShardHonorsDiskID asserts three shards sent with disk_id={1,2,0} land on data dirs 1, 2, and 0 respectively. - TestReceiveFileEcShardRejectsInvalidDiskID pins the out-of-range disk_id rejection path. * fix(volume-rust): honor disk_id in ReceiveFile for EC shards Mirror the Go-side change: when disk_id > 0 place the EC shard on the requested disk; when unset, auto-select with the same preference order as volume_ec_shards_copy (disk already holding shards, then any HDD, then any disk). * fix(volume): compare disk_id as uint32 to avoid 32-bit overflow On 32-bit Go builds `int(fileInfo.DiskId) >= len(Locations)` can wrap a high-bit uint32 to a negative int, bypassing the bounds check before the index operation. Compare in the uint32 domain instead. * test(ec): fail invalid-disk_id test on transport error Previously a transport-level error from CloseAndRecv silently passed the test by returning early, masking any real gRPC failure. Fail loudly so only the structured ReceiveFileResponse rejection path counts as a pass. * docs(test): explain why DiskId=0 auto-selects dir 0 in EC placement test Documents the load-bearing assumption that shards are never mounted in this test, so loc.FindEcVolume always returns false and auto-select falls through to the first HDD. Saves future readers from re-deriving the expected directory for the DiskId=0 case. * fix(test): preserve baseDir/volume path for single-dir clusters StartSingleVolumeClusterWithDataDirs started naming the data directory volume0 even in the dataDirCount=1 case, which broke Scrub tests that reach into baseDir/volume via CorruptDatFile / CorruptEcShardFile / CorruptEcxFile. 
Keep the legacy name for single-dir clusters; only use the indexed "volumeN" layout when multiple disks are requested.	2026-04-22 10:30:13 -07:00
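The uint32-domain bounds check mentioned in the entry above can be sketched as below. `diskIDInRange` is a hypothetical helper covering only the range comparison, not the surrounding placement logic; the `int32` conversion in `main` simulates what `int(...)` would do on a platform where `int` is 32 bits wide.

```go
package main

import "fmt"

// diskIDInRange reports whether diskID indexes one of locationCount disks.
// The comparison stays in uint32, so a value with the high bit set cannot
// turn negative and slip under a signed "< len" check.
func diskIDInRange(diskID uint32, locationCount int) bool {
	return diskID < uint32(locationCount)
}

func main() {
	var high uint32 = 1 << 31
	// On a 32-bit build int(high) behaves like this int32 conversion.
	fmt.Println("as signed 32-bit:", int32(high))
	fmt.Println("signed check says in range:", int32(high) < 3)
	fmt.Println("uint32 check says in range:", diskIDInRange(high, 3))
	fmt.Println("disk 2 of 3 in range:", diskIDInRange(2, 3))
}
```

The signed comparison wrongly accepts the high-bit value because it has wrapped to a negative number, while the unsigned comparison rejects it.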
Chris Lu	0f5e99f423	fix(filer.meta.tail): fail fast when -es is used without elastic build tag (#9191 ) fix(filer.meta.tail): error instead of silently dropping events when -es is used without elastic build tag The default chrislusf/seaweedfs image builds without the `elastic` build tag, so sendToElasticSearchFunc was a no-op that returned a function discarding every event. Users passing -es saw the subscription wire up in filer logs but nothing ever reached Elasticsearch. Return an error explaining the binary wasn't built with ES support and pointing at the build flag. The caller already prints the error and exits, so users now get an immediate, actionable message. Fixes #9190	2026-04-22 09:44:43 -07:00
dependabot[bot]	1220468a33	build(deps): bump github.com/rclone/rclone from 1.73.1 to 1.73.5 in /test/kafka (#9189) build(deps): bump github.com/rclone/rclone in /test/kafka Bumps [github.com/rclone/rclone](https://github.com/rclone/rclone) from 1.73.1 to 1.73.5. - [Release notes](https://github.com/rclone/rclone/releases) - [Changelog](https://github.com/rclone/rclone/blob/master/RELEASE.md) - [Commits](https://github.com/rclone/rclone/compare/v1.73.1...v1.73.5) --- updated-dependencies: - dependency-name: github.com/rclone/rclone dependency-version: 1.73.5 dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-04-22 09:27:30 -07:00
Chris Lu	be9996962d	fix(test): avoid port collision between master gRPC and volume ports AllocateMiniPorts(1) reserved masterPort and masterPort+GrpcPortOffset by holding listeners open, but closed them on return. The subsequent AllocatePorts call bound 127.0.0.1:0, so the OS could immediately reuse the just-released mini gRPC port as a volume port — causing the volume server to fail at bind time with "address already in use". Introduce AllocatePortSet(miniCount, regularCount) that holds every listener open until the full set is chosen, and route the five volume test cluster builders through it.	2026-04-21 23:33:57 -07:00
Chris Lu	96e5fea08e	test(catalog_spark): bound weed shell invocation with 30s timeout createTableBucket ran `weed shell` via exec.Command with no deadline. When the shell's first command retries on a transient master connection blip, the trailing `exit` on stdin never gets processed and the subprocess blocks until the outer 20m `go test` timeout fires — the surfacing symptom is a flaky 20m panic with no diagnostic output. Wrap the invocation in exec.CommandContext with a 30s timeout, matching the existing pattern in test/s3tables/catalog_risingwave/setup_test.go.	2026-04-21 23:27:10 -07:00
Chris Lu	7f67995c24	chore(filer): remove -mount.p2p flag; registry is always on (#9183) The filer-side mount peer registry (tier 1 of peer chunk sharing) was gated behind -mount.p2p (default true). Idle cost is negligible: a tiny in-memory map plus a 60s sweeper, so the opt-out is not worth the surface area. Removes the flag from weed filer, weed server (-filer.mount.p2p), and weed mini, and always constructs the registry in NewFilerServer. Also drops the now-dead nil guards in MountRegister/MountList/sweeper and the TestMountRegister_DisabledIsNoOp case.	2026-04-21 23:00:11 -07:00
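
The port-collision fix in be9996962d depends on keeping every listener open until the whole port set has been chosen, so the OS cannot hand a just-released port back to a later allocation. A minimal sketch of that idea follows. The helper name `allocatePortSet` and its single `count` parameter are hypothetical and are not the repository's `AllocatePortSet(miniCount, regularCount)`, whose handling of the master port plus gRPC offset is not reproduced here.

```go
package main

import (
	"fmt"
	"net"
)

// allocatePortSet is a hypothetical helper, not the SeaweedFS implementation.
// It binds `count` ephemeral ports on 127.0.0.1 and keeps every listener open
// until all ports are chosen, so no port in the returned set can repeat.
func allocatePortSet(count int) ([]int, error) {
	listeners := make([]net.Listener, 0, count)
	// Release the listeners only after the full set has been picked.
	defer func() {
		for _, l := range listeners {
			l.Close()
		}
	}()

	ports := make([]int, 0, count)
	for i := 0; i < count; i++ {
		l, err := net.Listen("tcp", "127.0.0.1:0")
		if err != nil {
			return nil, err
		}
		listeners = append(listeners, l)
		ports = append(ports, l.Addr().(*net.TCPAddr).Port)
	}
	return ports, nil
}

func main() {
	ports, err := allocatePortSet(4)
	if err != nil {
		fmt.Println("allocation failed:", err)
		return
	}
	fmt.Println("reserved ports:", ports)
}
```

The ports are still released before the caller binds them, so another process could in principle take one in that window. The sketch only guarantees that ports within one returned set are distinct, which is the collision the commit message describes.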


13614 Commits