[LCC-6] Understanding the LangChain Retriever
2024. 8. 10. 13:39 LLM FullStacker/LangChain

Source: DeepLearning.ai

 

A Retriever is an interface for querying documents out of a VectorStore; it is created by calling the as_retriever() method on a VectorStore. Let's look at the retriever's two query methods (a minimal sketch follows the list below).

  - Semantic Similarity

     - k: number of the most relevant results to return

  - Maximum Marginal Relevance (MMR)

     - k: number of results to return

     - fetch_k: number of candidate documents fetched for the diversity calculation
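
A minimal sketch of both query methods through a retriever, assuming the `smalldb` Chroma store built in the usage example later in this post:

# Hedged sketch: the two query methods via as_retriever (assumes `smalldb` from below)
sim_retriever = smalldb.as_retriever(search_kwargs={"k": 3})  # default: semantic similarity
mmr_retriever = smalldb.as_retriever(search_type="mmr", search_kwargs={"k": 3, "fetch_k": 10})

sim_docs = sim_retriever.invoke("진심이라는 의미")  # List[Document]
mmr_docs = mmr_retriever.invoke("진심이라는 의미")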

 

Package

Semantic similarity search finds semantically similar documents. A retriever obtained from a VectorStore uses the VectorStore class's similarity_search(). similarity_search is an abstractmethod on VectorStore, so every concrete subclass must implement it; for example, Chroma inherits from VectorStore and provides an implementation.

  - Set k to the number of relevant documents to return.

class Chroma(VectorStore):
    def similarity_search(
        self,
        query: str,
        k: int = DEFAULT_K,
        filter: Optional[Dict[str, str]] = None,
        **kwargs: Any,
    ) -> List[Document]:
        """Run similarity search with Chroma.

        Args:
            query (str): Query text to search for.
            k (int): Number of results to return. Defaults to 4.
            filter (Optional[Dict[str, str]]): Filter by metadata. Defaults to None.

        Returns:
            List[Document]: List of documents most similar to the query text.
        """
        docs_and_scores = self.similarity_search_with_score(
            query, k, filter=filter, **kwargs
        )
        return [doc for doc, _ in docs_and_scores]
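
As the implementation shows, similarity_search delegates to similarity_search_with_score and drops the scores. When the scores matter, call the scored variant directly; a minimal sketch, assuming the `smalldb` store from the usage example below:

# Hedged sketch: documents together with their scores (assumes `smalldb` from below)
docs_and_scores = smalldb.similarity_search_with_score("진심이라는 의미", k=2)
for doc, score in docs_and_scores:
    # For Chroma the score is a distance, so lower means more similar.
    print(score, doc.page_content)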

 

MMR (Maximum Marginal Relevance) balances relevance and diversity in a single search. It uses the VectorStore's max_marginal_relevance_search(). Let's look at the Chroma implementation.

   - k: number of results to return

   - fetch_k: number of candidate documents fetched for the MMR algorithm (the diversity pool)

class Chroma(VectorStore):
    def max_marginal_relevance_search(
        self,
        query: str,
        k: int = DEFAULT_K,
        fetch_k: int = 20,
        lambda_mult: float = 0.5,
        filter: Optional[Dict[str, str]] = None,
        where_document: Optional[Dict[str, str]] = None,
        **kwargs: Any,
    ) -> List[Document]:
        """Return docs selected using the maximal marginal relevance.
        Maximal marginal relevance optimizes for similarity to query AND diversity
        among selected documents.

        Args:
            query: Text to look up documents similar to.
            k: Number of Documents to return. Defaults to 4.
            fetch_k: Number of Documents to fetch to pass to MMR algorithm.
            lambda_mult: Number between 0 and 1 that determines the degree
                        of diversity among the results with 0 corresponding
                        to maximum diversity and 1 to minimum diversity.
                        Defaults to 0.5.
            filter (Optional[Dict[str, str]]): Filter by metadata. Defaults to None.

        Returns:
            List of Documents selected by maximal marginal relevance.
        """
        if self._embedding_function is None:
            raise ValueError(
                "For MMR search, you must specify an embedding function on" "creation."
            )

        embedding = self._embedding_function.embed_query(query)
        docs = self.max_marginal_relevance_search_by_vector(
            embedding,
            k,
            fetch_k,
            lambda_mult=lambda_mult,
            filter=filter,
            where_document=where_document,
        )
        return docs
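
Per the docstring, lambda_mult (between 0 and 1) sets the relevance/diversity trade-off: values near 1 favor relevance, values near 0 favor diversity. A quick sketch, again assuming the `smalldb` store from the usage example below:

# Hedged sketch: tuning lambda_mult (assumes `smalldb` from the example below)
mostly_relevant = smalldb.max_marginal_relevance_search(
    "진심이라는 의미", k=2, fetch_k=3, lambda_mult=0.9
)
mostly_diverse = smalldb.max_marginal_relevance_search(
    "진심이라는 의미", k=2, fetch_k=3, lambda_mult=0.1
)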

 

Calling from_documents or from_texts on a VectorStore, passing the content and an embedding as parameters, returns a VectorStore object of type VST. On that object you can then call similarity_search and max_marginal_relevance_search.

class VectorStore(ABC):
    @classmethod
    @abstractmethod
    def from_texts(
        cls: Type[VST],
        texts: List[str],
        embedding: Embeddings,
        metadatas: Optional[List[dict]] = None,
        **kwargs: Any,
    ) -> VST:
        """Return VectorStore initialized from texts and embeddings.

        Args:
            texts: Texts to add to the vectorstore.
            embedding: Embedding function to use.
            metadatas: Optional list of metadatas associated with the texts.
                Default is None.
            **kwargs: Additional keyword arguments.

        Returns:
            VectorStore: VectorStore initialized from texts and embeddings.
        """

 

Let's look at a usage example of similarity_search and max_marginal_relevance_search.

from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

persistence_path = 'db/chroma'
embeddings = OpenAIEmbeddings()
# Opens (or creates) a store persisted on disk; shown for reference, not used below.
vectorstore = Chroma(persist_directory=persistence_path, embedding_function=embeddings)

texts = [
    """홍길동을 홍길동이라 부르지 못하는 이유는 무엇인가""",
    """심청이가 물에 빠진것을 효라고 말할 수 있는가, 그것이 진심이었나""",
    """춘향이가 이몽룡을 기다른 것은 진심이었나""",
]

smalldb = Chroma.from_texts(texts, embedding=embeddings)
question = "진심이라는 의미"

smalldb.similarity_search(question, k=1)
# Result:
# [Document(page_content='춘향이가 이몽룡을 기다른 것은 진심이었나', metadata={})]

smalldb.max_marginal_relevance_search(question, k=2, fetch_k=3)
# Result:
# [Document(page_content='춘향이가 이몽룡을 기다른 것은 진심이었나', metadata={}),
#  Document(page_content='심청이가 물에 빠진것을 효라고 말할 수 있는가, 그것이 진심이었나', metadata={})]

 

On top of these two search types (plus a score-threshold variant), as_retriever() on langchain_core's VectorStore provides a single query interface. Call from_documents or from_texts to obtain a VST (vector store type) instance, then call as_retriever() on it. as_retriever() returns a VectorStoreRetriever, which extends the BaseRetriever class defined in langchain_core's retrievers.py.

  - Since 0.1.46, get_relevant_documents is deprecated; use invoke instead.

  - Subclasses must implement _get_relevant_documents.

  - search_type

     - similarity

     - similarity_score_threshold: only results scoring at or above score_threshold are returned

     - mmr

  - Return value: List[Document] (see the usage sketch after the code below)

class VectorStoreRetriever(BaseRetriever):
    """Base Retriever class for VectorStore."""

    vectorstore: VectorStore
    """VectorStore to use for retrieval."""
    search_type: str = "similarity"
    """Type of search to perform. Defaults to "similarity"."""
    search_kwargs: dict = Field(default_factory=dict)
    """Keyword arguments to pass to the search function."""
    allowed_search_types: ClassVar[Collection[str]] = (
        "similarity",
        "similarity_score_threshold",
        "mmr",
    )
    ...
    
class BaseRetriever(RunnableSerializable[RetrieverInput, RetrieverOutput], ABC):
    def invoke(
        self, input: str, config: Optional[RunnableConfig] = None, **kwargs: Any
    ) -> List[Document]:
        """Invoke the retriever to get relevant documents.
        ...
        """
        ...
        
    
    # Called when a subclass is defined; cls is the subclass. Swaps a legacy
    # get_relevant_documents override into _get_relevant_documents.
    def __init_subclass__(cls, **kwargs: Any) -> None:
        super().__init_subclass__(**kwargs)
        # Version upgrade for old retrievers that implemented the public
        # methods directly.
        if cls.get_relevant_documents != BaseRetriever.get_relevant_documents:
            warnings.warn(
                "Retrievers must implement abstract `_get_relevant_documents` method"
                " instead of `get_relevant_documents`",
                DeprecationWarning,
            )
            swap = cls.get_relevant_documents
            cls.get_relevant_documents = (  # type: ignore[assignment]
                BaseRetriever.get_relevant_documents
            )
            cls._get_relevant_documents = swap  # type: ignore[assignment]
        if (
            hasattr(cls, "aget_relevant_documents")
            and cls.aget_relevant_documents != BaseRetriever.aget_relevant_documents
        ):
            warnings.warn(
                "Retrievers must implement abstract `_aget_relevant_documents` method"
                " instead of `aget_relevant_documents`",
                DeprecationWarning,
            )
            aswap = cls.aget_relevant_documents
            cls.aget_relevant_documents = (  # type: ignore[assignment]
                BaseRetriever.aget_relevant_documents
            )
            cls._aget_relevant_documents = aswap  # type: ignore[assignment]
        parameters = signature(cls._get_relevant_documents).parameters
        cls._new_arg_supported = parameters.get("run_manager") is not None
        # If a V1 retriever broke the interface and expects additional arguments
        cls._expects_other_args = (
            len(set(parameters.keys()) - {"self", "query", "run_manager"}) > 0
        )
    
    @abstractmethod
    def _get_relevant_documents(
        self, query: str, *, run_manager: CallbackManagerForRetrieverRun
    ) -> List[Document]:
        """Get documents relevant to a query.

        Args:
            query: String to find relevant documents for.
            run_manager: The callback handler to use.
        Returns:
            List of relevant documents.
        """
        ...
        
    @deprecated(since="0.1.46", alternative="invoke", removal="0.3.0")
    def get_relevant_documents(
       ....
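
Putting it together, a minimal usage sketch (assuming the `smalldb` store from the earlier example; search_kwargs are forwarded to the underlying search function):

# Hedged sketch: a score-threshold retriever (assumes `smalldb` from the example above)
retriever = smalldb.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"score_threshold": 0.5, "k": 2},
)
docs = retriever.invoke("진심이라는 의미")  # List[Document]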

 

 

API

langchain_core's BaseRetriever is extended by langchain_core's VectorStoreRetriever and by the various <name>Retriever implementations in langchain_community.

  - Call invoke or ainvoke (a short async sketch follows).

  - get_relevant_documents calls are replaced by invoke and are scheduled for removal in v0.3.0.
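
The async entry point mirrors the sync one; a minimal sketch, assuming the `retriever` built in the sketch above:

import asyncio

async def main():
    # Async counterpart of invoke (assumes `retriever` from the sketch above)
    docs = await retriever.ainvoke("진심이라는 의미")
    print(docs)

asyncio.run(main())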

 

<References>

- Official docs: https://python.langchain.com/v0.2/docs/integrations/retrievers/

 


- API: https://api.python.langchain.com/en/latest/community_api_reference.html#module-langchain_community.retrievers

- LangChain KR: https://wikidocs.net/234016 (01. VectorStore-backed Retriever)

 


 

posted by Peter Note
[LCC-5] Understanding the LangChain VectorStore
2024. 8. 9. 10:24 LLM FullStacker/LangChain

Embedding turns text into vectors of floats, which are stored in a Vector Store. A user's query is converted into a vector the same way, and a semantic search is run against the Vector Store.

Source: DeepLearning.ai

 

The most similar document chunks are then packed into the prompt and sent to the LLM.

Source: DeepLearning.ai
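
That whole flow, retrieve, stuff the hits into a prompt, then call the LLM, can be wired up with LCEL. A minimal sketch, assuming a populated Chroma store `db` (for example the one built in the package section below) and an OPENAI_API_KEY in the environment; the model name is illustrative:

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI

retriever = db.as_retriever(search_kwargs={"k": 4})
prompt = ChatPromptTemplate.from_template(
    "Answer using only this context:\n{context}\n\nQuestion: {question}"
)

def format_docs(docs):
    # Join the retrieved chunks into one context string
    return "\n\n".join(d.page_content for d in docs)

chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | ChatOpenAI(model="gpt-4o-mini")
    | StrOutputParser()
)
print(chain.invoke("What is Word2Vec?"))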

 

 

Package

The langchain.vectorstores package re-exports the modules of the langchain_community.vectorstores package.
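
So the following two imports should resolve to the same class; langchain_community is the canonical location in v0.2:

# The langchain path is a re-export of the langchain_community implementation (sketch)
from langchain.vectorstores import Chroma as ChromaA
from langchain_community.vectorstores import Chroma as ChromaB

assert ChromaA is ChromaB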

 

- Around 70 vector store integrations are provided: https://python.langchain.com/v0.2/docs/integrations/vectorstores/

 


 

- The most widely used include FAISS and Chroma.

 

The Chroma package is an interface that connects Chroma to LangChain, using the chromadb package internally.

- When creating the VectorStore you can specify "embedding_function" and "collection_name".

from chromadb import Client
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

# 1. Initialize OpenAI embeddings
openai_embedding = OpenAIEmbeddings()

# 2. Create the Chroma client
client = Client()

# 3. Create the Chroma vector store
vectorstore = Chroma(
    embedding_function=openai_embedding,     # use OpenAI embeddings
    collection_name="my_openai_collection",  # collection name
    client=client                            # Chroma client
)

# 4. Add data
documents = ["This is a test document.", "Another document for testing."]
vectorstore.add_texts(texts=documents)

# 5. Search with scores (Document itself carries no .score attribute)
query = "test"
results = vectorstore.similarity_search_with_score(query, k=2)

# 6. Print the results
for doc, score in results:
    print(f"Document: {doc.page_content}, Score: {score}")

 

Alternatively, create the store with the from_documents classmethod.

- SentenceTransformerEmbeddings is an alias of HuggingFaceEmbeddings and defaults to the "sentence-transformers/all-mpnet-base-v2" embedding model.

- The flow is load() --> split_documents() --> from_documents().

# import
from langchain_community.document_loaders import TextLoader
from langchain_community.embeddings.sentence_transformer import (
    SentenceTransformerEmbeddings,
)
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import CharacterTextSplitter

# Load the document.
loader = TextLoader("./data/appendix-keywords.txt")
documents = loader.load()

# Split the document into chunks.
text_splitter = CharacterTextSplitter(chunk_size=300, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

# Create the open-source embedding function.
stf_embeddings = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

# Load into Chroma.
db = Chroma.from_documents(docs, stf_embeddings)

# Query.
query = "What is Word2Vec?"
docs = db.similarity_search(query)

# Print the result.
print(docs[0].page_content)

 

API

A VectorStore driver implementation extends langchain_core's VectorStore.

 

- Extend VectorStore and implement __init__

class Chroma(VectorStore):
   def __init__(
        self,
        collection_name: str = _LANGCHAIN_DEFAULT_COLLECTION_NAME,
        embedding_function: Optional[Embeddings] = None,
        persist_directory: Optional[str] = None,
        client_settings: Optional[chromadb.config.Settings] = None,
        collection_metadata: Optional[Dict] = None,
        client: Optional[chromadb.Client] = None,
        relevance_score_fn: Optional[Callable[[float], float]] = None,
    ) -> None:
        """Initialize with a Chroma client."""
        try:
            import chromadb
            import chromadb.config
        except ImportError:
            raise ImportError(
                "Could not import chromadb python package. "
                "Please install it with `pip install chromadb`."
            )

        if client is not None:
            self._client_settings = client_settings
            self._client = client
            self._persist_directory = persist_directory
        else:
            if client_settings:
                # If client_settings is provided with persist_directory specified,
                # then it is "in-memory and persisting to disk" mode.
                client_settings.persist_directory = (
                    persist_directory or client_settings.persist_directory
                )
                if client_settings.persist_directory is not None:
                    # Maintain backwards compatibility with chromadb < 0.4.0
                    major, minor, _ = chromadb.__version__.split(".")
                    if int(major) == 0 and int(minor) < 4:
                        client_settings.chroma_db_impl = "duckdb+parquet"

                _client_settings = client_settings
            elif persist_directory:
                # Maintain backwards compatibility with chromadb < 0.4.0
                major, minor, _ = chromadb.__version__.split(".")
                if int(major) == 0 and int(minor) < 4:
                    _client_settings = chromadb.config.Settings(
                        chroma_db_impl="duckdb+parquet",
                    )
                else:
                    _client_settings = chromadb.config.Settings(is_persistent=True)
                _client_settings.persist_directory = persist_directory
            else:
                _client_settings = chromadb.config.Settings()
            self._client_settings = _client_settings
            self._client = chromadb.Client(_client_settings)
            self._persist_directory = (
                _client_settings.persist_directory or persist_directory
            )

        self._embedding_function = embedding_function
        self._collection = self._client.get_or_create_collection(
            name=collection_name,
            embedding_function=None,
            metadata=collection_metadata,
        )
        self.override_relevance_score_fn = relevance_score_fn
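
The __init__ above supports three construction modes: an external client, explicit client settings, or just a persist_directory. A hedged sketch of each, assuming `emb` is an Embeddings instance:

import chromadb
from langchain_community.vectorstores import Chroma

# 1) Pure in-memory store
db_memory = Chroma(embedding_function=emb)

# 2) In-memory, persisting to disk
db_persist = Chroma(embedding_function=emb, persist_directory="db/chroma")

# 3) Externally managed chromadb client
db_client = Chroma(embedding_function=emb, client=chromadb.Client())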

 

- from_documents is a classmethod and returns a vector store object.

   - documents: a list of split Documents

   - embedding: an embedding model instance

    @classmethod
    def from_documents(
        cls: Type[VST],
        documents: List[Document],
        embedding: Embeddings,
        **kwargs: Any,
    ) -> VST:
        """Return VectorStore initialized from documents and embeddings.

        Args:
            documents: List of Documents to add to the vectorstore.
            embedding: Embedding function to use.
            **kwargs: Additional keyword arguments.

        Returns:
            VectorStore: VectorStore initialized from documents and embeddings.
        """
        texts = [d.page_content for d in documents]
        metadatas = [d.metadata for d in documents]
        return cls.from_texts(texts, embedding, metadatas=metadatas, **kwargs)

Internally it calls the from_texts classmethod. from_texts is an abstractmethod that each vector store implements.

    @classmethod
    @abstractmethod
    def from_texts(
        cls: Type[VST],
        texts: List[str],
        embedding: Embeddings,
        metadatas: Optional[List[dict]] = None,
        **kwargs: Any,
    ) -> VST:
        """Return VectorStore initialized from texts and embeddings.

        Args:
            texts: Texts to add to the vectorstore.
            embedding: Embedding function to use.
            metadatas: Optional list of metadatas associated with the texts.
                Default is None.
            kwargs: Additional keyword arguments.

        Returns:
            VectorStore: VectorStore initialized from texts and embeddings.
        """

 

 

- similarity_search is an abstractmethod that each vector store must implement.

    @abstractmethod
    def similarity_search(
        self, query: str, k: int = 4, **kwargs: Any
    ) -> List[Document]:
        """Return docs most similar to query.

        Args:
            query: Input text.
            k: Number of Documents to return. Defaults to 4.
            **kwargs: Arguments to pass to the search method.

        Returns:
            List of Documents most similar to the query.
        """

 

- search selects the search type: "similarity", "similarity_score_threshold", or "mmr".

  - The return value is List[Document].

    def search(self, query: str, search_type: str, **kwargs: Any) -> List[Document]:
        """Return docs most similar to query using a specified search type.

        Args:
            query: Input text
            search_type: Type of search to perform. Can be "similarity",
                "mmr", or "similarity_score_threshold".
            **kwargs: Arguments to pass to the search method.

        Returns:
            List of Documents most similar to the query.

        Raises:
            ValueError: If search_type is not one of "similarity",
                "mmr", or "similarity_score_threshold".
        """
        if search_type == "similarity":
            return self.similarity_search(query, **kwargs)
        elif search_type == "similarity_score_threshold":
            docs_and_similarities = self.similarity_search_with_relevance_scores(
                query, **kwargs
            )
            return [doc for doc, _ in docs_and_similarities]
        elif search_type == "mmr":
            return self.max_marginal_relevance_search(query, **kwargs)
        else:
            raise ValueError(
                f"search_type of {search_type} not allowed. Expected "
                "search_type to be 'similarity', 'similarity_score_threshold'"
                " or 'mmr'."
            )
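
A quick sketch of dispatching through search, reusing the `db` store from the example above; extra kwargs are forwarded to the chosen search method:

# Hedged sketch: one query through each search_type (assumes `db` from above)
db.search("What is Word2Vec?", search_type="similarity", k=2)
db.search("What is Word2Vec?", search_type="similarity_score_threshold", score_threshold=0.5, k=2)
db.search("What is Word2Vec?", search_type="mmr", k=2, fetch_k=10)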

 

 

- as_retriever creates and returns a VectorStoreRetriever for the VectorStore.

def as_retriever(self, **kwargs: Any) -> VectorStoreRetriever:
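
For example, an MMR-backed retriever comes straight from the store; a minimal sketch (assumes `db` from above):

# Hedged sketch: an MMR retriever from the vector store (assumes `db` from above)
retriever = db.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 2, "fetch_k": 10},
)
docs = retriever.invoke("What is Word2Vec?")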

 

The Retriever is covered in detail in the next post.

 

 

<References>

- API: https://api.python.langchain.com/en/latest/community_api_reference.html#module-langchain_community.vectorstores

- Official docs: https://python.langchain.com/v0.2/docs/integrations/vectorstores/

 


- LangChain KR: https://wikidocs.net/234013 (01. VectorStore usage walkthrough)

 


 

posted by Peter Note