Posted By

ctran on 09/05/07


VietnameseAnalyzer.rb


Published in: Ruby
 

A Ferret analyzer that lowercases tokens and maps accented Vietnamese characters to their plain ASCII base letters, so text can be indexed and searched without diacritics.

require 'ferret'
require 'unicode'

# Normalizes token text to lower case using Unicode-aware case mapping.
class UnicodeLowerCaseFilter
  def initialize(token_stream)
    @input = token_stream
  end

  # Pass new text through to the wrapped token stream.
  def text=(text)
    @input.text = text
  end

  # Return the next token with its text lowercased, or nil at end of stream.
  def next
    t = @input.next
    return nil if t.nil?

    t.text = Unicode.downcase(t.text)
    t
  end
end

class VietnameseAnalyzer < Ferret::Analysis::Analyzer
  include Ferret::Analysis

  # Map each accented Vietnamese letter to its base letter
  # so only plain ASCII characters get indexed.
  CHARACTER_MAPPINGS = {
    ['á','à','ạ','ả','ã','ă','ắ','ằ','ặ','ẳ','ẵ','â','ấ','ầ','ậ','ẩ','ẫ'] => 'a',
    ['đ'] => 'd',
    ['é','è','ẹ','ẻ','ẽ','ê','ế','ề','ệ','ể','ễ'] => 'e',
    ['í','ì','ị','ỉ','ĩ'] => 'i',
    ['ó','ò','ọ','ỏ','õ','ơ','ớ','ờ','ợ','ở','ỡ','ô','ố','ồ','ộ','ổ','ỗ'] => 'o',
    ['ú','ù','ụ','ủ','ũ','ư','ứ','ừ','ự','ử','ữ'] => 'u',
    ['ý','ỳ','ỵ','ỷ','ỹ'] => 'y'
  } unless defined?(CHARACTER_MAPPINGS)

  # Tokenize, lowercase, then fold accented characters to ASCII.
  def token_stream(field, str)
    ts = StandardTokenizer.new(str)
    ts = UnicodeLowerCaseFilter.new(ts)
    MappingFilter.new(ts, CHARACTER_MAPPINGS)
  end
end
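
As a cross-check on the mapping table, here is a minimal sketch that folds Vietnamese text using only Ruby's standard library rather than Ferret's MappingFilter. It is not part of the original snippet: fold_vietnamese is a hypothetical helper name, and the sketch assumes Ruby 2.4 or later, where String#downcase is Unicode-aware.

```ruby
# Hypothetical helper (not from the original snippet): fold Vietnamese text
# to lowercase ASCII by decomposing each letter and dropping its marks.
def fold_vietnamese(text)
  text.downcase                 # Unicode-aware lowercasing (Ruby >= 2.4)
      .unicode_normalize(:nfd)  # split base letters from combining marks
      .gsub(/\p{Mn}/, '')       # remove the combining marks (tones, horn, breve)
      .tr('đ', 'd')             # 'đ' has no decomposition, so map it explicitly
end

puts fold_vietnamese("Tiếng Việt")  # prints "tieng viet"
```

Comparing its output with the tokens the analyzer produces for the same input may help spot a character missing from CHARACTER_MAPPINGS.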
