The "sentence detector" is for tokenizing (breaking text into words), not analysis.

The 'brute force' approach for filtering out non-English text is to scan for characters in higher Unicode ranges: if a code point is over 255, the text is probably not English. (Except maybe for currency symbols.)
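A minimal sketch of that brute-force check (the class and method names are made up for illustration, and I'm reading the currency caveat as "allow Unicode currency symbols through"):

public class Latin1Filter {
    // Any code point above 255 (outside Latin-1) marks the text as
    // non-English, with a carve-out for currency symbols.
    public static boolean looksEnglish(String text) {
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (cp > 255 && Character.getType(cp) != Character.CURRENCY_SYMBOL) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }
}

For example, looksEnglish("naïve café") returns true (ï and é are Latin-1), while looksEnglish("Привет") returns false. It is a crude heuristic: ASCII-only French or German will slip through, so a real language identifier is still needed if accuracy matters.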

What you're describing are semantically deep problems with a lot of semi-effective solutions. How deep do you want this analysis to be? How close to IBM Watson do you expect to get?

On 04/30/2013 06:43 AM, Sahar Ebadi wrote:
Hi all,

Let's say I have a text and I would like to detect only "good sentences". By
"good sentences" I mean sentences that 1) are grammatically complete,
2) have meaning, and 3) are in English.

As far as I can tell, the OpenNLP sentence detector only detects sentences
according to punctuation (and a list of acronyms it has), so there is
no guarantee that the detected sentences are real, complete, and meaningful.

Now my question is: is there any process in NLP that can help me to

1) find grammatically complete sentences?
2) determine whether a sentence has meaning or not?
3) filter out non-English texts?

Any suggestions or pointers to useful resources are highly appreciated!

Thanks.

