Re: Post link to Tika in Action book on Tika website?

2010-08-02 Thread Oleg Tikhonov
+1, positively. On Mon, Aug 2, 2010 at 8:33 PM, Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov wrote: Hi Tika community, Jukka Zitting and I are working on the Tika in Action book [1]. How would everyone feel about us posting a link to it on the Tika website [2]? If so, I'll

Re: [jira] Commented: (TIKA-492) Add language identification support for North Sami, Lule Sami and South Sami

2010-08-24 Thread Oleg Tikhonov
Hi Ken, I used Nutch's LanguageProfiler to produce the language profile. You can find more about this here: http://www.ibm.com/developerworks/opensource/tutorials/os-apache-tika/authors.html (It's not self-promotion!) Download the sources; using the ant task you'll be able to create lang

Re: Error thrown with TikaConfig() constructor

2010-09-12 Thread Oleg Tikhonov
There are situations I can think of where you would want to implement a customized classloader: 1. You need a different hierarchy to load classes, as in OSGi for instance. The Hollywood principle, if you like. 2. You need to run different versions of classes or jars. For example, you want to

Re: [jira] Commented: (TIKA-593) Tika network server

2011-02-08 Thread Oleg Tikhonov
Why not use: http://felix.apache.org/site/apache-felix-http-service.html On Tue, Feb 8, 2011 at 5:06 PM, Chris A. Mattmann (JIRA) j...@apache.org wrote: [

Re: [jira] [Commented] (TIKA-245) Support of CHM Format

2011-03-31 Thread Oleg Tikhonov
Hello Tran Nam Quang, It uses the CHMLIB C library, i.e. JNI. In my previous experience, it works on a limited number of OSes. It does not work on Solaris or AIX. A really good library, with the limitations mentioned above, is http://sevenzipjbind.sourceforge.net/ and also LGPL (I would say, the best

Re: [jira] [Commented] (TIKA-546) Add ability to create language profiles to tika-app

2011-04-14 Thread Oleg Tikhonov
Sami, Chris and I did that some time ago for the developerWorks tutorial; the clean code exists, although it may be out of date. I wonder, is it a good idea to use Nutch code inside Tika? Maybe the Nutch guys could extend it as an independent module? On Thu, Apr 14, 2011 at 3:01 PM, Sami Siren (JIRA)

Re: [jira] [Commented] (TIKA-245) Support of CHM Format

2011-06-07 Thread Oleg Tikhonov
Hi Chris, I've applied the patch to the tika-parsers/src/main/java/org/apache/tika/parser/chm, also added 3 chm files to the tika-parsers\src\test\resources\test-documents and the tests. BR, Oleg On Sun, Jun 5, 2011 at 1:32 AM, Chris A. Mattmann (JIRA) j...@apache.org wrote: [

Re: [jira] [Assigned] (TIKA-245) Support of CHM Format

2011-06-07 Thread Oleg Tikhonov
Thank you Chris and Jukka! I tried to keep the KISS principle, but couldn't. On Tue, Jun 7, 2011 at 6:49 PM, Chris A. Mattmann (JIRA) j...@apache.org wrote: [ https://issues.apache.org/jira/browse/TIKA-245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel] Chris A.

support of Java 5

2011-06-08 Thread Oleg Tikhonov
Hello all, As you may know, Oracle announced Java 5 SE EOL (End Of Life) as of 2009. However, we are still supporting Java 5 SE. What is the rationale behind this? Why do we encourage our customers not to upgrade to more modern version(s) of Java? Developing new products/features we cannot

Re: Build failed in Jenkins: Tika-trunk #563

2011-06-08 Thread Oleg Tikhonov
Chris, Nick, I've attached the patch, hope now it will work/compile. BR, Oleg On Wed, Jun 8, 2011 at 6:09 PM, Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov wrote: Hi Nick, Yep, didn't even catch it on commit. Oleg emailed me offlist (which I asked him to bring onlist) and caught

Re: Build failed in Jenkins: Tika-trunk #563

2011-06-08 Thread Oleg Tikhonov
Hi Jukka, no problem at all. I'll reformat and commit tomorrow then. BR, Oleg On Wed, Jun 8, 2011 at 9:56 PM, Jukka Zitting jukka.zitt...@gmail.com wrote: Hi Oleg, On Wed, Jun 8, 2011 at 8:20 PM, Oleg Tikhonov o...@apache.org wrote: I've attached the patch, hope now it will work/compile

Re: Build failed in Jenkins: Tika-trunk #564

2011-06-09 Thread Oleg Tikhonov
Jukka, Committed revision 1133955. BR, Oleg On Thu, Jun 9, 2011 at 11:52 AM, Jukka Zitting jukka.zitt...@gmail.com wrote: Hi, On Thu, Jun 9, 2011 at 4:20 AM, Apache Jenkins Server jenk...@builds.apache.org wrote: cause : Too many unapproved licenses: 4 The following files in

Re: Build failed in Jenkins: Tika-trunk #580

2011-07-18 Thread Oleg Tikhonov
Good evening, Which files fail the RAT scan? Thanks in advance, Oleg On Mon, Jul 18, 2011 at 10:55 PM, Apache Jenkins Server jenk...@builds.apache.org wrote: See https://builds.apache.org/job/Tika-trunk/580/changes Changes: [kkrugler] Add quick test to validate that

Re: [WARNING] Index corruption and crashes in Apache Lucene Core / Apache Solr with Java 7

2011-07-28 Thread Oleg Tikhonov
FYI, On Fri, Jul 29, 2011 at 12:13 AM, Uwe Schindler uschind...@apache.org wrote: Hello Apache Lucene & Apache Solr users, Hello users of other Java-based Apache projects, Oracle released Java 7 today. Unfortunately it contains hotspot compiler optimizations, which miscompile some loops. This

Re: http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException

2011-08-22 Thread Oleg Tikhonov
Hey, and welcome to Tika. If you use Eclipse, you had better install an Eclipse plug-in: http://m2eclipse.sonatype.org/sites/m2e Having downloaded and installed the plug-in, your next step could be importing the Tika project like this: '*File* -> *Import* -> *Existing Maven Project*' ... However, if

Re: [jira] [Created] (TIKA-699) Automatic checks against backwards-incompatible API changes

2011-08-26 Thread Oleg Tikhonov
I'm in favor, +1. On Fri, Aug 26, 2011 at 1:22 PM, Jukka Zitting (JIRA) j...@apache.org wrote: Automatic checks against backwards-incompatible API changes --- Key: TIKA-699 URL:

Re: Welcome Mike McCandless to the Tika PMC and as a Tika Committer

2011-08-29 Thread Oleg Tikhonov
Hi Mike! Congrats! I worked with the OmniFind edition at IBM Jerusalem :-) ..., I heard about you from my colleagues (Josemina, Yariv) and have now met you here! Welcome! On Mon, Aug 29, 2011 at 6:14 PM, Michael McCandless luc...@mikemccandless.com wrote: Thanks Chris! Here's a quick intro: I

Re: [jira] [Commented] (TIKA-546) Add ability to create language profiles to tika-app

2011-09-17 Thread Oleg Tikhonov
Yes, it's resolved, need to change the status. 2011/9/18 Jan Høydahl (JIRA) j...@apache.org [ https://issues.apache.org/jira/browse/TIKA-546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13107290#comment-13107290] Jan Høydahl commented on

Re: [VOTE] Apache Tika 0.10 release rc #1

2011-09-26 Thread Oleg Tikhonov
In favor of releasing the Tika 0.10, +1 On Mon, Sep 26, 2011 at 9:50 AM, Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov wrote: Hi Folks, A first release candidate for the Tika 0.10 release is available at: http://people.apache.org/~mattmann/apache-tika-0.10/rc1/ The release

Re: [jira] [Commented] (TIKA-513) Support of Deja Vu (DjVu) format

2011-10-07 Thread Oleg Tikhonov
(DjVu) format Key: TIKA-513 URL: https://issues.apache.org/jira/browse/TIKA-513 Project: Tika Issue Type: New Feature Components: parser Reporter: Oleg Tikhonov It might

Re: [jira] [Commented] (TIKA-245) Support of CHM Format

2011-10-22 Thread Oleg Tikhonov
Hi Tran Nam Quang, Currently our CHM extractor skips all entities that are not HTML. It would be great if you could write a list of desired entities to be extracted. In addition, if you can, please attach the CHM files you're working with. BR, Oleg On Sat, Oct 22, 2011 at 8:08 AM, Tran Nam

Re: location of pdfbox in sources of Tika

2011-10-31 Thread Oleg Tikhonov
Hi Ahmad, I hope you built pdfbox using Maven, i.e. by running mvn clean install. If so, the new pdfbox jar file is located in the .m2 local repository. In addition, please find the pom.xml under ../tika-parsers and change the following: <dependency> <groupId>org.apache.pdfbox</groupId>

Re: location of pdfbox in sources of Tika

2011-11-01 Thread Oleg Tikhonov
, Oleg Tikhonov o...@apache.org wrote: Hi Ahmad, I hope you built pdfbox using a maven, i.e. running mvn clean install. If so, a new pdfbox jar file is located in the .m2 local repository. In addition, please find a pom.xml under ../tika-parsers and change the following: dependency

Re: [jira] [Commented] (TIKA-855) Language Detection not working for Japanese and Chinese.

2012-02-01 Thread Oleg Tikhonov
For Chinese we need to create/get two profiles: Chinese Traditional and Chinese Simplified. Oleg On Thu, Feb 2, 2012 at 6:13 AM, James Sullivan (Commented) (JIRA) j...@apache.org wrote: [

Re: [VOTE] Apache Tika 1.1 release rc #1

2012-03-08 Thread Oleg Tikhonov
Here is my +1, this time tested only on Windows 7 x86-64 PE. BR, Oleg On Thu, Mar 8, 2012 at 5:11 PM, Alex Ott alex...@gmail.com wrote: +1 unpacked sources, compiled, tests passed. compiled tika-app works correctly. separately downloaded tika-app-1.1.jar also works correctly for me The

Re: [VOTE] Apache Tika 1.2 release rc #1

2012-07-11 Thread Oleg Tikhonov
Hi, here is my +1. Kind regards, Oleg On Thu, Jul 12, 2012 at 2:48 AM, Jukka Zitting jukka.zitt...@gmail.com wrote: Hi, On Wed, Jul 11, 2012 at 4:27 PM, Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov wrote: On Jul 11, 2012, at 6:43 AM, Michael McCandless wrote: Why are there

Re: [VOTE] Graduate Apache Any23 from the Apache Incubator

2012-08-06 Thread Oleg Tikhonov
Hi Guys, +1 for the graduation. Keep going ! KR, Oleg On Mon, Aug 6, 2012 at 11:44 PM, Dave Meikle loo...@gmail.com wrote: Hi, On 3 Aug 2012, at 18:50, Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov wrote: ... I'm now going to call for a community VOTE (before heading to the

Re: [jira] [Commented] (TIKA-93) OCR support

2012-11-06 Thread Oleg Tikhonov
Hey, I tried to look up the distribution but could not find the sources; in binaries they provide only the Nokia distribution. It would be nice if you could play with it and share your impression(s). BR, Oleg On Wed, Nov 7, 2012 at 2:52 AM, Pei Chen (JIRA) j...@apache.org wrote: [

Re: [jira] [Commented] (TIKA-1041) Tika 1.2 universalcharset errors

2012-12-12 Thread Oleg Tikhonov
Hi David, in the same folder, say /home/tika/, where you run 'mvn clean install', just run the following command: mvn dependency:list It will print out all the jars which the project depends on. Hope it helps. On Wed, Dec 12, 2012 at 3:35 PM, David Morana (JIRA) j...@apache.org wrote:

Re: [jira] [Commented] (TIKA-1041) Tika 1.2 universalcharset errors

2012-12-12 Thread Oleg Tikhonov
David, does it fail on some particular file or always, no matter what the input is? POI hints that there is an illegal offset, which is probably the cause of the error. --Oleg On Wed, Dec 12, 2012 at 4:31 PM, David Morana (JIRA) j...@apache.org wrote: [

Re: [jira] [Updated] (TIKA-1048) XMLParser should add whitespace between elements

2012-12-20 Thread Oleg Tikhonov
Hi Mike, Maybe consider using UIMA (the rule engine)? BR, Oleg On Thu, Dec 20, 2012 at 1:05 PM, Michael McCandless (JIRA) j...@apache.org wrote: [ https://issues.apache.org/jira/browse/TIKA-1048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel] Michael

Re: [jira] [Commented] (TIKA-93) OCR support

2013-01-04 Thread Oleg Tikhonov
I've tried without success. There is more to it than it seems. JavaOCR is not an option in its current state. A temporary solution could be a wrapper around Tesseract; however, making Tesseract work on multiple platforms is still quite difficult. Best regards, Oleg On Fri, Jan 4, 2013 at 3:46 PM, Maciej

Re: [jira] [Commented] (TIKA-93) OCR support

2013-01-14 Thread Oleg Tikhonov
From the DejaVu (particular case) point of view, a possible flow is as follows: 1. Extract images 2. For each image, extract text using OCR 2.1 Detect language 2.2 Detect font type. So language and font type may be used for providing metadata. I think it should be as seamless as possible.
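The flow above could be sketched roughly as follows in Java. This is only an illustration under stated assumptions: `OcrFlowSketch`, `ImageText`, and the four pluggable functions are hypothetical names, not Tika API, and the image extractor, OCR engine, and detectors are passed in as plain functions so the sketch does not depend on any real library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Hypothetical sketch of the per-image OCR flow; not Tika API. */
final class OcrFlowSketch {

    /** Text recognized in one image plus the metadata from steps 2.1 and 2.2. */
    static final class ImageText {
        final String text;
        final String language;
        final String fontType;

        ImageText(String text, String language, String fontType) {
            this.text = text;
            this.language = language;
            this.fontType = fontType;
        }
    }

    static List<ImageText> process(
            byte[] document,
            Function<byte[], List<byte[]>> imageExtractor,
            Function<byte[], String> ocr,
            Function<String, String> languageDetector,
            Function<byte[], String> fontDetector) {
        List<ImageText> results = new ArrayList<>();
        for (byte[] image : imageExtractor.apply(document)) {  // 1. extract images
            String text = ocr.apply(image);                     // 2. OCR each image
            String language = languageDetector.apply(text);     // 2.1 detect language
            String fontType = fontDetector.apply(image);        // 2.2 detect font type
            results.add(new ImageText(text, language, fontType));
        }
        return results;
    }

    /** Appends the per-image text in order, one image per line. */
    static String joinText(List<ImageText> results) {
        StringBuilder sb = new StringBuilder();
        for (ImageText r : results) {
            if (sb.length() > 0) {
                sb.append('\n');
            }
            sb.append(r.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stub implementations, only to show how the pieces are wired together.
        List<ImageText> out = process(
                new byte[] {0},
                doc -> java.util.Arrays.asList(new byte[] {1}, new byte[] {2, 3}),
                img -> "image of " + img.length + " byte(s)",
                text -> "en",
                img -> "unknown");
        System.out.println(joinText(out));
    }
}
```

Keeping language and font type next to the recognized text is what would let a caller expose them as metadata, as the message suggests.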

Re: [VOTE] Apache Tika 1.3 Release Candidate #1

2013-01-19 Thread Oleg Tikhonov
Hey Dave, Could not test on other systems than Windows 7 x64. All tests passed successfully ! [x] +1 Release this package as Apache Tika 1.3 BR, Oleg On Sat, Jan 19, 2013 at 6:30 AM, Dave Meikle loo...@gmail.com wrote: http://svn.apache.org/repos/asf/tika/tags/tika-1.3/

Re: [DISCUSS] Should Tika require Java6? (was Re: Build failed in Jenkins: Tika-trunk #977)

2013-02-08 Thread Oleg Tikhonov
Back to the future. Aha moment !!! Here is my +1. According to Oracle: "In February 2011 Oracle announced the End of Public Updates for their Java SE 6 products for July 2012. In February 2012 Oracle extended the End of Public Updates for 4 months, to November 2012." Oleg On Fri, Feb 8, 2013

Re: [jira] [Commented] (TIKA-245) Support of CHM Format

2013-03-05 Thread Oleg Tikhonov
Tika chm support has its limitations, can you provide such file(s) for further investigation ? BR, Oleg On Wed, Mar 6, 2013 at 1:10 AM, Tejas Patil (JIRA) j...@apache.org wrote: [

Re: [VOTE] Apache TIka 1.4 Release Candidate #1

2013-06-16 Thread Oleg Tikhonov
In favor, [x] +1 Release this package as Apache Tika 1.4. Tested on Linux ubuntu 3.8.0-23-generic x64. Maybe we have to update some dependencies. I also ran code coverage using the Maven plugin cobertura. BR, Oleg Here is a link to the code coverage report and dependency updates (available @dev).

Re: [VOTE] Apache TIka 1.4 Release Candidate #1

2013-06-16 Thread Oleg Tikhonov
I tried to send some comments about the release candidate but got a delivery failure error. Am I off the list? BR, Oleg On Sun, Jun 16, 2013 at 9:07 PM, Chris Mattmann mattm...@apache.org wrote: Ouch, just saw this. Oliver, I'm happy to commit the updated patch to the trunk but do you

Re: [VOTE] Apache TIka 1.4 Release Candidate #2

2013-06-17 Thread Oleg Tikhonov
Hey, All tests passed on the following platforms: 1. Linux ubuntu 3.8.0-25-generic x86_64 Ubuntu 13.04 2. Microsoft Windows 7 Enterprise, x64-based PC Please have a look: https://drive.google.com/?tab=mo&authuser=0#folders/0B_DmgPkneiMgOFg2ZXBsOTZkRHc There are two files, one of them contains list

Re: [jira] [Updated] (TIKA-1152) Process stucks on parsing of a CHM file

2013-07-23 Thread Oleg Tikhonov
Hi, can you attach the problematic file ? Thanks. On Tue, Jul 23, 2013 at 4:46 PM, Hong-Thai Nguyen (JIRA) j...@apache.org wrote: [ https://issues.apache.org/jira/browse/TIKA-1152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel] Hong-Thai Nguyen updated TIKA-1152:

Re: [jira] [Comment Edited] (TIKA-1152) Process loops infinitely on parsing of a CHM file

2013-07-29 Thread Oleg Tikhonov
Thanks ! BR, Oleg On Mon, Jul 29, 2013 at 4:47 PM, Hong-Thai Nguyen (JIRA) j...@apache.org wrote: [ https://issues.apache.org/jira/browse/TIKA-1152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13716538#comment-13716538] Hong-Thai Nguyen edited

Re: Apache Tika for Android

2013-08-29 Thread Oleg Tikhonov
Hi, Vasily, Welcome aboard ! Just keep in mind, Tika is written in Java, so it can run on any JVM which supports that. For starters please refer to: http://tika.apache.org/1.4/gettingstarted.html Generally, Tika supports extracting most known types, including PDFs. Apache Tika is Apache Software

Re: Apache tika installation issue

2013-09-27 Thread Oleg Tikhonov
Hi, if you meant how to import the Tika project, then here are the steps: 1. In Eclipse: File -> Import ... 2. Choose Existing Maven Project, click Next; 3. Point to the Tika project by clicking the Browse button, say tika-core; 4. Next, click Finish. That's it. Hope it helps. BR, Oleg On Fri, Sep

Re: Having Problem in Word Count and Language Detaction

2013-10-26 Thread Oleg Tikhonov
Hi Animesh, my wild guess is that the N-gram profile for Chinese wasn't trained very well. Try recreating the Chinese language profile. Have a look here: http://www.ibm.com/developerworks/opensource/tutorials/os-apache-tika/section6.html Hope it helps. On Sat, Oct 26, 2013 at 8:48 PM, Chris Mattmann
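To illustrate why the quality of the training text matters for an N-gram profile, here is a toy sketch in Java. It is not how Tika builds or scores its language profiles; the class name `NGramSketch`, the trigram size, and the cosine-similarity scoring are all assumptions made for this example.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Toy character n-gram language detector; a sketch, not Tika's implementation. */
final class NGramSketch {
    private static final int N = 3; // trigram size, an assumption for this sketch

    // One n-gram count profile per trained language, in training order.
    private final Map<String, Map<String, Integer>> profiles = new LinkedHashMap<>();

    /** Counts overlapping character n-grams in the lower-cased text. */
    static Map<String, Integer> profile(String text) {
        String s = text.toLowerCase(Locale.ROOT);
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + N <= s.length(); i++) {
            counts.merge(s.substring(i, i + N), 1, Integer::sum);
        }
        return counts;
    }

    /** Cosine similarity between two count profiles; 0.0 if either is empty. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            int x = e.getValue();
            normA += (double) x * x;
            Integer y = b.get(e.getKey());
            if (y != null) {
                dot += (double) x * y;
            }
        }
        for (int y : b.values()) {
            normB += (double) y * y;
        }
        return (normA == 0 || normB == 0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }

    /** Builds (or rebuilds) the profile for one language from training text. */
    void train(String language, String trainingText) {
        profiles.put(language, profile(trainingText));
    }

    /** Returns the best-matching trained language, or null if nothing overlaps. */
    String detect(String text) {
        Map<String, Integer> target = profile(text);
        String best = null;
        double bestScore = 0.0;
        for (Map.Entry<String, Map<String, Integer>> e : profiles.entrySet()) {
            double score = cosine(target, e.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        NGramSketch detector = new NGramSketch();
        detector.train("en", "the quick brown fox jumps over the lazy dog "
                + "and then the dog sleeps in the house");
        detector.train("de", "der schnelle braune fuchs springt ueber den faulen hund "
                + "und dann schlaeft der hund im haus");
        System.out.println(detector.detect("the dog is in the house"));
        System.out.println(detector.detect("der hund schlaeft im haus"));
    }
}
```

With profiles this small, a text whose n-grams barely appear in the training data scores near zero for every language, which is the kind of failure a poorly trained profile would produce; retraining on a larger, representative corpus is the usual remedy.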

Re: Having Problem in Word Count and Language Detaction

2013-10-26 Thread Oleg Tikhonov
This one is better https://issues.apache.org/jira/browse/TIKA-546 On Sat, Oct 26, 2013 at 10:05 PM, Oleg Tikhonov o...@apache.org wrote: Hi Animesh, my wild guess is that N-gram profile for Chinese wasn't trained pretty well. Try recreate Chinese language profile. Have a look here: http

Re: NonSequentialPDFParser

2013-12-02 Thread Oleg Tikhonov
Think, we must. +1 for such improvement. BR, Oleg On Mon, Dec 2, 2013 at 4:17 PM, Hong-Thai Nguyen hong-thai.ngu...@polyspot.com wrote: Hi all, NonSequentialPDFParser may increase 45% parsing performance on PDF extraction. Should we integrate in Tika ?

Re: Switch to JUnit 4.x?

2013-12-14 Thread Oleg Tikhonov
Hi Ken, not at all. +1 - go for it! BR, Oleg On Sun, Dec 15, 2013 at 1:39 AM, Ken Krugler kkrugler_li...@transpac.com wrote: Hi all, See https://issues.apache.org/jira/browse/TIKA-1209 Any objections to switching to JUnit 4.11? -- Ken -- Ken Krugler +1

Re: [jira] [Commented] (TIKA-93) OCR support

2013-12-24 Thread Oleg Tikhonov
Hi Frank, It's not so easy, especially with a dependency on native libraries. It also depends on trained profiles, languages, and fonts. The questions are: which platforms do we want to support, and which languages and fonts? BR, Oleg On Tue, Dec 24, 2013 at 9:48 AM, frank (JIRA) j...@apache.org

Re: [VOTE] Apache Tika 1.5 RC1

2014-02-04 Thread Oleg Tikhonov
Hi David, [x] +1 Release this package as Apache Tika 1.5 Thanks! BR, Oleg On Wed, Feb 5, 2014 at 3:59 AM, David Meikle loo...@gmail.com wrote: Hi Guys, A candidate for the Tika 1.5 release is now available at: http://people.apache.org/~dmeikle/tika-1.5-rc1/ The release candidate is a

Re: [jira] [Commented] (TIKA-93) OCR support

2014-02-08 Thread Oleg Tikhonov
Hi Grant, what you're doing seems great. I've checked Tess4j (http://tess4j.sourceforge.net/); it is released and distributed under the Apache License, v2.0 (http://www.apache.org/licenses/LICENSE-2.0.html). Hope it helps. BR, Oleg On Sat, Feb 8, 2014 at 1:14 PM, Grant Ingersoll (JIRA)

Re: [jira] [Commented] (TIKA-93) OCR support

2014-02-08 Thread Oleg Tikhonov
Hi, There is another code coverage Maven plug-in, called cobertura. If you run *mvn clean install cobertura:cobertura* there is no need to put it in the pom. Hope it helps. On Sat, Feb 8, 2014 at 10:17 PM, Grant Ingersoll (JIRA) j...@apache.org wrote: [

Re: [jira] [Commented] (TIKA-93) OCR support

2014-02-10 Thread Oleg Tikhonov
@Timo, On the other hand, this Parser can serve as a Composite for more complicated parsers. For example, with DejaVu, you can extract the images, parse them one by one, and afterwards just append the extracted text. BR, Oleg On Mon, Feb 10, 2014 at 11:09 AM, Timo Boehme (JIRA) j...@apache.org wrote:

Re: Searching for Tika Jira issues using Lucene

2014-03-05 Thread Oleg Tikhonov
Hi Mike! Sounds great! Thanks. Oleg On Wed, Mar 5, 2014 at 6:47 PM, Michael McCandless luc...@mikemccandless.com wrote: Team, If you want to search for Tika Jira issues, I just added Tika coverage into the Lucene dog food server we use for finding Lucene/Solr issues at

Re: [jira] [Updated] (TIKA-1276) Missing embedded dependencies in tika-bundle

2014-04-28 Thread Oleg Tikhonov
Hi Rupert, agree about javax.servlet;resolution:=optional, javax.servlet.http;resolution:=optional, Will check it out tomorrow. Thanks !!! On Mon, Apr 28, 2014 at 4:44 PM, Rupert Westenthaler (JIRA) j...@apache.org wrote: [

Re: [jira] [Commented] (TIKA-1276) Missing embedded dependencies in tika-bundle

2014-04-29 Thread Oleg Tikhonov
No problem. Will test it. On Tue, Apr 29, 2014 at 3:43 PM, Rupert Westenthaler (JIRA) j...@apache.org wrote: [ https://issues.apache.org/jira/browse/TIKA-1276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984251#comment-13984251] Rupert

Re: [jira] [Commented] (TIKA-93) OCR support

2014-05-29 Thread Oleg Tikhonov
Guys, Tesseract is by itself a project that is written in C/C++ and has to be compiled differently for each platform. Personally, I would make it a requirement for those who want to work with Tesseract. I'm not sure that putting Tesseract in the sources is the right way to go. How good Tesseract is - depends

Re: Stack Overflow Question

2014-06-30 Thread Oleg Tikhonov
Hi, Please have a look at provided code: [code] Parser parser = new AutoDetectParser(); // Should auto-detect! ContentHandler handler = new BodyContentHandler(); Metadata metadata = new Metadata(); InputStream stream = ZipParserTest.class.getResourceAsStream(

Re: [VOTE] Apache Tika 1.6 release candidate #1

2014-07-27 Thread Oleg Tikhonov
[x] +1 Release this package as Apache Tika 1.6. Tested on the following systems: 1. Microsoft Windows 7 Enterprise, SP 1, x64-based PC 2. Linux ubuntu 3.11.0-24-generic #42-Ubuntu SMP x86_64 GNU/Linux Thanks, Oleg On Mon, Jul 28, 2014 at 7:22 AM, Mattmann, Chris A (3980)

Re: [jira] [Created] (TIKA-1405) German content detected as French

2014-08-30 Thread Oleg Tikhonov
Hi, does the content contain only one language or is it mixed? If the text contains a single language, then something seems wrong in our language profiles. If it is mixed, then it's kind of OK: the first language detected will be the answer. What is the size of the content? One word or a bunch of text? Basically to

Re: 1.7 release?

2014-10-20 Thread Oleg Tikhonov
Hi, I can try this on. What is a trunk? Thanks, Oleg On Tue, Oct 21, 2014 at 6:21 AM, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote: Hmm any idea why this is failing on Windows? Tyler P. and I were talking the other day - maybe we shouldn't run the tests from TIKA-1422

Re: 1.7 release?

2014-10-21 Thread Oleg Tikhonov
/ ++ Adjunct Associate Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++ -Original Message- From: Oleg Tikhonov olegtikho...@gmail.com Reply-To: dev

Re: 1.7 release?

2014-10-21 Thread Oleg Tikhonov
Please give the newest patch a try. Cheers, Oleg On Tue, Oct 21, 2014 at 9:08 AM, Oleg Tikhonov olegtikho...@gmail.com wrote: Taken. Thanks. in progress ... On Tue, Oct 21, 2014 at 8:54 AM, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote: Trunk is the current checkout/branch

Re: 1.7 release?

2014-10-21 Thread Oleg Tikhonov
, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++ -Original Message- From: Oleg Tikhonov o...@apache.org Reply-To: dev@tika.apache.org dev@tika.apache.org Date: Monday, October

Re: 1.7 release?

2014-10-24 Thread Oleg Tikhonov
AM, Oleg Tikhonov olegtikho...@gmail.com wrote: Sorry!!! On Tue, Oct 21, 2014 at 9:37 AM, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote: Thanks Oleg, will try tomorrow for me Los angeles time

Re: [jira] [Created] (TIKA-1543) TesseractOCRParser.setTesseractPath() doesn't work on Linux

2015-02-06 Thread Oleg Tikhonov
Hi, Just one guess. Did you check the permissions? Does it have executable permission? Br, Oleg On 6 Feb 2015 12:15, Sean Zhao (JIRA) j...@apache.org wrote: Sean Zhao created TIKA-1543: --- Summary: TesseractOCRParser.setTesseractPath() doesn't work
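The permission check suggested above can also be done from Java before the path is handed to any parser. The snippet below is a minimal sketch using only the JDK; the class name `ExecutableCheck` and the default path `/usr/bin/tesseract` are assumptions for illustration, and nothing here is Tika API.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Minimal sketch: is the given path a regular file the current user may execute? */
final class ExecutableCheck {

    static boolean isRunnable(String path) {
        Path p = Paths.get(path);
        // A directory can carry the execute bit too, so require a regular file.
        return Files.isRegularFile(p) && Files.isExecutable(p);
    }

    public static void main(String[] args) {
        // Hypothetical default location; pass the real binary path as an argument.
        String candidate = args.length > 0 ? args[0] : "/usr/bin/tesseract";
        System.out.println(candidate + " runnable: " + isRunnable(candidate));
    }
}
```

If this reports `false` for a file that exists, the missing execute bit (or a restrictive parent directory) is a likely culprit and can be checked with `ls -l` on Linux.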

Re: [DISCUSS] Tika 1.8 or 1.7.1

2015-03-29 Thread Oleg Tikhonov
+1 for 1.8 release. On 29 Mar 2015 02:04, Konstantin Gribov gros...@gmail.com wrote: Also, I think, we should resolve TIKA-1575 (upgrade to pdfbox 1.8.9) since pdfbox 1.8.8 hangs on some pdf forms. -- Best regards, Konstantin Gribov сб, 28 марта 2015 г. в 23:22, Konstantin Gribov

Re: [jira] [Closed] (TIKA-993) Language Detection Fault

2015-03-02 Thread Oleg Tikhonov
Hi, Just for the record ... It can happen if a file contains content that is written in at least two different languages. For instance, the first half of the file is, say, German and the second half, say ... French. In such a case detection would be faulty. Br, Oleg On 3 Mar 2015 04:03, Tyler Palsulich

Re: [jira] [Closed] (TIKA-993) Language Detection Fault

2015-03-03 Thread Oleg Tikhonov
, What do you mean, the detection is faulty? What is the expected result in that case? Thanks, Tyler On Mar 3, 2015 1:10 AM, Oleg Tikhonov o...@apache.org wrote: Hi, Just for the record ... It can happen if a file contains context that at least written in two different languages

Re: trunk test failure

2015-03-26 Thread Oleg Tikhonov
Hi Chris, just to confirm: [INFO] [INFO] Reactor Summary: [INFO] [INFO] Apache Tika parent . SUCCESS [ 9.268 s] [INFO] Apache Tika core ... SUCCESS [ 25.823 s]

Re: TIKA-1423 Build a parser to extract data from GRIB formats not good with Java 6

2015-01-30 Thread Oleg Tikhonov
Hi there, +1 for dropping. On 30 Jan 2015 05:05, Tyler Palsulich tpalsul...@gmail.com wrote: +1 Tyler On Jan 29, 2015 9:52 PM, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote: +1 move to 1.7 Sent from my iPhone On Jan 29, 2015, at 5:04 PM, Allison, Timothy B.

Re: FW: Any interest in running Apache Tika as part of CommonCrawl?

2015-04-03 Thread Oleg Tikhonov
Hi Tim, Having looked at CC, a couple of ideas crossed my mind. I think it's cool. +1. BR, Oleg On 3 Apr 2015 17:29, Allison, Timothy B. talli...@mitre.org wrote: All, What do you think? https://groups.google.com/forum/#!topic/common-crawl/Cv21VRQjGN0 On Friday, April 3, 2015 at 8:23:11

Re: [VOTE] Apache Tika 1.8 Release Candidate #2

2015-04-15 Thread Oleg Tikhonov
Hi Tyler, good job, indeed !!! [x] +1 Release this package as Apache Tika 1.8 On Wed, Apr 15, 2015 at 8:22 AM, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote: Thanks Tyler! +1 from me: SIGS, checksums check out: [chipotle:~/tmp/apache-tika-1.8-rc2] mattmann%

Re: [VOTE] Release Apache Tika 1.8 Candidate #1

2015-04-08 Thread Oleg Tikhonov
Hi, [x] +1 Release this package as Apache Tika 1.8. Tested on: Ubuntu 14.10, x86_64. Java 1.7 (Oracle) Don't we want to update the following dependencies: biz.aQute:bndlib 1.43.0 -> 2.0.0.20130123-133441 org.apache.felix:org.apache.felix.scr.annotations 1.6.0 -> 1.9.10

Re: [VOTE] Release Apache Tika 1.9 Candidate #2

2015-06-09 Thread Oleg Tikhonov
Hi, All basic tests are passed. java version 1.7.0_75 Java(TM) SE Runtime Environment (build 1.7.0_75-b13) Java HotSpot(TM) 64-Bit Server VM (build 24.75-b04, mixed mode) Linux/Ubuntu x86_64 Superb !!! [x] +1 Release this package as Apache Tika 1.9 Thanks, Oleg On Tue, Jun 9, 2015 at 2:12 PM,

Re: Apache Tika: In use at Goldman Sachs

2015-08-20 Thread Oleg Tikhonov
Wow !!! Amazing. How does it perform? BR, Oleg On Thu, Aug 20, 2015 at 9:48 AM, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote: Just saw this online: http://www.informationweek.com/software/enterprise-applications/goldman-sachs-puts-elasticsearch-to-work/d/d-id/1321778

Re: release Tika 1.10?

2015-08-04 Thread Oleg Tikhonov
Thanks! +1 BR, Oleg On Tue, Aug 4, 2015 at 5:37 AM, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote: +1 ++ Chris Mattmann, Ph.D. Chief Architect Instrument Software and Science Data Systems Section (398) NASA

Re: [VOTE] Apache Tika 1.10 Release Candidate #1

2015-08-04 Thread Oleg Tikhonov
Hi, thanks for doing that !!! +1 for the release. Ran on Kubuntu 15 x64. All basic tests are passed. BR, Oleg On Tue, Aug 4, 2015 at 6:17 AM, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote: +1 from me, great work Dave SIGS and CHECKSUMS are sound:

Re: Bayesian N-Gram Language Detection

2015-07-29 Thread Oleg Tikhonov
+1 !!! My two cents. Please also add ability to change/retrain/tote language profiles. Thanks !!! BR, Oleg On Wed, Jul 29, 2015 at 3:59 AM, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote: Cool. Well with this one I found, along with language-detector, along with Ramirez and the

Re: [VOTE] Apache Tika 1.11 Release Candidate #1

2015-10-25 Thread Oleg Tikhonov
Hi guys, all looks fine on basic set up in x86_64 Ubuntu, however I got the following: Running org.apache.tika.parser.journal.JournalParserTest 25 Oct 2015 10:45:53 WARN PhaseInterceptorChain - Interceptor for { http://localhost:8080/grobid}WebClient has thrown exception, unwinding now

Re: Remove support for building language identifier profiles?

2015-08-30 Thread Oleg Tikhonov
Hi Ken, I would choose the last option you've mentioned. -- Oleg On Sat, Aug 29, 2015 at 7:58 PM, Ken Krugler kkrugler_li...@transpac.com wrote: Hi all, As part of integrating language-detector into Tika (see TIKA-1723), I noticed TIKA-546 (Add ability to create language profiles to

Re: [ANNOUNCE] Welcome Bob Paulin as Tika Committer + PMC Member

2015-09-17 Thread Oleg Tikhonov
Good intro. Welcome aboard. Oleg On 17 Sep 2015 03:05, "David Meikle" wrote: > Hello All, > > Please welcome Bob Paulin as he joins us as the latest Tika committer and > PMC Member. > > Bob, please feel free to say a bit about yourself as an introduction to > the group. > >

Re: [DISCUSS] Moving to Git

2015-11-19 Thread Oleg Tikhonov
+1. There is a bunch of add-ons. For instance - git flow. On Wed, Nov 18, 2015 at 7:15 PM, Mattmann, Chris A (3980) < chris.a.mattm...@jpl.nasa.gov> wrote: > Hey Nick, > > Git has something similar to svn:externals: > > http://stackoverflow.com/questions/571232/svnexternals-equivalent-in-git >

Re: [VOTE] Apache Tika 1.12 Release Candidate #1

2016-01-28 Thread Oleg Tikhonov
Hi Chris, thanks for doing it. Yesterday I successfully built Tika using mvn clean install. All tests passed. Platform: x86_64 Kubuntu with Oracle Java 8. Nothing special was run. [x] +1 Release this package as Apache Tika 1.12 Best regards, Oleg On Mon, Jan 25, 2016 at 9:58 PM,

Re: Master Build Failing

2016-10-25 Thread Oleg Tikhonov
Hi Luis, Here is what I did: git clone https://git-wip-us.apache.org/repos/asf/tika.git git branch * master gdalinfo --version GDAL 1.11.3, released 2015/09/16 mvn clean install -U Tests run: 3, Failures: 0, Errors: 0, Skipped: 1, Time elapsed: 42.59 sec - in

Re: [VOTE] Apache Tika 1.14 Release Candidate #1

2016-10-20 Thread Oleg Tikhonov
Hi, +1 for release. Built on Ubuntu 16.04 and CentOS 7.0 x86_64. All tests are passed. Java 8. BR, Oleg On Thu, Oct 20, 2016 at 5:54 PM, Julien Nioche < lists.digitalpeb...@gmail.com> wrote: > Hi Tim > > I had exiftool installed indeed, so that might explain it. All tests now > pass. Will have

Re: 1.15?

2017-04-18 Thread Oleg Tikhonov
+1 for the release. On Mon, Apr 17, 2017 at 8:39 PM, David Meikle wrote: > +1 from me too. > > Cheers, > Dave > > On 13 April 2017 at 13:08, Konstantin Gribov wrote: > > > Preliminary +1 from me, I'll the a closer look this weekend > > > > чт, 13 апр. 2017,

Re: [VOTE] Release Apache Tika 1.16 Candidate #1

2017-07-12 Thread Oleg Tikhonov
[x]+1 Release this package as Apache Tika 1.16 Basic tests and build on Ubuntu 17.04 + Java 8 (Oracle). Thanks, Oleg On Wed, Jul 12, 2017 at 11:03 AM, Dave Meikle wrote: > On 8 July 2017 at 03:40, Tim Allison wrote: > > > > > A candidate for the Tika

Re: [VOTE] Release Apache Tika 1.15 Candidate #1

2017-05-23 Thread Oleg Tikhonov
Hi guys, Here is what's wrong ... <groupId>org.apache.tika</groupId> <artifactId>tika-parent</artifactId> <version>1.16-SNAPSHOT</version> <relativePath>tika-parent/pom.xml</relativePath> If you clone the project, the top-level pom contains this. The fix is to change 1.16-SNAPSHOT to 1.15. What I did was: git clone https://github.com/apache/tika.git Any

Re: [VOTE] Release Apache Tika 1.15 Candidate #1

2017-05-23 Thread Oleg Tikhonov
Also put ./tika-dl/src/test/java/org/apache/tika/dl/imagerec/DL4JInceptionV3NetTest.java @Ignore because I do not have any DL installed on my comp. On Tue, May 23, 2017 at 11:00 PM, Oleg Tikhonov <o...@apache.org> wrote: > Hi guys, > Here is wrong ... > > org.apache.tika

Re: [VOTE] Release Apache Tika 1.15 Candidate #2

2017-05-24 Thread Oleg Tikhonov
[x] +1 Release this package as Apache Tika 1.15 [INFO] [INFO] BUILD SUCCESS [INFO] [INFO] Total time: 19:41 min [INFO] Finished at:

Re: [VOTE] Release Apache Tika 1.15 Candidate #1

2017-05-24 Thread Oleg Tikhonov
@gmail.com] On Behalf Of > Oleg Tikhonov > Sent: Tuesday, May 23, 2017 4:33 PM > To: dev@tika.apache.org > Subject: Re: [VOTE] Release Apache Tika 1.15 Candidate #1 > > Also put > ./tika-dl/src/test/java/org/apache/tika/dl/imagerec/ > DL4JInceptionV3NetTest.java > @Ignore b

Re: experiences with Tika in Docker

2017-06-02 Thread Oleg Tikhonov
Guys, I can help with Tika dockerization. Let's just design/plan what we're going to do. On Thu, Jun 1, 2017 at 4:02 PM, Eric Pugh wrote: > As the Tika project starts embracing more non Java tools (I’m thinking of > Tesseract for example), dockerizing your Tika setup

Re: [jira] [Created] (TIKA-2647) Create a "security" page on our website

2018-05-22 Thread Oleg Tikhonov
Hi Tim, definitely would be helpful ! +1 Thanks, Oleg On Tue, May 22, 2018 at 3:38 PM, Tim Allison (JIRA) wrote: > Tim Allison created TIKA-2647: > - > > Summary: Create a "security" page on our website > Key:

Re: [jira] [Commented] (TIKA-2725) Make tika-server robust against ooms/infinite loops/memory leaks

2018-09-06 Thread Oleg Tikhonov
With this approach, it is probably the only way ... What is the typical tika-server environment? Stand-alone, distributed ... like replicas in a cluster? Is there a time limit for recovery? How do we know what point to start processing from? Do we mark documents which were processed? For example, if

Re: [jira] [Created] (TIKA-2725) Make tika-server robust against ooms/infinite loops/memory leaks

2018-09-06 Thread Oleg Tikhonov
Hi Tim, What if watcher thread fails/gets stuck etc? On Thu, Sep 6, 2018 at 3:27 PM Tim Allison (JIRA) wrote: > Tim Allison created TIKA-2725: > - > > Summary: Make tika-server robust against ooms/infinite > loops/memory leaks >

Re: [jira] [Commented] (TIKA-2725) Make tika-server robust against ooms/infinite loops/memory leaks

2018-09-06 Thread Oleg Tikhonov
Ideally, tika server is dockerized and runs on swarm as a service. In addition, it has a healthcheck mechanism, say something ... like an HTTP GET request with return code 200. Docker will run this healthcheck periodically and, if it fails, will restart tika server. However, we are far away from that. Two ways to go, fmpov
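The probe described above (an HTTP GET that must return 200) could look roughly like this in Java. This is a sketch under stated assumptions: `HealthCheckSketch`, the `/health` path, and the local stub server are hypothetical and exist only to make the example runnable with the JDK alone; they are not tika-server endpoints or Docker configuration.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URI;

/** Hypothetical health probe: healthy means "GET returned HTTP 200 in time". */
final class HealthCheckSketch {

    static boolean isHealthy(String url, int timeoutMillis) {
        HttpURLConnection conn = null;
        try {
            conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setRequestMethod("GET");
            conn.setConnectTimeout(timeoutMillis);
            conn.setReadTimeout(timeoutMillis);
            return conn.getResponseCode() == 200;
        } catch (IOException | IllegalArgumentException e) {
            // Refused connection, timeout, or malformed URL all count as unhealthy.
            return false;
        } finally {
            if (conn != null) {
                conn.disconnect();
            }
        }
    }

    /** Starts a local stub that answers every request with the given status; demo only. */
    static HttpServer startStub(int status) {
        try {
            HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
            server.createContext("/", exchange -> {
                exchange.sendResponseHeaders(status, -1); // -1 means no response body
                exchange.close();
            });
            server.start();
            return server;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Builds the probe URL for a running stub; "/health" is an assumed path. */
    static String urlOf(HttpServer server) {
        return "http://127.0.0.1:" + server.getAddress().getPort() + "/health";
    }

    public static void main(String[] args) {
        HttpServer stub = startStub(200);
        String url = urlOf(stub);
        System.out.println("running stub healthy: " + isHealthy(url, 2000));
        stub.stop(0);
        System.out.println("stopped stub healthy: " + isHealthy(url, 2000));
    }
}
```

An orchestrator would call such a probe on a fixed interval and restart the service after a number of consecutive failures; treating timeouts as failures matters here, since a server stuck in an infinite loop may still accept connections.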

Re: [jira] [Commented] (TIKA-2725) Make tika-server robust against ooms/infinite loops/memory leaks

2018-09-07 Thread Oleg Tikhonov
Yep, seems to be best match... unblocked execution. On Thu, Sep 6, 2018, 23:47 Tim Allison (JIRA) wrote: > > [ > https://issues.apache.org/jira/browse/TIKA-2725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16606373#comment-16606373 > ] > > Tim Allison commented on

Re: [VOTE] Release Apache Tika 1.18 Candidate #1

2018-04-11 Thread Oleg Tikhonov
[+] Release this package as Apache Tika 1.18 [INFO] Apache Tika parent . SUCCESS [ 12.379 s] [INFO] Apache Tika core ... SUCCESS [ 55.650 s] [INFO] Apache Tika parsers SUCCESS [05:55 min] [INFO]

Re: [VOTE] Release Apache Tika 1.18 Candidate #3

2018-04-22 Thread Oleg Tikhonov
Hi, thanks a lot. [x] +1 Release this package as Apache Tika 1.18 Even did a security scan: mvn org.owasp:dependency-check-maven:3.1.2:check Report is attached. Best regards, Oleg On Sat, Apr 21, 2018 at 12:54 AM, talli...@apache.org wrote: > All, > A candidate for the
