[jira] [Commented] (TIKA-2400) Standardizing current Object Recognition REST parsers
[ https://issues.apache.org/jira/browse/TIKA-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16262055#comment-16262055 ] Hudson commented on TIKA-2400: -- SUCCESS: Integrated in Jenkins build Tika-trunk #1395 (See [https://builds.apache.org/job/Tika-trunk/1395/]) Update changes with TIKA-2400 / GH-208 (chris.a.mattmann: [https://github.com/apache/tika/commit/946614badc212eab8cd59a437ed28f07b14c2fc4]) * (edit) CHANGES.txt > Standardizing current Object Recognition REST parsers > - > > Key: TIKA-2400 > URL: https://issues.apache.org/jira/browse/TIKA-2400 > Project: Tika > Issue Type: Sub-task > Components: parser >Reporter: Thejan Wijesinghe >Assignee: Chris A. Mattmann >Priority: Minor > Fix For: 1.17 > > > # This involves adding apiBaseUris and refactoring current Object Recognition > REST parsers, > # Refactoring dockerfiles related to those parsers. > # Moving the logic related to checking minimum confidence into servers -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (TIKA-2400) Standardizing current Object Recognition REST parsers
[ https://issues.apache.org/jira/browse/TIKA-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16261082#comment-16261082 ] ASF GitHub Bot commented on TIKA-2400: -- chrismattmann commented on issue #208: Fix for TIKA-2400 Standardizing current Object Recognition REST parsers URL: https://github.com/apache/tika/pull/208#issuecomment-346096928 nevermind @ThejanW I did it This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Standardizing current Object Recognition REST parsers > - > > Key: TIKA-2400 > URL: https://issues.apache.org/jira/browse/TIKA-2400 > Project: Tika > Issue Type: Sub-task > Components: parser >Reporter: Thejan Wijesinghe >Assignee: Chris A. Mattmann >Priority: Minor > Fix For: 1.17 > > > # This involves adding apiBaseUris and refactoring current Object Recognition > REST parsers, > # Refactoring dockerfiles related to those parsers. > # Moving the logic related to checking minimum confidence into servers -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (TIKA-2400) Standardizing current Object Recognition REST parsers
[ https://issues.apache.org/jira/browse/TIKA-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16261069#comment-16261069 ] ASF GitHub Bot commented on TIKA-2400: -- chrismattmann commented on issue #208: Fix for TIKA-2400 Standardizing current Object Recognition REST parsers URL: https://github.com/apache/tika/pull/208#issuecomment-346094114 @ThejanW can you please also remove the Docker files present in captioning/tf and in recognition/tf? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Standardizing current Object Recognition REST parsers > - > > Key: TIKA-2400 > URL: https://issues.apache.org/jira/browse/TIKA-2400 > Project: Tika > Issue Type: Sub-task > Components: parser >Reporter: Thejan Wijesinghe >Priority: Minor > Fix For: 1.17 > > > # This involves adding apiBaseUris and refactoring current Object Recognition > REST parsers, > # Refactoring dockerfiles related to those parsers. > # Moving the logic related to checking minimum confidence into servers -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (TIKA-2400) Standardizing current Object Recognition REST parsers
[ https://issues.apache.org/jira/browse/TIKA-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16261065#comment-16261065 ] ASF GitHub Bot commented on TIKA-2400: -- chrismattmann closed pull request #208: Fix for TIKA-2400 Standardizing current Object Recognition REST parsers URL: https://github.com/apache/tika/pull/208 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance: As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic):

diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/captioning/tf/TensorflowRESTCaptioner.java b/tika-parsers/src/main/java/org/apache/tika/parser/captioning/tf/TensorflowRESTCaptioner.java
index d49ef0fed..5fd9d9a97 100644
--- a/tika-parsers/src/main/java/org/apache/tika/parser/captioning/tf/TensorflowRESTCaptioner.java
+++ b/tika-parsers/src/main/java/org/apache/tika/parser/captioning/tf/TensorflowRESTCaptioner.java
@@ -72,16 +72,16 @@
                     MediaType.image("gif")
             })));
 
-    private static final String LABEL_LANG = "en";
+    private static final String LABEL_LANG = "eng";
 
     @Field
-    private URI apiBaseUri;
+    private URI apiBaseUri = URI.create("http://localhost:8764/inception/v3");
 
     @Field
-    private int captions;
+    private int captions = 5;
 
     @Field
-    private int maxCaptionLength;
+    private int maxCaptionLength = 15;
 
     private URI apiUri;
 
@@ -107,7 +107,7 @@ public boolean isAvailable() {
     public void initialize(Map<String, Param> params) throws TikaConfigException {
         try {
             healthUri = URI.create(apiBaseUri + "/ping");
-            apiUri = URI.create(apiBaseUri + String.format(Locale.getDefault(), "/captions?beam_size=%1$d&max_caption_length=%2$d",
+            apiUri = URI.create(apiBaseUri + String.format(Locale.getDefault(), "/caption/image?beam_size=%1$d&max_caption_length=%2$d",
                     captions, maxCaptionLength));
 
             DefaultHttpClient client = new DefaultHttpClient();
diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/recognition/ObjectRecognitionParser.java b/tika-parsers/src/main/java/org/apache/tika/parser/recognition/ObjectRecognitionParser.java
index 37caf4538..a5a126ba9 100644
--- a/tika-parsers/src/main/java/org/apache/tika/parser/recognition/ObjectRecognitionParser.java
+++ b/tika-parsers/src/main/java/org/apache/tika/parser/recognition/ObjectRecognitionParser.java
@@ -55,11 +55,9 @@
  * <properties>
  *  <parsers>
  *   <parser class="org.apache.tika.parser.recognition.ObjectRecognitionParser">
- *    <mime>image/jpeg</mime>
  *    <params>
- *      <param name="topN" type="int">2</param>
- *      <param name="minConfidence" type="double">0.015</param>
  *      <param name="class" type="string">org.apache.tika.parser.recognition.tf.TensorflowRESTRecogniser</param>
+ *      <param name="class" type="string">org.apache.tika.parser.captioning.tf.TensorflowRESTCaptioner</param>
  *    </params>
  *   </parser>
  *  </parsers>
@@ -83,12 +81,6 @@ public int compare(RecognisedObject o1, RecognisedObject o2) {
         }
     };
 
-    @Field
-    private double minConfidence = 0.05;
-
-    @Field
-    private int topN = 2;
-
     private ObjectRecogniser recogniser;
 
     @Field(name = "class")
@@ -102,7 +94,6 @@ public void initialize(Map<String, Param> params) throws TikaConfigException {
         recogniser.initialize(params);
         LOG.info("Recogniser = {}", recogniser.getClass().getName());
         LOG.info("Recogniser Available = {}", recogniser.isAvailable());
-        LOG.info("minConfidence = {}, topN={}", minConfidence, topN);
     }
 
     @Override
@@ -140,29 +131,17 @@ public synchronized void parse(InputStream stream, ContentHandler handler, Metad
             for (RecognisedObject object : objects) {
                 if (object instanceof CaptionObject) {
                     if (xhtmlStartVal == null) xhtmlStartVal = "captions";
-                    LOG.debug("Add {}", object);
-                    String mdValue = String.format(Locale.ENGLISH, "%s (%.5f)",
-                            object.getLabel(), object.getConfidence());
-                    metadata.add(MD_KEY_IMG_CAP, mdValue);
-                    acceptedObjects.add(object);
+                    String labelAndConfidence = String.format(Locale.ENGLISH, "%s (%.5f)", object.getLabel(), object.getConfidence());
+                    metadata.add(MD_KEY_IMG_CAP, labelAndConfidence);
                     xhtmlIds.add(String.valueOf(count++));
                 } else {
                     if (xhtmlStartVal == null) xhtmlStartVal = "objects";
-                    if (object.getConfidence() >= minConfidence) {
-                        count++;
-                        LOG.info("Add {}", object);
-                        String mdValue =
[jira] [Commented] (TIKA-2400) Standardizing current Object Recognition REST parsers
[ https://issues.apache.org/jira/browse/TIKA-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16261066#comment-16261066 ] ASF GitHub Bot commented on TIKA-2400: -- chrismattmann commented on issue #208: Fix for TIKA-2400 Standardizing current Object Recognition REST parsers URL: https://github.com/apache/tika/pull/208#issuecomment-346093396 yes, looks great! great job @ThejanW This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Standardizing current Object Recognition REST parsers > - > > Key: TIKA-2400 > URL: https://issues.apache.org/jira/browse/TIKA-2400 > Project: Tika > Issue Type: Sub-task > Components: parser >Reporter: Thejan Wijesinghe >Priority: Minor > Fix For: 1.17 > > > # This involves adding apiBaseUris and refactoring current Object Recognition > REST parsers, > # Refactoring dockerfiles related to those parsers. > # Moving the logic related to checking minimum confidence into servers -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (TIKA-2400) Standardizing current Object Recognition REST parsers
[ https://issues.apache.org/jira/browse/TIKA-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16261049#comment-16261049 ] ASF GitHub Bot commented on TIKA-2400: -- ThejanW commented on issue #208: Fix for TIKA-2400 Standardizing current Object Recognition REST parsers URL: https://github.com/apache/tika/pull/208#issuecomment-346091934 @chrismattmann can we merge this? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Standardizing current Object Recognition REST parsers > - > > Key: TIKA-2400 > URL: https://issues.apache.org/jira/browse/TIKA-2400 > Project: Tika > Issue Type: Sub-task > Components: parser >Reporter: Thejan Wijesinghe >Priority: Minor > Fix For: 1.17 > > > # This involves adding apiBaseUris and refactoring current Object Recognition > REST parsers, > # Refactoring dockerfiles related to those parsers. > # Moving the logic related to checking minimum confidence into servers -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (TIKA-2400) Standardizing current Object Recognition REST parsers
[ https://issues.apache.org/jira/browse/TIKA-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16240829#comment-16240829 ] ASF GitHub Bot commented on TIKA-2400: -- ThejanW commented on issue #208: Fix for TIKA-2400 Standardizing current Object Recognition REST parsers URL: https://github.com/apache/tika/pull/208#issuecomment-342274754 @thammegowda @chrismattmann @smadha This is complete now. I have updated the TensorFlow version and models to the latest (TF 1.4.0). The object recognition REST parsers currently do not work because the URLs of imagenet_lsvrc_2015_synsets.txt and imagenet_metadata.txt have changed. This PR also resolves those issues, so it would be nice if we could merge it before 1.17. Testing instructions are included in the initial comment. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Standardizing current Object Recognition REST parsers > - > > Key: TIKA-2400 > URL: https://issues.apache.org/jira/browse/TIKA-2400 > Project: Tika > Issue Type: Sub-task > Components: parser >Reporter: Thejan Wijesinghe >Priority: Minor > Fix For: 1.17 > > > # This involves adding apiBaseUris and refactoring current Object Recognition > REST parsers, > # Refactoring dockerfiles related to those parsers. > # Moving the logic related to checking minimum confidence into servers -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (TIKA-2400) Standardizing current Object Recognition REST parsers
[ https://issues.apache.org/jira/browse/TIKA-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16205529#comment-16205529 ] ASF GitHub Bot commented on TIKA-2400: -- ThejanW commented on issue #208: Fix for TIKA-2400 Standardizing current Object Recognition REST parsers URL: https://github.com/apache/tika/pull/208#issuecomment-336805333 The new urls are, https://raw.githubusercontent.com/tensorflow/models/master/research/inception/inception/data/imagenet_lsvrc_2015_synsets.txt https://raw.githubusercontent.com/tensorflow/models/master/research/inception/inception/data/imagenet_metadata.txt This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Standardizing current Object Recognition REST parsers > - > > Key: TIKA-2400 > URL: https://issues.apache.org/jira/browse/TIKA-2400 > Project: Tika > Issue Type: Sub-task > Components: parser >Reporter: Thejan Wijesinghe >Priority: Minor > Fix For: 1.17 > > > # This involves adding apiBaseUris and refactoring current Object Recognition > REST parsers, > # Refactoring dockerfiles related to those parsers. > # Moving the logic related to checking minimum confidence into servers -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (TIKA-2400) Standardizing current Object Recognition REST parsers
[ https://issues.apache.org/jira/browse/TIKA-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16205509#comment-16205509 ] ASF GitHub Bot commented on TIKA-2400: -- ThejanW commented on issue #208: Fix for TIKA-2400 Standardizing current Object Recognition REST parsers URL: https://github.com/apache/tika/pull/208#issuecomment-336800656 I was getting the same error. Nothing is wrong with your docker setup. The problem was with the download url of **imagenet_lsvrc_2015_synsets.txt** & **imagenet_metadata.txt**. Apparently tf maintainers have moved these meta files and models to another repo https://github.com/tensorflow/serving. See, https://raw.githubusercontent.com/tensorflow/models/master/inception/inception/data/imagenet_lsvrc_2015_synsets.txt https://raw.githubusercontent.com/tensorflow/models/master/inception/inception/data/imagenet_metadata.txt you will get 404. I'll update with the new URLs This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Standardizing current Object Recognition REST parsers > - > > Key: TIKA-2400 > URL: https://issues.apache.org/jira/browse/TIKA-2400 > Project: Tika > Issue Type: Sub-task > Components: parser >Reporter: Thejan Wijesinghe >Priority: Minor > Fix For: 1.17 > > > # This involves adding apiBaseUris and refactoring current Object Recognition > REST parsers, > # Refactoring dockerfiles related to those parsers. > # Moving the logic related to checking minimum confidence into servers -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (TIKA-2400) Standardizing current Object Recognition REST parsers
[ https://issues.apache.org/jira/browse/TIKA-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16205506#comment-16205506 ] ASF GitHub Bot commented on TIKA-2400: -- ThejanW commented on issue #208: Fix for TIKA-2400 Standardizing current Object Recognition REST parsers URL: https://github.com/apache/tika/pull/208#issuecomment-336800656 I was getting the same error. Nothing is wrong with your docker setup. The problem was with the download url of **imagenet_lsvrc_2015_synsets.txt** & **imagenet_metadata.txt**. Apparently tf maintainers have moved these files to another location. See, https://raw.githubusercontent.com/tensorflow/models/master/inception/inception/data/imagenet_lsvrc_2015_synsets.txt https://raw.githubusercontent.com/tensorflow/models/master/inception/inception/data/imagenet_metadata.txt you will get 404. I'll update with the new URLs This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Standardizing current Object Recognition REST parsers > - > > Key: TIKA-2400 > URL: https://issues.apache.org/jira/browse/TIKA-2400 > Project: Tika > Issue Type: Sub-task > Components: parser >Reporter: Thejan Wijesinghe >Priority: Minor > Fix For: 1.17 > > > # This involves adding apiBaseUris and refactoring current Object Recognition > REST parsers, > # Refactoring dockerfiles related to those parsers. > # Moving the logic related to checking minimum confidence into servers -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (TIKA-2400) Standardizing current Object Recognition REST parsers
[ https://issues.apache.org/jira/browse/TIKA-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16205254#comment-16205254 ] ASF GitHub Bot commented on TIKA-2400: -- thammegowda commented on issue #208: Fix for TIKA-2400 Standardizing current Object Recognition REST parsers URL: https://github.com/apache/tika/pull/208#issuecomment-336733795 I can't test the image docker or the video docker.
```
docker run -it -p 8764:8764 uscdatascience/inception-rest-tika
Unable to find image 'uscdatascience/inception-rest-tika:latest' locally
latest: Pulling from uscdatascience/inception-rest-tika
9fb6c798fa41: Already exists
3b61febd4aef: Already exists
9d99b9777eb0: Already exists
d010c8cf75d7: Already exists
7fac07fb303e: Already exists
5601f0fca79b: Already exists
dad2688af054: Already exists
efa7176a3f6c: Already exists
5ba941a90099: Already exists
b5a6f1155f94: Already exists
7e863f718dc4: Already exists
Digest: sha256:20840a9c9e5cd2fed7d6c19ba38901c9ba6ec06fe0afe13b9f6624dc12e2
Status: Downloaded newer image for uscdatascience/inception-rest-tika:latest
Can't import video libraries, No video functionality is available
Traceback (most recent call last):
  File "/usr/bin/inceptionapi.py", line 265, in <module>
    app = Classifier(__name__)
  File "/usr/bin/inceptionapi.py", line 221, in __init__
    self.names = imagenet.create_readable_names_for_imagenet_labels()
  File "/models-c15fada28113eca32dc98d6e3bec4755d0d5b4c2/slim/datasets/imagenet.py", line 93, in create_readable_names_for_imagenet_labels
    assert num_synsets_in_ilsvrc == 1000
AssertionError
```
```
$ docker run -it -p 8764:8764 uscdatascience/inception-video-rest-tika
.
cv2.__version__ 3.2.0
Traceback (most recent call last):
  File "/usr/bin/inceptionapi.py", line 265, in <module>
    app = Classifier(__name__)
  File "/usr/bin/inceptionapi.py", line 221, in __init__
    self.names = imagenet.create_readable_names_for_imagenet_labels()
  File "/models-c15fada28113eca32dc98d6e3bec4755d0d5b4c2/slim/datasets/imagenet.py", line 93, in create_readable_names_for_imagenet_labels
    assert num_synsets_in_ilsvrc == 1000
AssertionError
```
I will wait for others to test and confirm whether the issue is with my docker setup or with the images. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Standardizing current Object Recognition REST parsers > - > > Key: TIKA-2400 > URL: https://issues.apache.org/jira/browse/TIKA-2400 > Project: Tika > Issue Type: Sub-task > Components: parser >Reporter: Thejan Wijesinghe >Priority: Minor > Fix For: 1.17 > > > # This involves adding apiBaseUris and refactoring current Object Recognition > REST parsers, > # Refactoring dockerfiles related to those parsers. > # Moving the logic related to checking minimum confidence into servers -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (TIKA-2400) Standardizing current Object Recognition REST parsers
[ https://issues.apache.org/jira/browse/TIKA-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16187225#comment-16187225 ] ASF GitHub Bot commented on TIKA-2400: -- smadha commented on a change in pull request #208: Fix for TIKA-2400 Standardizing current Object Recognition REST parsers URL: https://github.com/apache/tika/pull/208#discussion_r142016934 ## File path: tika-parsers/src/main/java/org/apache/tika/parser/recognition/ObjectRecognitionParser.java ## @@ -140,29 +133,17 @@ public synchronized void parse(InputStream stream, ContentHandler handler, Metad for (RecognisedObject object : objects) { if (object instanceof CaptionObject) { if (xhtmlStartVal == null) xhtmlStartVal = "captions"; -LOG.debug("Add {}", object); -String mdValue = String.format(Locale.ENGLISH, "%s (%.5f)", -object.getLabel(), object.getConfidence()); -metadata.add(MD_KEY_IMG_CAP, mdValue); -acceptedObjects.add(object); +String mdVal = String.format(Locale.ENGLISH, "%s (%.5f)", object.getLabel(), object.getConfidence()); Review comment: As of now, people have to split the string to get the label and the confidence. I think traversing two arrays in a single loop would be easier than that, and we can ensure that the two arrays have the same length. Also, if you want JSON, why not store serialised JSON in one metadata key? It looks bad, but it is better than a single String with a space-separated label and confidence. I'll leave it up to you guys. :+1: This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Standardizing current Object Recognition REST parsers > - > > Key: TIKA-2400 > URL: https://issues.apache.org/jira/browse/TIKA-2400 > Project: Tika > Issue Type: Sub-task > Components: parser >Reporter: Thejan Wijesinghe >Priority: Minor > Fix For: 1.17 > > > # This involves adding apiBaseUris and refactoring current Object Recognition > REST parsers, > # Refactoring dockerfiles related to those parsers. > # Moving the logic related to checking minimum confidence into servers -- This message was sent by Atlassian JIRA (v6.4.14#64029)
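To make the trade-off in the review comment above concrete, here is a minimal sketch of what a consumer has to do today to recover the label and the confidence from a single `"%s (%.5f)"` metadata value. The class name, the helper methods, and the regular expression are hypothetical and are not part of Tika; only the format string comes from the diff.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LabelConfidenceSketch {
    // Same format string as in the diff: label, space, confidence in parentheses.
    static String format(String label, double confidence) {
        return String.format(Locale.ENGLISH, "%s (%.5f)", label, confidence);
    }

    // Hypothetical pattern a consumer would need in order to split the value again.
    private static final Pattern VALUE = Pattern.compile("^(.*) \\(([0-9.]+)\\)$");

    static String labelOf(String mdValue) {
        Matcher m = VALUE.matcher(mdValue);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unexpected metadata value: " + mdValue);
        }
        return m.group(1);
    }

    static double confidenceOf(String mdValue) {
        Matcher m = VALUE.matcher(mdValue);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unexpected metadata value: " + mdValue);
        }
        return Double.parseDouble(m.group(2));
    }

    public static void main(String[] args) {
        String mdValue = format("tabby cat", 0.5);
        System.out.println(mdValue);
        System.out.println(labelOf(mdValue) + " / " + confidenceOf(mdValue));
    }
}
```

Two parallel metadata keys, or one key holding serialised JSON, would remove the need for this parsing step; which of the two is preferable is left open in the thread.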
[jira] [Commented] (TIKA-2400) Standardizing current Object Recognition REST parsers
[ https://issues.apache.org/jira/browse/TIKA-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16186973#comment-16186973 ] ASF GitHub Bot commented on TIKA-2400: -- ThejanW commented on issue #208: Fix for TIKA-2400 Standardizing current Object Recognition REST parsers URL: https://github.com/apache/tika/pull/208#issuecomment-333291933 @thammegowda you can't test the video docker, because you haven't pulled the correct docker image. The docker image for video docker is `uscdatascience/inception-video-rest-tika`. Please see my initial comment of this PR. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Standardizing current Object Recognition REST parsers > - > > Key: TIKA-2400 > URL: https://issues.apache.org/jira/browse/TIKA-2400 > Project: Tika > Issue Type: Sub-task > Components: parser >Reporter: Thejan Wijesinghe >Priority: Minor > Fix For: 1.17 > > > # This involves adding apiBaseUris and refactoring current Object Recognition > REST parsers, > # Refactoring dockerfiles related to those parsers. > # Moving the logic related to checking minimum confidence into servers -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (TIKA-2400) Standardizing current Object Recognition REST parsers
[ https://issues.apache.org/jira/browse/TIKA-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16186968#comment-16186968 ] ASF GitHub Bot commented on TIKA-2400: -- ThejanW commented on a change in pull request #208: Fix for TIKA-2400 Standardizing current Object Recognition REST parsers URL: https://github.com/apache/tika/pull/208#discussion_r142000829 ## File path: tika-parsers/src/main/java/org/apache/tika/parser/recognition/ObjectRecognitionParser.java ## @@ -83,12 +83,6 @@ public int compare(RecognisedObject o1, RecognisedObject o2) { } }; -@Field -private double minConfidence = 0.05; Review comment: @thammegowda your understanding is exactly correct. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Standardizing current Object Recognition REST parsers > - > > Key: TIKA-2400 > URL: https://issues.apache.org/jira/browse/TIKA-2400 > Project: Tika > Issue Type: Sub-task > Components: parser >Reporter: Thejan Wijesinghe >Priority: Minor > Fix For: 1.17 > > > # This involves adding apiBaseUris and refactoring current Object Recognition > REST parsers, > # Refactoring dockerfiles related to those parsers. > # Moving the logic related to checking minimum confidence into servers -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (TIKA-2400) Standardizing current Object Recognition REST parsers
[ https://issues.apache.org/jira/browse/TIKA-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16178404#comment-16178404 ] ASF GitHub Bot commented on TIKA-2400: -- thammegowda commented on a change in pull request #208: Fix for TIKA-2400 Standardizing current Object Recognition REST parsers URL: https://github.com/apache/tika/pull/208#discussion_r140669630 ## File path: tika-parsers/src/main/java/org/apache/tika/parser/recognition/ObjectRecognitionParser.java ## @@ -83,12 +83,6 @@ public int compare(RecognisedObject o1, RecognisedObject o2) { } }; -@Field -private double minConfidence = 0.05; Review comment: Correct me if my understanding is wrong: - We have removed minConfidence and topN from ObjectRecognitionParser. + We have added them to the classes that implement the `ObjectRecogniser` interface, such as `TensorflowRESTRecogniser` and `TensorflowRESTCaptioner`. These are referred to as _clients_ in Thejan's terminology. + Each _client_ also has an accompanying URL, which allows these parameters to be tweaked. Food for design thought: we might not have a URL for every client. To be specific, we could have a client using DL4J that doesn't use REST communication. These parameters are therefore required by the client itself, so the client should own them. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Standardizing current Object Recognition REST parsers > - > > Key: TIKA-2400 > URL: https://issues.apache.org/jira/browse/TIKA-2400 > Project: Tika > Issue Type: Sub-task > Components: parser >Reporter: Thejan Wijesinghe >Priority: Minor > Fix For: 1.17 > > > # This involves adding apiBaseUris and refactoring current Object Recognition > REST parsers, > # Refactoring dockerfiles related to those parsers. 
> # Moving the logic related to checking minimum confidence into servers -- This message was sent by Atlassian JIRA (v6.4.14#64029)
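The design point in the comment above (a client that does not talk to a REST server still needs `minConfidence` and `topN`) can be sketched as a small client-side filter. Everything here is an assumption for illustration: `Detection` is a hypothetical stand-in for Tika's `RecognisedObject`, and `filter` is not an existing Tika method.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class ConfidenceFilterSketch {
    // Hypothetical stand-in for a recognised object: just a label and a confidence.
    static final class Detection {
        final String label;
        final double confidence;

        Detection(String label, double confidence) {
            this.label = label;
            this.confidence = confidence;
        }
    }

    // Keeps at most topN detections whose confidence is at least minConfidence,
    // ordered from most to least confident.
    static List<Detection> filter(List<Detection> all, double minConfidence, int topN) {
        List<Detection> sorted = new ArrayList<>(all);
        sorted.sort(Comparator.comparingDouble((Detection d) -> d.confidence).reversed());
        List<Detection> kept = new ArrayList<>();
        for (Detection d : sorted) {
            // The list is sorted, so the first low-confidence entry ends the scan.
            if (kept.size() >= topN || d.confidence < minConfidence) {
                break;
            }
            kept.add(d);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Detection> all = Arrays.asList(
                new Detection("cat", 0.90),
                new Detection("dog", 0.03),
                new Detection("car", 0.50));
        for (Detection d : filter(all, 0.05, 2)) {
            System.out.println(d.label + " " + d.confidence);
        }
    }
}
```

A REST-backed client could instead forward the same two values to the server as query parameters, which is the direction the issue description takes by moving the minimum-confidence check into the servers.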
[jira] [Commented] (TIKA-2400) Standardizing current Object Recognition REST parsers
[ https://issues.apache.org/jira/browse/TIKA-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16178403#comment-16178403 ] ASF GitHub Bot commented on TIKA-2400: -- thammegowda commented on a change in pull request #208: Fix for TIKA-2400 Standardizing current Object Recognition REST parsers URL: https://github.com/apache/tika/pull/208#discussion_r140669424 ## File path: tika-parsers/src/main/java/org/apache/tika/parser/captioning/tf/TensorflowRESTCaptioner.java ## @@ -107,7 +107,7 @@ public boolean isAvailable() { public void initialize(Map<String, Param> params) throws TikaConfigException { try { healthUri = URI.create(apiBaseUri + "/ping"); -apiUri = URI.create(apiBaseUri + String.format(Locale.getDefault(), "/captions?beam_size=%1$d&max_caption_length=%2$d", +apiUri = URI.create(apiBaseUri + String.format(Locale.getDefault(), "/caption/image?beam_size=%1$d&max_caption_length=%2$d", Review comment: Improvement: `String.format(Locale.getDefault(), ...)` and `String.format(...)` are equivalent, right (the default is inferred implicitly)? Rule of thumb: 1) When you have two options, pick the simpler one. To me, the latter looks simpler. 2) If you want to enforce a specific locale, that is not the same as using the default. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Standardizing current Object Recognition REST parsers > - > > Key: TIKA-2400 > URL: https://issues.apache.org/jira/browse/TIKA-2400 > Project: Tika > Issue Type: Sub-task > Components: parser >Reporter: Thejan Wijesinghe >Priority: Minor > Fix For: 1.17 > > > # This involves adding apiBaseUris and refactoring current Object Recognition > REST parsers, > # Refactoring dockerfiles related to those parsers. 
> # Moving the logic related to checking minimum confidence into servers -- This message was sent by Atlassian JIRA (v6.4.14#64029)
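A note on the locale question in the review comment above: per the `String.format` Javadoc, the no-locale overload uses `Locale.getDefault(Locale.Category.FORMAT)`, which normally matches `Locale.getDefault()` but can differ if the format-category default was set separately. For a machine-readable value such as a request URI, one option is to pin the locale explicitly. The following is only a minimal sketch under assumptions: the class, the method, and the query parameter names are placeholders for illustration, not the Tika code under review.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical helper, not part of Tika: builds a request URI with a pinned locale.
final class LocaleFormatSketch {

    // Locale.ROOT keeps the formatted digits independent of the JVM's default locale.
    // The query parameter names below are assumptions made for this sketch.
    static URI captionUri(String apiBaseUri, int beamSize, int maxLength) {
        String query = String.format(Locale.ROOT,
                "/caption/image?beam_size=%d&max_caption_length=%d", beamSize, maxLength);
        return URI.create(apiBaseUri + query);
    }

    public static void main(String[] args) {
        System.out.println(captionUri("http://localhost:8764", 3, 15));
    }
}
```

Whether this is preferable to the plain `String.format(...)` suggested in the review is a design choice: the plain form is shorter, while the pinned form does not depend on how the host JVM is configured.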
[jira] [Commented] (TIKA-2400) Standardizing current Object Recognition REST parsers
[ https://issues.apache.org/jira/browse/TIKA-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16178401#comment-16178401 ] ASF GitHub Bot commented on TIKA-2400: -- thammegowda commented on a change in pull request #208: Fix for TIKA-2400 Standardizing current Object Recognition REST parsers URL: https://github.com/apache/tika/pull/208#discussion_r140669367 ## File path: tika-parsers/src/main/java/org/apache/tika/parser/captioning/tf/TensorflowRESTCaptioner.java ## @@ -75,13 +75,13 @@ private static final String LABEL_LANG = "en"; Review comment: Improvement: We should use `eng` as per [ISO 639-2](https://www.loc.gov/standards/iso639-2/php/code_list.php). Wish I knew this when I coded this up. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Standardizing current Object Recognition REST parsers > - > > Key: TIKA-2400 > URL: https://issues.apache.org/jira/browse/TIKA-2400 > Project: Tika > Issue Type: Sub-task > Components: parser >Reporter: Thejan Wijesinghe >Priority: Minor > Fix For: 1.17 > > > # This involves adding apiBaseUris and refactoring current Object Recognition > REST parsers, > # Refactoring dockerfiles related to those parsers. > # Moving the logic related to checking minimum confidence into servers -- This message was sent by Atlassian JIRA (v6.4.14#64029)
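As an illustration of the language-code suggestion above, the JDK can map a two-letter code to its three-letter counterpart through `java.util.Locale`. This is a minimal sketch; the class and method names are made up for this note, and nothing here is taken from the pull request.

```java
import java.util.Locale;

// Hypothetical helper: looks up the three-letter language code for a two-letter one.
final class Iso639Sketch {

    static String iso3(String twoLetterCode) {
        // getISO3Language() returns the three-letter code known to the JDK for this language
        return Locale.forLanguageTag(twoLetterCode).getISO3Language();
    }

    public static void main(String[] args) {
        System.out.println(iso3("en"));
    }
}
```

One caveat: for languages where ISO 639-2 defines both a terminology and a bibliographic code, the JDK returns the terminology form (for example `deu` rather than `ger` for German). For English both forms are `eng`.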
[jira] [Commented] (TIKA-2400) Standardizing current Object Recognition REST parsers
[ https://issues.apache.org/jira/browse/TIKA-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16178402#comment-16178402 ] ASF GitHub Bot commented on TIKA-2400: -- thammegowda commented on a change in pull request #208: Fix for TIKA-2400 Standardizing current Object Recognition REST parsers URL: https://github.com/apache/tika/pull/208#discussion_r140670497 ## File path: tika-parsers/src/main/java/org/apache/tika/parser/recognition/ObjectRecognitionParser.java ## @@ -140,29 +133,17 @@ public synchronized void parse(InputStream stream, ContentHandler handler, Metad for (RecognisedObject object : objects) { if (object instanceof CaptionObject) { if (xhtmlStartVal == null) xhtmlStartVal = "captions"; -LOG.debug("Add {}", object); -String mdValue = String.format(Locale.ENGLISH, "%s (%.5f)", -object.getLabel(), object.getConfidence()); -metadata.add(MD_KEY_IMG_CAP, mdValue); -acceptedObjects.add(object); +String mdVal = String.format(Locale.ENGLISH, "%s (%.5f)", object.getLabel(), object.getConfidence()); +metadata.add(MD_KEY_IMG_CAP, mdVal); xhtmlIds.add(String.valueOf(count++)); } else { if (xhtmlStartVal == null) xhtmlStartVal = "objects"; -if (object.getConfidence() >= minConfidence) { -count++; -LOG.info("Add {}", object); -String mdValue = String.format(Locale.ENGLISH, "%s (%.5f)", -object.getLabel(), object.getConfidence()); -metadata.add(MD_KEY_OBJ_REC, mdValue); -acceptedObjects.add(object); -xhtmlIds.add(object.getId()); -if (count >= topN) { -break; -} -} else { -LOG.warn("Object {} confidence {} less than min {}", object, object.getConfidence(), minConfidence); -} +String mdVal = String.format(Locale.ENGLISH, "%s (%.5f)", object.getLabel(), object.getConfidence()); +metadata.add(MD_KEY_OBJ_REC, mdVal); +xhtmlIds.add(object.getId()); } +LOG.info("Add {}", object); Review comment: > will be great if you can remove String concatenation from RecognisedObject.toString to use StringBuffer or String format If you suggested this for 
performance gain, let's take a deeper look. `RecognisedObject.toString()` does not run in a loop. It's just one long concatenation with `+`. I remember reading somewhere that the JDK can easily optimize such statements, but I couldn't find the source now, so here is a test:

```java
class Main {
    // Time n iterations of a concatenation written with the + operator
    public static long concat(int n) {
        long st = System.nanoTime();
        for (int i = 0; i < n; i++) {
            String s = "a" + "b" + "c" + "d" + "e" + "f" + "g" + "h" + "i" + "j" + "k";
        }
        return System.nanoTime() - st;
    }

    // Time n iterations of the same concatenation written with StringBuilder
    public static long builder(int n) {
        long st = System.nanoTime();
        for (int i = 0; i < n; i++) {
            String s = new StringBuilder().append("a").append("b")
                    .append("c").append("d").append("e").append("f")
                    .append("g").append("h").append("i").append("j")
                    .append("k").toString();
        }
        return System.nanoTime() - st;
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        System.out.printf("Builder Time in ns : %10d\n", builder(n));
        System.out.printf(" Concat Time in ns : %10d\n", concat(n));
    }
}
```

I ran it on https://repl.it/languages/java

```
java version "1.8.0_31"
Java(TM) SE Runtime Environment (build 1.8.0_31-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.31-b07, mixed mode)
Builder Time in ns :   50614748
 Concat Time in ns :    2500615
```

See, it's in fact better!

This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Standardizing current Object Recognition REST parsers > - > > Key: TIKA-2400 > URL: https://issues.apache.org/jira/browse/TIKA-2400 > Project: Tika > Issue Type: Sub-task > Components: parser >Reporter:
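A caveat when reading the timings above: a `+` chain made only of string literals is a compile-time constant expression, so the compiler can fold it into a single literal and the `concat` loop may not concatenate anything at run time. The sketch below (hypothetical class and method names, written for this note) shows the difference between a constant concatenation and one that involves a run-time value; it compares references with `==` deliberately, to expose identity rather than content.

```java
// Hypothetical demonstration, not part of the pull request.
final class ConcatFoldingSketch {

    // All operands are literals, so the expression is a compile-time constant
    // and refers to the same interned String object as the literal "abc".
    static boolean constantConcatIsInterned() {
        String folded = "a" + "b" + "c";
        return folded == "abc"; // reference comparison on purpose
    }

    // The prefix is only known at run time, so a new String is created per call.
    static String runtimeConcat(String prefix) {
        return prefix + "b" + "c";
    }

    public static void main(String[] args) {
        System.out.println(constantConcatIsInterned());
        System.out.println(runtimeConcat("a") == "abc"); // reference comparison on purpose
    }
}
```

A comparison closer to `RecognisedObject.toString()` would concatenate run-time values (fields or parameters) in both variants; no timings for that case are claimed here.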
[jira] [Commented] (TIKA-2400) Standardizing current Object Recognition REST parsers
[ https://issues.apache.org/jira/browse/TIKA-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16178399#comment-16178399 ] ASF GitHub Bot commented on TIKA-2400: -- thammegowda commented on a change in pull request #208: Fix for TIKA-2400 Standardizing current Object Recognition REST parsers URL: https://github.com/apache/tika/pull/208#discussion_r140670009 ## File path: tika-parsers/src/main/java/org/apache/tika/parser/recognition/ObjectRecognitionParser.java ## @@ -140,29 +133,17 @@ public synchronized void parse(InputStream stream, ContentHandler handler, Metad for (RecognisedObject object : objects) { if (object instanceof CaptionObject) { if (xhtmlStartVal == null) xhtmlStartVal = "captions"; -LOG.debug("Add {}", object); -String mdValue = String.format(Locale.ENGLISH, "%s (%.5f)", -object.getLabel(), object.getConfidence()); -metadata.add(MD_KEY_IMG_CAP, mdValue); -acceptedObjects.add(object); +String mdVal = String.format(Locale.ENGLISH, "%s (%.5f)", object.getLabel(), object.getConfidence()); Review comment: > would be great if we can store object.getLabel() and object.getConfidence() into separate metadata fields. IMHO, it complicates the metadata key-values. If we split them, we get two arrays, one of confidences and one of labels, and users then have to match labels with confidences by array index. One solution to this problem is still an open issue in Tika, i.e., supporting complex data structures like JSON for metadata. Until then, we have the full info captured in the XHTML content, so it should be fine. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Standardizing current Object Recognition REST parsers > - > > Key: TIKA-2400 > URL: https://issues.apache.org/jira/browse/TIKA-2400 > Project: Tika > Issue Type: Sub-task > Components: parser >Reporter: Thejan Wijesinghe >Priority: Minor > Fix For: 1.17 > > > # This involves adding apiBaseUris and refactoring current Object Recognition > REST parsers, > # Refactoring dockerfiles related to those parsers. > # Moving the logic related to checking minimum confidence into servers -- This message was sent by Atlassian JIRA (v6.4.14#64029)
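To illustrate the trade-off discussed in the comment above: a combined `label (confidence)` value keeps each label next to its score, and a consumer can still split it when needed. The sketch below is an assumption-laden illustration, not Tika code; the class, its methods, and the regular expression are hypothetical, and only the `"%s (%.5f)"` format string comes from the diff.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: encodes and decodes the combined "label (confidence)" value.
final class LabelConfidenceSketch {

    // Greedy label, then a space and a parenthesised decimal at the very end
    private static final Pattern VALUE = Pattern.compile("^(.*) \\((\\d+\\.\\d+)\\)$");

    static String encode(String label, double confidence) {
        // Same format string as in the diff above
        return String.format(Locale.ENGLISH, "%s (%.5f)", label, confidence);
    }

    private static Matcher match(String mdVal) {
        Matcher m = VALUE.matcher(mdVal);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unexpected metadata value: " + mdVal);
        }
        return m;
    }

    static String labelOf(String mdVal) {
        return match(mdVal).group(1);
    }

    static double confidenceOf(String mdVal) {
        return Double.parseDouble(match(mdVal).group(2));
    }

    public static void main(String[] args) {
        String v = encode("cat", 0.5);
        System.out.println(v + " -> " + labelOf(v) + ", " + confidenceOf(v));
    }
}
```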
[jira] [Commented] (TIKA-2400) Standardizing current Object Recognition REST parsers
[ https://issues.apache.org/jira/browse/TIKA-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16178400#comment-16178400 ] ASF GitHub Bot commented on TIKA-2400: -- thammegowda commented on a change in pull request #208: Fix for TIKA-2400 Standardizing current Object Recognition REST parsers URL: https://github.com/apache/tika/pull/208#discussion_r140670568 ## File path: tika-parsers/src/main/java/org/apache/tika/parser/recognition/tf/TensorflowRESTRecogniser.java ## @@ -73,19 +74,27 @@ /** * Maximum buffer size for image */ -private static final String LABEL_LANG = "en"; +protected static final String LABEL_LANG = "en"; Review comment: Also in the future, wherever you want to use language code, please use ISO 639-2, which is `eng` for English. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Standardizing current Object Recognition REST parsers > - > > Key: TIKA-2400 > URL: https://issues.apache.org/jira/browse/TIKA-2400 > Project: Tika > Issue Type: Sub-task > Components: parser >Reporter: Thejan Wijesinghe >Priority: Minor > Fix For: 1.17 > > > # This involves adding apiBaseUris and refactoring current Object Recognition > REST parsers, > # Refactoring dockerfiles related to those parsers. > # Moving the logic related to checking minimum confidence into servers -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (TIKA-2400) Standardizing current Object Recognition REST parsers
[ https://issues.apache.org/jira/browse/TIKA-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16176010#comment-16176010 ] ASF GitHub Bot commented on TIKA-2400: -- ThejanW commented on a change in pull request #208: Fix for TIKA-2400 Standardizing current Object Recognition REST parsers URL: https://github.com/apache/tika/pull/208#discussion_r140421441 ## File path: tika-parsers/src/main/java/org/apache/tika/parser/recognition/ObjectRecognitionParser.java ## @@ -83,12 +83,6 @@ public int compare(RecognisedObject o1, RecognisedObject o2) { } }; -@Field -private double minConfidence = 0.05; Review comment: yes, minConfidence and topN can be set through CLI/ Tika Config since we have defined them in REST clients. In TensorflowRESTVideoRecogniser, you're extending TensorflowRESTRecogniser, that's why I have made some of the fields in TensorflowRESTRecogniser as protected(we need them there to derive apiUri and healthUri). This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Standardizing current Object Recognition REST parsers > - > > Key: TIKA-2400 > URL: https://issues.apache.org/jira/browse/TIKA-2400 > Project: Tika > Issue Type: Sub-task > Components: parser >Reporter: Thejan Wijesinghe >Priority: Minor > Fix For: 1.17 > > > # This involves adding apiBaseUris and refactoring current Object Recognition > REST parsers, > # Refactoring dockerfiles related to those parsers. > # Moving the logic related to checking minimum confidence into servers -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (TIKA-2400) Standardizing current Object Recognition REST parsers
[ https://issues.apache.org/jira/browse/TIKA-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16175997#comment-16175997 ] ASF GitHub Bot commented on TIKA-2400: -- ThejanW commented on a change in pull request #208: Fix for TIKA-2400 Standardizing current Object Recognition REST parsers URL: https://github.com/apache/tika/pull/208#discussion_r140420790 ## File path: tika-parsers/src/main/java/org/apache/tika/parser/recognition/ObjectRecognitionParser.java ## @@ -83,12 +83,6 @@ public int compare(RecognisedObject o1, RecognisedObject o2) { } }; -@Field -private double minConfidence = 0.05; Review comment: please see, https://github.com/ThejanW/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/captioning/tf/TensorflowRESTCaptioner.java#L77-L84 https://github.com/ThejanW/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/recognition/tf/TensorflowRESTRecogniser.java#L79-L86 https://github.com/ThejanW/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/recognition/tf/TensorflowRESTVideoRecogniser.java#L71-L72 This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Standardizing current Object Recognition REST parsers > - > > Key: TIKA-2400 > URL: https://issues.apache.org/jira/browse/TIKA-2400 > Project: Tika > Issue Type: Sub-task > Components: parser >Reporter: Thejan Wijesinghe >Priority: Minor > Fix For: 1.17 > > > # This involves adding apiBaseUris and refactoring current Object Recognition > REST parsers, > # Refactoring dockerfiles related to those parsers. > # Moving the logic related to checking minimum confidence into servers -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (TIKA-2400) Standardizing current Object Recognition REST parsers
[ https://issues.apache.org/jira/browse/TIKA-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16175996#comment-16175996 ] ASF GitHub Bot commented on TIKA-2400: -- ThejanW commented on a change in pull request #208: Fix for TIKA-2400 Standardizing current Object Recognition REST parsers URL: https://github.com/apache/tika/pull/208#discussion_r140420704 ## File path: tika-parsers/src/main/java/org/apache/tika/parser/recognition/ObjectRecognitionParser.java ## @@ -83,12 +83,6 @@ public int compare(RecognisedObject o1, RecognisedObject o2) { } }; -@Field -private double minConfidence = 0.05; Review comment: Sorry, I misunderstood your question. I removed minConfidence and topN from ObjectRecognitionParser because it does not need to hold such client-specific parameters. Those fields belong in the specific client; ObjectRecognitionParser is only used to process objects from the respective REST client and put them in the XHTML. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Standardizing current Object Recognition REST parsers > - > > Key: TIKA-2400 > URL: https://issues.apache.org/jira/browse/TIKA-2400 > Project: Tika > Issue Type: Sub-task > Components: parser >Reporter: Thejan Wijesinghe >Priority: Minor > Fix For: 1.17 > > > # This involves adding apiBaseUris and refactoring current Object Recognition > REST parsers, > # Refactoring dockerfiles related to those parsers. > # Moving the logic related to checking minimum confidence into servers -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (TIKA-2400) Standardizing current Object Recognition REST parsers
[ https://issues.apache.org/jira/browse/TIKA-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16175914#comment-16175914 ] ASF GitHub Bot commented on TIKA-2400: -- smadha commented on a change in pull request #208: Fix for TIKA-2400 Standardizing current Object Recognition REST parsers URL: https://github.com/apache/tika/pull/208#discussion_r140411633 ## File path: tika-parsers/src/main/java/org/apache/tika/parser/recognition/tf/TensorflowRESTRecogniser.java ## @@ -73,19 +74,27 @@ /** * Maximum buffer size for image */ -private static final String LABEL_LANG = "en"; +protected static final String LABEL_LANG = "en"; @Field -private URI apiUri = URI.create("http://localhost:8764/inception/v4/classify?topk=10"); +protected URI apiBaseUri = URI.create("http://localhost:8764/inception/v4"); + +@Field +protected int topN = 2; + @Field -private URI healthUri = URI.create("http://localhost:8764/inception/v4/ping"); +protected double minConfidence = 0.015; + +protected URI apiUri; + +protected URI healthUri; Review comment: You can still keep a default value by extracting String constants and deriving a default value too. No big deal though. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Standardizing current Object Recognition REST parsers > - > > Key: TIKA-2400 > URL: https://issues.apache.org/jira/browse/TIKA-2400 > Project: Tika > Issue Type: Sub-task > Components: parser >Reporter: Thejan Wijesinghe >Priority: Minor > Fix For: 1.17 > > > # This involves adding apiBaseUris and refactoring current Object Recognition > REST parsers, > # Refactoring dockerfiles related to those parsers. > # Moving the logic related to checking minimum confidence into servers -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (TIKA-2400) Standardizing current Object Recognition REST parsers
[ https://issues.apache.org/jira/browse/TIKA-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16175913#comment-16175913 ] ASF GitHub Bot commented on TIKA-2400: -- smadha commented on a change in pull request #208: Fix for TIKA-2400 Standardizing current Object Recognition REST parsers URL: https://github.com/apache/tika/pull/208#discussion_r139325811 ## File path: tika-parsers/src/main/java/org/apache/tika/parser/recognition/ObjectRecognitionParser.java ## @@ -140,29 +133,17 @@ public synchronized void parse(InputStream stream, ContentHandler handler, Metad for (RecognisedObject object : objects) { if (object instanceof CaptionObject) { if (xhtmlStartVal == null) xhtmlStartVal = "captions"; -LOG.debug("Add {}", object); -String mdValue = String.format(Locale.ENGLISH, "%s (%.5f)", -object.getLabel(), object.getConfidence()); -metadata.add(MD_KEY_IMG_CAP, mdValue); -acceptedObjects.add(object); +String mdVal = String.format(Locale.ENGLISH, "%s (%.5f)", object.getLabel(), object.getConfidence()); +metadata.add(MD_KEY_IMG_CAP, mdVal); xhtmlIds.add(String.valueOf(count++)); } else { if (xhtmlStartVal == null) xhtmlStartVal = "objects"; -if (object.getConfidence() >= minConfidence) { -count++; -LOG.info("Add {}", object); -String mdValue = String.format(Locale.ENGLISH, "%s (%.5f)", -object.getLabel(), object.getConfidence()); -metadata.add(MD_KEY_OBJ_REC, mdValue); -acceptedObjects.add(object); -xhtmlIds.add(object.getId()); -if (count >= topN) { -break; -} -} else { -LOG.warn("Object {} confidence {} less than min {}", object, object.getConfidence(), minConfidence); -} +String mdVal = String.format(Locale.ENGLISH, "%s (%.5f)", object.getLabel(), object.getConfidence()); +metadata.add(MD_KEY_OBJ_REC, mdVal); +xhtmlIds.add(object.getId()); } +LOG.info("Add {}", object); Review comment: - [ ] Thanks for following good logging practice of using `{}`. 
It would be great if you could remove the String concatenation from [`RecognisedObject.toString`](https://github.com/ThejanW/tika/blob/92c65e0a43e7f09a0566bec34f352314dffe5def/tika-parsers/src/main/java/org/apache/tika/parser/recognition/RecognisedObject.java#L84-L90) and use `StringBuffer` or `String.format` instead. You can do it through the IDE with a few clicks. Thanks in advance for the cleanup. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Standardizing current Object Recognition REST parsers > - > > Key: TIKA-2400 > URL: https://issues.apache.org/jira/browse/TIKA-2400 > Project: Tika > Issue Type: Sub-task > Components: parser >Reporter: Thejan Wijesinghe >Priority: Minor > Fix For: 1.17 > > > # This involves adding apiBaseUris and refactoring current Object Recognition > REST parsers, > # Refactoring dockerfiles related to those parsers. > # Moving the logic related to checking minimum confidence into servers -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (TIKA-2400) Standardizing current Object Recognition REST parsers
[ https://issues.apache.org/jira/browse/TIKA-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16175911#comment-16175911 ] ASF GitHub Bot commented on TIKA-2400: -- smadha commented on a change in pull request #208: Fix for TIKA-2400 Standardizing current Object Recognition REST parsers URL: https://github.com/apache/tika/pull/208#discussion_r140411401 ## File path: tika-parsers/src/main/java/org/apache/tika/parser/recognition/ObjectRecognitionParser.java ## @@ -83,12 +83,6 @@ public int compare(RecognisedObject o1, RecognisedObject o2) { } }; -@Field -private double minConfidence = 0.05; Review comment: @ThejanW For my understanding `minConfidence` and `topN` can still be tweaked through Tika config / CLI options right? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Standardizing current Object Recognition REST parsers > - > > Key: TIKA-2400 > URL: https://issues.apache.org/jira/browse/TIKA-2400 > Project: Tika > Issue Type: Sub-task > Components: parser >Reporter: Thejan Wijesinghe >Priority: Minor > Fix For: 1.17 > > > # This involves adding apiBaseUris and refactoring current Object Recognition > REST parsers, > # Refactoring dockerfiles related to those parsers. > # Moving the logic related to checking minimum confidence into servers -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (TIKA-2400) Standardizing current Object Recognition REST parsers
[ https://issues.apache.org/jira/browse/TIKA-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16173466#comment-16173466 ] ASF GitHub Bot commented on TIKA-2400: -- ThejanW commented on a change in pull request #208: Fix for TIKA-2400 Standardizing current Object Recognition REST parsers URL: https://github.com/apache/tika/pull/208#discussion_r140024736 ## File path: tika-parsers/src/main/java/org/apache/tika/parser/recognition/tf/TensorflowRESTRecogniser.java ## @@ -73,19 +74,27 @@ /** * Maximum buffer size for image */ -private static final String LABEL_LANG = "en"; +protected static final String LABEL_LANG = "en"; @Field -private URI apiUri = URI.create("http://localhost:8764/inception/v4/classify?topk=10"); +protected URI apiBaseUri = URI.create("http://localhost:8764/inception/v4"); + +@Field +protected int topN = 2; + @Field -private URI healthUri = URI.create("http://localhost:8764/inception/v4/ping"); +protected double minConfidence = 0.015; + +protected URI apiUri; + +protected URI healthUri; Review comment: I have defined an apiBaseUri and am following that practice in all REST clients. Using that apiBaseUri, we can derive healthUri and apiUri, see https://github.com/ThejanW/tika/blob/2a81e975e48f2d1e051920725221fc5341e6db5f/tika-parsers/src/main/java/org/apache/tika/parser/recognition/tf/TensorflowRESTRecogniser.java#L111-L112 This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Standardizing current Object Recognition REST parsers > - > > Key: TIKA-2400 > URL: https://issues.apache.org/jira/browse/TIKA-2400 > Project: Tika > Issue Type: Sub-task > Components: parser >Reporter: Thejan Wijesinghe >Priority: Minor > Fix For: 1.17 > > > # This involves adding apiBaseUris and refactoring current Object Recognition > REST parsers, > # Refactoring dockerfiles related to those parsers. > # Moving the logic related to checking minimum confidence into servers -- This message was sent by Atlassian JIRA (v6.4.14#64029)
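The derivation described in the comment above could look roughly like the following. This is a minimal sketch, not the code at the linked lines: the class name is hypothetical, the `/ping` and `/classify` path segments are taken from the URIs quoted in this thread, and the query parameter names `topn` and `min_confidence` are assumptions.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical sketch: one configurable base URI, two derived endpoint URIs.
final class RestEndpointsSketch {
    final URI healthUri;
    final URI apiUri;

    RestEndpointsSketch(URI apiBaseUri, int topN, double minConfidence) {
        // Health check endpoint derived from the base URI
        healthUri = URI.create(apiBaseUri + "/ping");
        // Query parameter names are assumed for this sketch
        apiUri = URI.create(apiBaseUri + String.format(Locale.ROOT,
                "/classify?topn=%d&min_confidence=%f", topN, minConfidence));
    }

    public static void main(String[] args) {
        RestEndpointsSketch e = new RestEndpointsSketch(
                URI.create("http://localhost:8764/inception/v4"), 2, 0.015);
        System.out.println(e.healthUri);
        System.out.println(e.apiUri);
    }
}
```

With this layout only the base URI, `topN`, and `minConfidence` need to be configurable; the two endpoint URIs follow from them.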
[jira] [Commented] (TIKA-2400) Standardizing current Object Recognition REST parsers
[ https://issues.apache.org/jira/browse/TIKA-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16173452#comment-16173452 ] ASF GitHub Bot commented on TIKA-2400: -- ThejanW commented on a change in pull request #208: Fix for TIKA-2400 Standardizing current Object Recognition REST parsers URL: https://github.com/apache/tika/pull/208#discussion_r140022863 ## File path: tika-parsers/src/main/java/org/apache/tika/parser/recognition/ObjectRecognitionParser.java ## @@ -83,12 +83,6 @@ public int compare(RecognisedObject o1, RecognisedObject o2) { } }; -@Field -private double minConfidence = 0.05; Review comment: Hey, good catch! It's not easy to maintain comments like these (https://github.com/ThejanW/tika/blob/92c65e0a43e7f09a0566bec34f352314dffe5def/tika-parsers/src/main/java/org/apache/tika/parser/recognition/ObjectRecognitionParser.java#L49-L70). Future developers will miss them too. Will update them ASAP :+1: This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Standardizing current Object Recognition REST parsers > - > > Key: TIKA-2400 > URL: https://issues.apache.org/jira/browse/TIKA-2400 > Project: Tika > Issue Type: Sub-task > Components: parser >Reporter: Thejan Wijesinghe >Priority: Minor > Fix For: 1.17 > > > # This involves adding apiBaseUris and refactoring current Object Recognition > REST parsers, > # Refactoring dockerfiles related to those parsers. > # Moving the logic related to checking minimum confidence into servers -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (TIKA-2400) Standardizing current Object Recognition REST parsers
[ https://issues.apache.org/jira/browse/TIKA-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16173438#comment-16173438 ] ASF GitHub Bot commented on TIKA-2400: -- ThejanW commented on a change in pull request #208: Fix for TIKA-2400 Standardizing current Object Recognition REST parsers URL: https://github.com/apache/tika/pull/208#discussion_r140021556 ## File path: tika-parsers/src/main/java/org/apache/tika/parser/recognition/ObjectRecognitionParser.java ## @@ -83,12 +83,6 @@ public int compare(RecognisedObject o1, RecognisedObject o2) { } }; -@Field -private double minConfidence = 0.05; Review comment: Yeah, I have moved the minConfidence logic to the REST servers. It is kind of odd to ask the backend for topk objects, then filter them again in the client by minConfidence and select topN objects; that is just too much logic in the client. We can directly ask the backend for the topN objects that have a confidence greater than minConfidence: fewer iterations and a simpler client :100: This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Standardizing current Object Recognition REST parsers > - > > Key: TIKA-2400 > URL: https://issues.apache.org/jira/browse/TIKA-2400 > Project: Tika > Issue Type: Sub-task > Components: parser >Reporter: Thejan Wijesinghe >Priority: Minor > Fix For: 1.17 > > > # This involves adding apiBaseUris and refactoring current Object Recognition > REST parsers, > # Refactoring dockerfiles related to those parsers. > # Moving the logic related to checking minimum confidence into servers -- This message was sent by Atlassian JIRA (v6.4.14#64029)
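The selection that the comment above proposes to run on the server side can be sketched as a single filter, sort, and limit pass. The servers' actual implementation is not shown in this thread, so everything below is hypothetical: the class and method names are invented, Java is used only for illustration, and the `>=` comparison mirrors the client check removed in this pull request.

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch of "top N results at or above a minimum confidence".
final class TopNFilterSketch {

    // Minimal stand-in for a recognised object: a label and its confidence
    static final class Scored {
        final String label;
        final double confidence;

        Scored(String label, double confidence) {
            this.label = label;
            this.confidence = confidence;
        }
    }

    static List<Scored> select(List<Scored> all, int topN, double minConfidence) {
        return all.stream()
                .filter(s -> s.confidence >= minConfidence)   // drop low-confidence results
                .sorted(Comparator.comparingDouble((Scored s) -> s.confidence).reversed())
                .limit(topN)                                   // keep the N best
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Scored> picked = select(List.of(
                new Scored("cat", 0.9), new Scored("dog", 0.01),
                new Scored("fox", 0.5), new Scored("owl", 0.3)), 2, 0.015);
        picked.forEach(s -> System.out.println(s.label + " " + s.confidence));
    }
}
```

Doing this once where the scores are produced means the client only has to pass `topN` and `minConfidence` along with the request and render whatever comes back.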
[jira] [Commented] (TIKA-2400) Standardizing current Object Recognition REST parsers
[ https://issues.apache.org/jira/browse/TIKA-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16172812#comment-16172812 ] ASF GitHub Bot commented on TIKA-2400: -- smadha commented on a change in pull request #208: Fix for TIKA-2400 Standardizing current Object Recognition REST parsers URL: https://github.com/apache/tika/pull/208#discussion_r139329533 ## File path: tika-parsers/src/main/java/org/apache/tika/parser/recognition/ObjectRecognitionParser.java ## @@ -83,12 +83,6 @@ public int compare(RecognisedObject o1, RecognisedObject o2) { } }; -@Field -private double minConfidence = 0.05; Review comment: If you plan to put these controls in REST URI then please leave it somewhere in comments and wiki too. Also, this needs to be updated in comments too - https://github.com/ThejanW/tika/blob/92c65e0a43e7f09a0566bec34f352314dffe5def/tika-parsers/src/main/java/org/apache/tika/parser/recognition/ObjectRecognitionParser.java#L60 This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Standardizing current Object Recognition REST parsers > - > > Key: TIKA-2400 > URL: https://issues.apache.org/jira/browse/TIKA-2400 > Project: Tika > Issue Type: Sub-task > Components: parser >Reporter: Thejan Wijesinghe >Priority: Minor > Fix For: 1.17 > > > # This involves adding apiBaseUris and refactoring current Object Recognition > REST parsers, > # Refactoring dockerfiles related to those parsers. > # Moving the logic related to checking minimum confidence into servers -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (TIKA-2400) Standardizing current Object Recognition REST parsers
[ https://issues.apache.org/jira/browse/TIKA-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16172806#comment-16172806 ] ASF GitHub Bot commented on TIKA-2400: -- smadha commented on a change in pull request #208: Fix for TIKA-2400 Standardizing current Object Recognition REST parsers URL: https://github.com/apache/tika/pull/208#discussion_r139325945 ## File path: tika-parsers/src/main/java/org/apache/tika/parser/recognition/tf/TensorflowRESTRecogniser.java ## @@ -73,19 +74,27 @@ /** * Maximum buffer size for image */ -private static final String LABEL_LANG = "en"; +protected static final String LABEL_LANG = "en"; Review comment: - [ ] Will be great if you can put the reason to make it `protected` in comments so no one changes it in future. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Standardizing current Object Recognition REST parsers > - > > Key: TIKA-2400 > URL: https://issues.apache.org/jira/browse/TIKA-2400 > Project: Tika > Issue Type: Sub-task > Components: parser >Reporter: Thejan Wijesinghe >Priority: Minor > Fix For: 1.17 > > > # This involves adding apiBaseUris and refactoring current Object Recognition > REST parsers, > # Refactoring dockerfiles related to those parsers. > # Moving the logic related to checking minimum confidence into servers -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (TIKA-2400) Standardizing current Object Recognition REST parsers
[ https://issues.apache.org/jira/browse/TIKA-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16172799#comment-16172799 ] ASF GitHub Bot commented on TIKA-2400: -- smadha commented on a change in pull request #208: Fix for TIKA-2400 Standardizing current Object Recognition REST parsers URL: https://github.com/apache/tika/pull/208#discussion_r139326146 ## File path: tika-parsers/src/main/java/org/apache/tika/parser/recognition/ObjectRecognitionParser.java ## @@ -183,4 +164,4 @@ public synchronized void parse(InputStream stream, ContentHandler handler, Metad metadata.add("no.objects", Boolean.TRUE.toString()); } } -} Review comment: - [ ] Extra line break This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Standardizing current Object Recognition REST parsers > - > > Key: TIKA-2400 > URL: https://issues.apache.org/jira/browse/TIKA-2400 > Project: Tika > Issue Type: Sub-task > Components: parser >Reporter: Thejan Wijesinghe >Priority: Minor > Fix For: 1.17 > > > # This involves adding apiBaseUris and refactoring current Object Recognition > REST parsers, > # Refactoring dockerfiles related to those parsers. > # Moving the logic related to checking minimum confidence into servers -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (TIKA-2400) Standardizing current Object Recognition REST parsers
[ https://issues.apache.org/jira/browse/TIKA-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16172796#comment-16172796 ] ASF GitHub Bot commented on TIKA-2400: -- smadha commented on a change in pull request #208: Fix for TIKA-2400 Standardizing current Object Recognition REST parsers URL: https://github.com/apache/tika/pull/208#discussion_r139325333 ## File path: tika-parsers/src/main/java/org/apache/tika/parser/recognition/ObjectRecognitionParser.java ## @@ -83,12 +83,6 @@ public int compare(RecognisedObject o1, RecognisedObject o2) { } }; -@Field -private double minConfidence = 0.05; Review comment: - [ ] Is there any specific reason to remove `minConfidence` and `topN`? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Standardizing current Object Recognition REST parsers > - > > Key: TIKA-2400 > URL: https://issues.apache.org/jira/browse/TIKA-2400 > Project: Tika > Issue Type: Sub-task > Components: parser >Reporter: Thejan Wijesinghe >Priority: Minor > Fix For: 1.17 > > > # This involves adding apiBaseUris and refactoring current Object Recognition > REST parsers, > # Refactoring dockerfiles related to those parsers. > # Moving the logic related to checking minimum confidence into servers -- This message was sent by Atlassian JIRA (v6.4.14#64029)
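For readers following the thread, the client-side filtering that this diff removes (a `minConfidence` threshold plus a `topN` cap) can be sketched roughly as below. This is a simplified, self-contained illustration and not the Tika code: `Rec` is a hypothetical stand-in for `RecognisedObject`, and the sort is added here only so the sketch does not depend on the caller's ordering.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class ConfidenceFilterSketch {

    /** Hypothetical stand-in for RecognisedObject: only a label and a confidence. */
    static final class Rec {
        final String label;
        final double confidence;

        Rec(String label, double confidence) {
            this.label = label;
            this.confidence = confidence;
        }
    }

    /** Keeps at most topN objects whose confidence is >= minConfidence, best first. */
    static List<Rec> filter(List<Rec> objects, double minConfidence, int topN) {
        List<Rec> sorted = new ArrayList<>(objects);
        // Highest confidence first, so the topN cap keeps the best candidates.
        sorted.sort(Comparator.comparingDouble((Rec r) -> r.confidence).reversed());

        List<Rec> accepted = new ArrayList<>();
        for (Rec r : sorted) {
            if (accepted.size() >= topN) {
                break; // cap reached
            }
            if (r.confidence >= minConfidence) {
                accepted.add(r);
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        List<Rec> input = new ArrayList<>();
        input.add(new Rec("cat", 0.9));
        input.add(new Rec("dog", 0.01));
        input.add(new Rec("car", 0.5));
        for (Rec r : filter(input, 0.05, 2)) {
            System.out.println(r.label + " " + r.confidence);
        }
    }
}
```

According to the issue description, this check is meant to move into the REST servers, which is presumably why the fields disappear from the parser.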
[jira] [Commented] (TIKA-2400) Standardizing current Object Recognition REST parsers
[ https://issues.apache.org/jira/browse/TIKA-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16172804#comment-16172804 ] ASF GitHub Bot commented on TIKA-2400: -- smadha commented on a change in pull request #208: Fix for TIKA-2400 Standardizing current Object Recognition REST parsers URL: https://github.com/apache/tika/pull/208#discussion_r139325490 ## File path: tika-parsers/src/main/java/org/apache/tika/parser/recognition/ObjectRecognitionParser.java ## @@ -140,29 +133,17 @@ public synchronized void parse(InputStream stream, ContentHandler handler, Metad for (RecognisedObject object : objects) { if (object instanceof CaptionObject) { if (xhtmlStartVal == null) xhtmlStartVal = "captions"; -LOG.debug("Add {}", object); -String mdValue = String.format(Locale.ENGLISH, "%s (%.5f)", -object.getLabel(), object.getConfidence()); -metadata.add(MD_KEY_IMG_CAP, mdValue); -acceptedObjects.add(object); +String mdVal = String.format(Locale.ENGLISH, "%s (%.5f)", object.getLabel(), object.getConfidence()); Review comment: - [ ] would be great if we can store `object.getLabel()` and `object.getConfidence()` into separate metadata fields. Like creating a new key `MD_KEY_CAP_CONFIDENCE` for storing confidence, instead of wrapping them both in a single `String mdVal`. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Standardizing current Object Recognition REST parsers > - > > Key: TIKA-2400 > URL: https://issues.apache.org/jira/browse/TIKA-2400 > Project: Tika > Issue Type: Sub-task > Components: parser >Reporter: Thejan Wijesinghe >Priority: Minor > Fix For: 1.17 > > > # This involves adding apiBaseUris and refactoring current Object Recognition > REST parsers, > # Refactoring dockerfiles related to those parsers. 
> # Moving the logic related to checking minimum confidence into servers -- This message was sent by Atlassian JIRA (v6.4.14#64029)
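The suggestion above, storing the label and the confidence under separate metadata keys, could look roughly like the following sketch. It is self-contained and uses a plain `Map<String, List<String>>` as a hypothetical stand-in for Tika's `Metadata`; the key string values are assumptions, and `MD_KEY_CAP_CONFIDENCE` is only the name proposed in the review comment, not an existing constant.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class SeparateMetadataSketch {

    // Key string values are assumptions made for this illustration.
    static final String MD_KEY_IMG_CAP = "CAPTION";
    // Name proposed in the review; hypothetical, not part of the PR.
    static final String MD_KEY_CAP_CONFIDENCE = "CAPTION_CONFIDENCE";

    /** Adds the label and its confidence as two parallel multi-valued entries. */
    static void addCaption(Map<String, List<String>> metadata, String label, double confidence) {
        metadata.computeIfAbsent(MD_KEY_IMG_CAP, k -> new ArrayList<>()).add(label);
        metadata.computeIfAbsent(MD_KEY_CAP_CONFIDENCE, k -> new ArrayList<>())
                .add(String.format(Locale.ENGLISH, "%.5f", confidence));
    }

    public static void main(String[] args) {
        Map<String, List<String>> metadata = new LinkedHashMap<>();
        addCaption(metadata, "a dog on a beach", 0.5);
        addCaption(metadata, "a dog running", 0.25);
        System.out.println(metadata);
    }
}
```

One trade-off of this layout is that consumers have to pair the two value lists by index, whereas the single formatted string keeps each label and confidence together.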
[jira] [Commented] (TIKA-2400) Standardizing current Object Recognition REST parsers
[ https://issues.apache.org/jira/browse/TIKA-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16172802#comment-16172802 ] ASF GitHub Bot commented on TIKA-2400: -- smadha commented on a change in pull request #208: Fix for TIKA-2400 Standardizing current Object Recognition REST parsers URL: https://github.com/apache/tika/pull/208#discussion_r139325427 ## File path: tika-parsers/src/main/java/org/apache/tika/parser/recognition/ObjectRecognitionParser.java ## @@ -140,29 +133,17 @@ public synchronized void parse(InputStream stream, ContentHandler handler, Metad for (RecognisedObject object : objects) { if (object instanceof CaptionObject) { if (xhtmlStartVal == null) xhtmlStartVal = "captions"; -LOG.debug("Add {}", object); -String mdValue = String.format(Locale.ENGLISH, "%s (%.5f)", -object.getLabel(), object.getConfidence()); -metadata.add(MD_KEY_IMG_CAP, mdValue); -acceptedObjects.add(object); +String mdVal = String.format(Locale.ENGLISH, "%s (%.5f)", object.getLabel(), object.getConfidence()); Review comment: - [ ] Can we please rename `mdVal` to something more related to the value of this variable, like `imageLabelAndConfidence` or `objectLabelAndConfidence`? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Standardizing current Object Recognition REST parsers > - > > Key: TIKA-2400 > URL: https://issues.apache.org/jira/browse/TIKA-2400 > Project: Tika > Issue Type: Sub-task > Components: parser >Reporter: Thejan Wijesinghe >Priority: Minor > Fix For: 1.17 > > > # This involves adding apiBaseUris and refactoring current Object Recognition > REST parsers, > # Refactoring dockerfiles related to those parsers. > # Moving the logic related to checking minimum confidence into servers -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (TIKA-2400) Standardizing current Object Recognition REST parsers
[ https://issues.apache.org/jira/browse/TIKA-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16172790#comment-16172790 ] ASF GitHub Bot commented on TIKA-2400: -- smadha commented on a change in pull request #208: Fix for TIKA-2400 Standardizing current Object Recognition REST parsers URL: https://github.com/apache/tika/pull/208#discussion_r139325563 ## File path: tika-parsers/src/main/java/org/apache/tika/parser/recognition/ObjectRecognitionParser.java ## @@ -140,29 +133,17 @@ public synchronized void parse(InputStream stream, ContentHandler handler, Metad for (RecognisedObject object : objects) { if (object instanceof CaptionObject) { if (xhtmlStartVal == null) xhtmlStartVal = "captions"; -LOG.debug("Add {}", object); -String mdValue = String.format(Locale.ENGLISH, "%s (%.5f)", -object.getLabel(), object.getConfidence()); -metadata.add(MD_KEY_IMG_CAP, mdValue); -acceptedObjects.add(object); +String mdVal = String.format(Locale.ENGLISH, "%s (%.5f)", object.getLabel(), object.getConfidence()); +metadata.add(MD_KEY_IMG_CAP, mdVal); xhtmlIds.add(String.valueOf(count++)); } else { if (xhtmlStartVal == null) xhtmlStartVal = "objects"; -if (object.getConfidence() >= minConfidence) { -count++; -LOG.info("Add {}", object); -String mdValue = String.format(Locale.ENGLISH, "%s (%.5f)", -object.getLabel(), object.getConfidence()); -metadata.add(MD_KEY_OBJ_REC, mdValue); -acceptedObjects.add(object); -xhtmlIds.add(object.getId()); -if (count >= topN) { -break; -} -} else { -LOG.warn("Object {} confidence {} less than min {}", object, object.getConfidence(), minConfidence); -} +String mdVal = String.format(Locale.ENGLISH, "%s (%.5f)", object.getLabel(), object.getConfidence()); Review comment: - [ ] Same comments as above: variable name and separate metadata key. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. 
[jira] [Commented] (TIKA-2400) Standardizing current Object Recognition REST parsers
[ https://issues.apache.org/jira/browse/TIKA-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16172797#comment-16172797 ] ASF GitHub Bot commented on TIKA-2400: -- smadha commented on a change in pull request #208: Fix for TIKA-2400 Standardizing current Object Recognition REST parsers URL: https://github.com/apache/tika/pull/208#discussion_r139326190 ## File path: tika-parsers/src/main/java/org/apache/tika/parser/recognition/tf/TensorflowRESTVideoRecogniser.java ## @@ -17,63 +17,91 @@ package org.apache.tika.parser.recognition.tf; +import java.io.ByteArrayOutputStream; import java.io.IOException; import java.io.InputStream; import java.net.URI; -import java.util.Collections; +import java.util.Locale; +import java.util.Map; import java.util.Set; +import java.util.Collections; +import java.util.HashSet; import javax.ws.rs.core.UriBuilder; +import org.apache.http.HttpResponse; +import org.apache.http.client.methods.HttpGet; +import org.apache.http.client.methods.HttpPost; +import org.apache.http.entity.ByteArrayEntity; +import org.apache.http.impl.client.DefaultHttpClient; import org.apache.tika.Tika; import org.apache.tika.config.Field; +import org.apache.tika.config.Param; import org.apache.tika.config.TikaConfig; +import org.apache.tika.exception.TikaConfigException; +import org.apache.tika.exception.TikaException; +import org.apache.tika.io.IOUtils; import org.apache.tika.metadata.Metadata; import org.apache.tika.mime.MediaType; import org.apache.tika.mime.MimeType; import org.apache.tika.mime.MimeTypeException; +import org.apache.tika.parser.ParseContext; +import org.apache.tika.parser.recognition.RecognisedObject; +import org.json.JSONArray; +import org.json.JSONObject; import org.slf4j.Logger; import org.slf4j.LoggerFactory; +import org.xml.sax.ContentHandler; +import org.xml.sax.SAXException; /** * Tensor Flow video recogniser which has high performance. * This implementation uses Tensorflow via REST API. 
* - * NOTE : //TODO: link to wiki page here + * NOTE : https://wiki.apache.org/tika/TikaAndVisionVideo * * @since Apache Tika 1.15 */ -public class TensorflowRESTVideoRecogniser extends TensorflowRESTRecogniser{ +public class TensorflowRESTVideoRecogniser extends TensorflowRESTRecogniser { Review comment: - [ ] Extra space This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Standardizing current Object Recognition REST parsers > - > > Key: TIKA-2400 > URL: https://issues.apache.org/jira/browse/TIKA-2400 > Project: Tika > Issue Type: Sub-task > Components: parser >Reporter: Thejan Wijesinghe >Priority: Minor > Fix For: 1.17 > > > # This involves adding apiBaseUris and refactoring current Object Recognition > REST parsers, > # Refactoring dockerfiles related to those parsers. > # Moving the logic related to checking minimum confidence into servers -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (TIKA-2400) Standardizing current Object Recognition REST parsers
[ https://issues.apache.org/jira/browse/TIKA-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16172792#comment-16172792 ] ASF GitHub Bot commented on TIKA-2400: -- smadha commented on a change in pull request #208: Fix for TIKA-2400 Standardizing current Object Recognition REST parsers URL: https://github.com/apache/tika/pull/208#discussion_r139329533 ## File path: tika-parsers/src/main/java/org/apache/tika/parser/recognition/ObjectRecognitionParser.java ## @@ -83,12 +83,6 @@ public int compare(RecognisedObject o1, RecognisedObject o2) { } }; -@Field -private double minConfidence = 0.05; Review comment: If you plan to pass it in the REST URI, then please mention that somewhere in the comments too. Also, the comment here needs to be updated as well - https://github.com/ThejanW/tika/blob/92c65e0a43e7f09a0566bec34f352314dffe5def/tika-parsers/src/main/java/org/apache/tika/parser/recognition/ObjectRecognitionParser.java#L60 This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Standardizing current Object Recognition REST parsers > - > > Key: TIKA-2400 > URL: https://issues.apache.org/jira/browse/TIKA-2400 > Project: Tika > Issue Type: Sub-task > Components: parser >Reporter: Thejan Wijesinghe >Priority: Minor > Fix For: 1.17 > > > # This involves adding apiBaseUris and refactoring current Object Recognition > REST parsers, > # Refactoring dockerfiles related to those parsers. > # Moving the logic related to checking minimum confidence into servers -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (TIKA-2400) Standardizing current Object Recognition REST parsers
[ https://issues.apache.org/jira/browse/TIKA-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16172794#comment-16172794 ] ASF GitHub Bot commented on TIKA-2400: -- smadha commented on a change in pull request #208: Fix for TIKA-2400 Standardizing current Object Recognition REST parsers URL: https://github.com/apache/tika/pull/208#discussion_r139329240 ## File path: tika-parsers/src/main/resources/org/apache/tika/parser/recognition/tf/video_util.py ## @@ -1,5 +1,5 @@ #!/usr/bin/env python -# Review comment: I guess there are very few actual changes in this file; most of the diff is extra spaces and new lines. Your code is great, but I'd suggest fewer whitespace-only changes in future, as that keeps the focus on the actual change. Makes sense? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Standardizing current Object Recognition REST parsers > - > > Key: TIKA-2400 > URL: https://issues.apache.org/jira/browse/TIKA-2400 > Project: Tika > Issue Type: Sub-task > Components: parser >Reporter: Thejan Wijesinghe >Priority: Minor > Fix For: 1.17 > > > # This involves adding apiBaseUris and refactoring current Object Recognition > REST parsers, > # Refactoring dockerfiles related to those parsers. > # Moving the logic related to checking minimum confidence into servers -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (TIKA-2400) Standardizing current Object Recognition REST parsers
[ https://issues.apache.org/jira/browse/TIKA-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16172798#comment-16172798 ] ASF GitHub Bot commented on TIKA-2400: -- smadha commented on a change in pull request #208: Fix for TIKA-2400 Standardizing current Object Recognition REST parsers URL: https://github.com/apache/tika/pull/208#discussion_r139325811 ## File path: tika-parsers/src/main/java/org/apache/tika/parser/recognition/ObjectRecognitionParser.java ## @@ -140,29 +133,17 @@ public synchronized void parse(InputStream stream, ContentHandler handler, Metad for (RecognisedObject object : objects) { if (object instanceof CaptionObject) { if (xhtmlStartVal == null) xhtmlStartVal = "captions"; -LOG.debug("Add {}", object); -String mdValue = String.format(Locale.ENGLISH, "%s (%.5f)", -object.getLabel(), object.getConfidence()); -metadata.add(MD_KEY_IMG_CAP, mdValue); -acceptedObjects.add(object); +String mdVal = String.format(Locale.ENGLISH, "%s (%.5f)", object.getLabel(), object.getConfidence()); +metadata.add(MD_KEY_IMG_CAP, mdVal); xhtmlIds.add(String.valueOf(count++)); } else { if (xhtmlStartVal == null) xhtmlStartVal = "objects"; -if (object.getConfidence() >= minConfidence) { -count++; -LOG.info("Add {}", object); -String mdValue = String.format(Locale.ENGLISH, "%s (%.5f)", -object.getLabel(), object.getConfidence()); -metadata.add(MD_KEY_OBJ_REC, mdValue); -acceptedObjects.add(object); -xhtmlIds.add(object.getId()); -if (count >= topN) { -break; -} -} else { -LOG.warn("Object {} confidence {} less than min {}", object, object.getConfidence(), minConfidence); -} +String mdVal = String.format(Locale.ENGLISH, "%s (%.5f)", object.getLabel(), object.getConfidence()); +metadata.add(MD_KEY_OBJ_REC, mdVal); +xhtmlIds.add(object.getId()); } +LOG.info("Add {}", object); Review comment: - [ ] Thanks for following good logging practice if using `{}`. 
It would be great if you could remove the String concatenation from [`RecognisedObject.toString`](https://github.com/ThejanW/tika/blob/92c65e0a43e7f09a0566bec34f352314dffe5def/tika-parsers/src/main/java/org/apache/tika/parser/recognition/RecognisedObject.java#L84-L90) and use `StringBuffer` or `String.format` instead. You can do it through the IDE with a few clicks. Thanks in advance for the cleanup. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Standardizing current Object Recognition REST parsers > - > > Key: TIKA-2400 > URL: https://issues.apache.org/jira/browse/TIKA-2400 > Project: Tika > Issue Type: Sub-task > Components: parser >Reporter: Thejan Wijesinghe >Priority: Minor > Fix For: 1.17 > > > # This involves adding apiBaseUris and refactoring current Object Recognition > REST parsers, > # Refactoring dockerfiles related to those parsers. > # Moving the logic related to checking minimum confidence into servers -- This message was sent by Atlassian JIRA (v6.4.14#64029)
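A `toString` built with `String.format` rather than concatenation, as requested in this comment, might look like the sketch below. The class is a hypothetical, simplified stand-in for `RecognisedObject`: the field set (`label`, `labelLang`, `id`, `confidence`) and the output layout are assumptions chosen for illustration, not the actual Tika implementation.

```java
import java.util.Locale;

final class RecognisedObjectSketch {

    private final String label;
    private final String labelLang;
    private final String id;
    private final double confidence;

    RecognisedObjectSketch(String label, String labelLang, String id, double confidence) {
        this.label = label;
        this.labelLang = labelLang;
        this.id = id;
        this.confidence = confidence;
    }

    @Override
    public String toString() {
        // A fixed locale keeps the decimal separator stable across environments.
        return String.format(Locale.ENGLISH,
                "RecognisedObject{label='%s' (%s), id='%s', confidence=%.5f}",
                label, labelLang, id, confidence);
    }

    public static void main(String[] args) {
        System.out.println(new RecognisedObjectSketch("tabby cat", "en", "n02123045", 0.5));
    }
}
```

Because `LOG.info("Add {}", object)` defers the `toString` call until the message is actually emitted, the formatting cost is only paid when the log level is enabled.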
[jira] [Commented] (TIKA-2400) Standardizing current Object Recognition REST parsers
[ https://issues.apache.org/jira/browse/TIKA-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16172791#comment-16172791 ] ASF GitHub Bot commented on TIKA-2400: -- smadha commented on a change in pull request #208: Fix for TIKA-2400 Standardizing current Object Recognition REST parsers URL: https://github.com/apache/tika/pull/208#discussion_r139326176 ## File path: tika-parsers/src/main/java/org/apache/tika/parser/recognition/tf/TensorflowRESTVideoRecogniser.java ## @@ -17,63 +17,91 @@ package org.apache.tika.parser.recognition.tf; +import java.io.ByteArrayOutputStream; import java.io.IOException; import java.io.InputStream; import java.net.URI; -import java.util.Collections; +import java.util.Locale; +import java.util.Map; import java.util.Set; +import java.util.Collections; +import java.util.HashSet; import javax.ws.rs.core.UriBuilder; +import org.apache.http.HttpResponse; +import org.apache.http.client.methods.HttpGet; +import org.apache.http.client.methods.HttpPost; +import org.apache.http.entity.ByteArrayEntity; +import org.apache.http.impl.client.DefaultHttpClient; import org.apache.tika.Tika; import org.apache.tika.config.Field; +import org.apache.tika.config.Param; import org.apache.tika.config.TikaConfig; +import org.apache.tika.exception.TikaConfigException; +import org.apache.tika.exception.TikaException; +import org.apache.tika.io.IOUtils; import org.apache.tika.metadata.Metadata; import org.apache.tika.mime.MediaType; import org.apache.tika.mime.MimeType; import org.apache.tika.mime.MimeTypeException; +import org.apache.tika.parser.ParseContext; +import org.apache.tika.parser.recognition.RecognisedObject; +import org.json.JSONArray; +import org.json.JSONObject; import org.slf4j.Logger; import org.slf4j.LoggerFactory; +import org.xml.sax.ContentHandler; +import org.xml.sax.SAXException; /** * Tensor Flow video recogniser which has high performance. * This implementation uses Tensorflow via REST API. 
* - * NOTE : //TODO: link to wiki page here + * NOTE : https://wiki.apache.org/tika/TikaAndVisionVideo Review comment: This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Standardizing current Object Recognition REST parsers > - > > Key: TIKA-2400 > URL: https://issues.apache.org/jira/browse/TIKA-2400 > Project: Tika > Issue Type: Sub-task > Components: parser >Reporter: Thejan Wijesinghe >Priority: Minor > Fix For: 1.17 > > > # This involves adding apiBaseUris and refactoring current Object Recognition > REST parsers, > # Refactoring dockerfiles related to those parsers. > # Moving the logic related to checking minimum confidence into servers -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (TIKA-2400) Standardizing current Object Recognition REST parsers
[ https://issues.apache.org/jira/browse/TIKA-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16172800#comment-16172800 ] ASF GitHub Bot commented on TIKA-2400: -- smadha commented on a change in pull request #208: Fix for TIKA-2400 Standardizing current Object Recognition REST parsers URL: https://github.com/apache/tika/pull/208#discussion_r139326006 ## File path: tika-parsers/src/main/java/org/apache/tika/parser/recognition/tf/TensorflowRESTRecogniser.java ## @@ -73,19 +74,27 @@ /** * Maximum buffer size for image */ -private static final String LABEL_LANG = "en"; +protected static final String LABEL_LANG = "en"; @Field -private URI apiUri = URI.create("http://localhost:8764/inception/v4/classify?topk=10"); +protected URI apiBaseUri = URI.create("http://localhost:8764/inception/v4"); + +@Field +protected int topN = 2; + @Field -private URI healthUri = URI.create("http://localhost:8764/inception/v4/ping"); +protected double minConfidence = 0.015; + +protected URI apiUri; + +protected URI healthUri; Review comment: - [ ] Why remove default value? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Standardizing current Object Recognition REST parsers > - > > Key: TIKA-2400 > URL: https://issues.apache.org/jira/browse/TIKA-2400 > Project: Tika > Issue Type: Sub-task > Components: parser >Reporter: Thejan Wijesinghe >Priority: Minor > Fix For: 1.17 > > > # This involves adding apiBaseUris and refactoring current Object Recognition > REST parsers, > # Refactoring dockerfiles related to those parsers. > # Moving the logic related to checking minimum confidence into servers -- This message was sent by Atlassian JIRA (v6.4.14#64029)
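One plausible reading of the diff above is that `apiUri` and `healthUri` lose their defaults because they are now derived from `apiBaseUri`, `topN` and `minConfidence` at initialization time. The sketch below illustrates that idea under stated assumptions: the `/ping` and `/classify` paths come from the old default URIs, but the `topn` and `min_confidence` query parameter names are hypothetical, and the PR itself may build the URIs differently.

```java
import java.net.URI;
import java.util.Locale;

final class ApiUriSketch {

    /** Health-check endpoint under the base URI ("/ping" as in the old default). */
    static URI healthUri(URI apiBaseUri) {
        return URI.create(apiBaseUri + "/ping");
    }

    /**
     * Classification endpoint carrying the limits to the server.
     * The query parameter names are assumptions made for this sketch.
     */
    static URI classifyUri(URI apiBaseUri, int topN, double minConfidence) {
        return URI.create(String.format(Locale.ENGLISH,
                "%s/classify?topn=%d&min_confidence=%s",
                apiBaseUri, topN, minConfidence));
    }

    public static void main(String[] args) {
        URI base = URI.create("http://localhost:8764/inception/v4");
        System.out.println(healthUri(base));
        System.out.println(classifyUri(base, 2, 0.015));
    }
}
```

With a single configurable base URI, users only need to override one field to point the parser at a different host or model version.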
[jira] [Commented] (TIKA-2400) Standardizing current Object Recognition REST parsers
[ https://issues.apache.org/jira/browse/TIKA-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16172803#comment-16172803 ] ASF GitHub Bot commented on TIKA-2400: -- smadha commented on a change in pull request #208: Fix for TIKA-2400 Standardizing current Object Recognition REST parsers URL: https://github.com/apache/tika/pull/208#discussion_r139326151 ## File path: tika-parsers/src/main/java/org/apache/tika/parser/recognition/tf/TensorflowRESTRecogniser.java ## @@ -160,4 +175,4 @@ public void checkInitialization(InitializableProblemHandler handler) LOG.debug("Num Objects found {}", recObjs.size()); return recObjs; } -} +} Review comment: - [ ] Extra line break This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Standardizing current Object Recognition REST parsers > - > > Key: TIKA-2400 > URL: https://issues.apache.org/jira/browse/TIKA-2400 > Project: Tika > Issue Type: Sub-task > Components: parser >Reporter: Thejan Wijesinghe >Priority: Minor > Fix For: 1.17 > > > # This involves adding apiBaseUris and refactoring current Object Recognition > REST parsers, > # Refactoring dockerfiles related to those parsers. > # Moving the logic related to checking minimum confidence into servers -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (TIKA-2400) Standardizing current Object Recognition REST parsers
[ https://issues.apache.org/jira/browse/TIKA-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16172801#comment-16172801 ] ASF GitHub Bot commented on TIKA-2400: -- smadha commented on a change in pull request #208: Fix for TIKA-2400 Standardizing current Object Recognition REST parsers URL: https://github.com/apache/tika/pull/208#discussion_r139325945 ## File path: tika-parsers/src/main/java/org/apache/tika/parser/recognition/tf/TensorflowRESTRecogniser.java ## @@ -73,19 +74,27 @@ /** * Maximum buffer size for image */ -private static final String LABEL_LANG = "en"; +protected static final String LABEL_LANG = "en"; Review comment: - [ ] It would be great if you could put the reason for this change in a comment, so that no one changes it in the future. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Standardizing current Object Recognition REST parsers > - > > Key: TIKA-2400 > URL: https://issues.apache.org/jira/browse/TIKA-2400 > Project: Tika > Issue Type: Sub-task > Components: parser >Reporter: Thejan Wijesinghe >Priority: Minor > Fix For: 1.17 > > > # This involves adding apiBaseUris and refactoring current Object Recognition > REST parsers, > # Refactoring dockerfiles related to those parsers. > # Moving the logic related to checking minimum confidence into servers -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (TIKA-2400) Standardizing current Object Recognition REST parsers
[ https://issues.apache.org/jira/browse/TIKA-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16172793#comment-16172793 ] ASF GitHub Bot commented on TIKA-2400: -- smadha commented on a change in pull request #208: Fix for TIKA-2400 Standardizing current Object Recognition REST parsers URL: https://github.com/apache/tika/pull/208#discussion_r139325921 ## File path: tika-parsers/src/main/java/org/apache/tika/parser/recognition/tf/TensorflowRESTRecogniser.java ## @@ -56,7 +57,7 @@ * Tensor Flow image recogniser which has high performance. * This implementation uses Tensorflow via REST API. * - * NOTE : //TODO: link to wiki page here + * NOTE : https://wiki.apache.org/tika/TikaAndVision Review comment: Thanks This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Standardizing current Object Recognition REST parsers > - > > Key: TIKA-2400 > URL: https://issues.apache.org/jira/browse/TIKA-2400 > Project: Tika > Issue Type: Sub-task > Components: parser >Reporter: Thejan Wijesinghe >Priority: Minor > Fix For: 1.17 > > > # This involves adding apiBaseUris and refactoring current Object Recognition > REST parsers, > # Refactoring dockerfiles related to those parsers. > # Moving the logic related to checking minimum confidence into servers -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (TIKA-2400) Standardizing current Object Recognition REST parsers
[ https://issues.apache.org/jira/browse/TIKA-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16172795#comment-16172795 ] ASF GitHub Bot commented on TIKA-2400: -- smadha commented on a change in pull request #208: Fix for TIKA-2400 Standardizing current Object Recognition REST parsers URL: https://github.com/apache/tika/pull/208#discussion_r139326197 ## File path: tika-parsers/src/main/java/org/apache/tika/parser/recognition/tf/TensorflowRESTVideoRecogniser.java ## @@ -17,63 +17,91 @@ package org.apache.tika.parser.recognition.tf; +import java.io.ByteArrayOutputStream; import java.io.IOException; import java.io.InputStream; import java.net.URI; -import java.util.Collections; +import java.util.Locale; +import java.util.Map; import java.util.Set; +import java.util.Collections; +import java.util.HashSet; import javax.ws.rs.core.UriBuilder; +import org.apache.http.HttpResponse; +import org.apache.http.client.methods.HttpGet; +import org.apache.http.client.methods.HttpPost; +import org.apache.http.entity.ByteArrayEntity; +import org.apache.http.impl.client.DefaultHttpClient; import org.apache.tika.Tika; import org.apache.tika.config.Field; +import org.apache.tika.config.Param; import org.apache.tika.config.TikaConfig; +import org.apache.tika.exception.TikaConfigException; +import org.apache.tika.exception.TikaException; +import org.apache.tika.io.IOUtils; import org.apache.tika.metadata.Metadata; import org.apache.tika.mime.MediaType; import org.apache.tika.mime.MimeType; import org.apache.tika.mime.MimeTypeException; +import org.apache.tika.parser.ParseContext; +import org.apache.tika.parser.recognition.RecognisedObject; +import org.json.JSONArray; +import org.json.JSONObject; import org.slf4j.Logger; import org.slf4j.LoggerFactory; +import org.xml.sax.ContentHandler; +import org.xml.sax.SAXException; /** * Tensor Flow video recogniser which has high performance. * This implementation uses Tensorflow via REST API. 
* - * NOTE : //TODO: link to wiki page here + * NOTE : https://wiki.apache.org/tika/TikaAndVisionVideo * * @since Apache Tika 1.15 */ -public class TensorflowRESTVideoRecogniser extends TensorflowRESTRecogniser{ +public class TensorflowRESTVideoRecogniser extends TensorflowRESTRecogniser { -private static final Logger LOG = LoggerFactory.getLogger(TensorflowRESTRecogniser.class); +private static final Logger LOG = LoggerFactory.getLogger(TensorflowRESTVideoRecogniser.class); Review comment: Super thanks This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Standardizing current Object Recognition REST parsers > - > > Key: TIKA-2400 > URL: https://issues.apache.org/jira/browse/TIKA-2400 > Project: Tika > Issue Type: Sub-task > Components: parser >Reporter: Thejan Wijesinghe >Priority: Minor > Fix For: 1.17 > > > # This involves adding apiBaseUris and refactoring current Object Recognition > REST parsers, > # Refactoring dockerfiles related to those parsers. > # Moving the logic related to checking minimum confidence into servers -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (TIKA-2400) Standardizing current Object Recognition REST parsers
[ https://issues.apache.org/jira/browse/TIKA-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16171165#comment-16171165 ] ASF GitHub Bot commented on TIKA-2400: -- ThejanW commented on a change in pull request #208: Fix for TIKA-2400 Standardizing current Object Recognition REST parsers URL: https://github.com/apache/tika/pull/208#discussion_r139603908 ## File path: tika-parsers/src/main/resources/org/apache/tika/parser/recognition/tf/InceptionVideoRestDockerfile ## @@ -61,31 +48,22 @@ RUN make -j4 RUN make install WORKDIR / - -# Install tensorflow and other dependencies -RUN \ - pip install --upgrade https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.0.1-cp27-none-linux_x86_64.whl --ignore-installed && \ - pip install flask requests pillow - -# Get the TF-slim dependencies -# Downloading from a specific commit for future compatibility -RUN wget https://github.com/tensorflow/models/archive/c15fada28113eca32dc98d6e3bec4755d0d5b4c2.zip - -RUN unzip c15fada28113eca32dc98d6e3bec4755d0d5b4c2.zip - RUN \ - wget https://raw.githubusercontent.com/apache/tika/master/tika-parsers/src/main/resources/org/apache/tika/parser/recognition/tf/inceptionapi.py -O /usr/bin/inceptionapi.py && \ - wget https://raw.githubusercontent.com/apache/tika/master/tika-parsers/src/main/resources/org/apache/tika/parser/recognition/tf/video_util.py -O /usr/bin/video_util.py && \ + wget https://raw.githubusercontent.com/ThejanW/tika/master/tika-parsers/src/main/resources/org/apache/tika/parser/recognition/tf/inceptionapi.py -O /usr/bin/inceptionapi.py && \ Review comment: Will do once merged :+1: This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Standardizing current Object Recognition REST parsers > - > > Key: TIKA-2400 > URL: https://issues.apache.org/jira/browse/TIKA-2400 > Project: Tika > Issue Type: Sub-task > Components: parser >Reporter: Thejan Wijesinghe >Priority: Minor > Fix For: 1.17 > > > # This involves adding apiBaseUris and refactoring current Object Recognition > REST parsers, > # Refactoring dockerfiles related to those parsers. > # Moving the logic related to checking minimum confidence into servers -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (TIKA-2400) Standardizing current Object Recognition REST parsers
[ https://issues.apache.org/jira/browse/TIKA-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16171164#comment-16171164 ] ASF GitHub Bot commented on TIKA-2400: -- ThejanW commented on a change in pull request #208: Fix for TIKA-2400 Standardizing current Object Recognition REST parsers URL: https://github.com/apache/tika/pull/208#discussion_r139603897 ## File path: tika-parsers/src/main/resources/org/apache/tika/parser/captioning/tf/Im2txtRestDockerfile ## @@ -46,7 +43,7 @@ RUN \ wget https://raw.githubusercontent.com/apache/tika/master/tika-parsers/src/main/resources/org/apache/tika/parser/captioning/tf/caption_generator.py \ -O caption_generator.py && \ -wget https://raw.githubusercontent.com/apache/tika/master/tika-parsers/src/main/resources/org/apache/tika/parser/captioning/tf/im2txtapi.py \ +wget https://raw.githubusercontent.com/ThejanW/tika/master/tika-parsers/src/main/resources/org/apache/tika/parser/captioning/tf/im2txtapi.py \ Review comment: will do once, merged :+1:
[jira] [Commented] (TIKA-2400) Standardizing current Object Recognition REST parsers
[ https://issues.apache.org/jira/browse/TIKA-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16170701#comment-16170701 ] ASF GitHub Bot commented on TIKA-2400: -- chrismattmann commented on a change in pull request #208: Fix for TIKA-2400 Standardizing current Object Recognition REST parsers URL: https://github.com/apache/tika/pull/208#discussion_r139537923 ## File path: tika-parsers/src/main/resources/org/apache/tika/parser/recognition/tf/InceptionVideoRestDockerfile ## @@ -61,31 +48,22 @@ RUN make -j4 RUN make install WORKDIR / - -# Install tensorflow and other dependencies -RUN \ - pip install --upgrade https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.0.1-cp27-none-linux_x86_64.whl --ignore-installed && \ - pip install flask requests pillow - -# Get the TF-slim dependencies -# Downloading from a specific commit for future compatibility -RUN wget https://github.com/tensorflow/models/archive/c15fada28113eca32dc98d6e3bec4755d0d5b4c2.zip - -RUN unzip c15fada28113eca32dc98d6e3bec4755d0d5b4c2.zip - RUN \ - wget https://raw.githubusercontent.com/apache/tika/master/tika-parsers/src/main/resources/org/apache/tika/parser/recognition/tf/inceptionapi.py -O /usr/bin/inceptionapi.py && \ - wget https://raw.githubusercontent.com/apache/tika/master/tika-parsers/src/main/resources/org/apache/tika/parser/recognition/tf/video_util.py -O /usr/bin/video_util.py && \ + wget https://raw.githubusercontent.com/ThejanW/tika/master/tika-parsers/src/main/resources/org/apache/tika/parser/recognition/tf/inceptionapi.py -O /usr/bin/inceptionapi.py && \ Review comment: reminder to change back after applying
[jira] [Commented] (TIKA-2400) Standardizing current Object Recognition REST parsers
[ https://issues.apache.org/jira/browse/TIKA-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16170700#comment-16170700 ] ASF GitHub Bot commented on TIKA-2400: -- chrismattmann commented on a change in pull request #208: Fix for TIKA-2400 Standardizing current Object Recognition REST parsers URL: https://github.com/apache/tika/pull/208#discussion_r139537857 ## File path: tika-parsers/src/main/resources/org/apache/tika/parser/captioning/tf/Im2txtRestDockerfile ## @@ -46,7 +43,7 @@ RUN \ wget https://raw.githubusercontent.com/apache/tika/master/tika-parsers/src/main/resources/org/apache/tika/parser/captioning/tf/caption_generator.py \ -O caption_generator.py && \ -wget https://raw.githubusercontent.com/apache/tika/master/tika-parsers/src/main/resources/org/apache/tika/parser/captioning/tf/im2txtapi.py \ +wget https://raw.githubusercontent.com/ThejanW/tika/master/tika-parsers/src/main/resources/org/apache/tika/parser/captioning/tf/im2txtapi.py \ Review comment: Reminder this needs to be changed back
[jira] [Commented] (TIKA-2400) Standardizing current Object Recognition REST parsers
[ https://issues.apache.org/jira/browse/TIKA-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16170102#comment-16170102 ] ASF GitHub Bot commented on TIKA-2400: -- ThejanW commented on issue #208: Fix for TIKA-2400 Standardizing current Object Recognition REST parsers URL: https://github.com/apache/tika/pull/208#issuecomment-330246151 @chrismattmann @thammegowda yeah! lemme configure docker builds in uscdatascience :+1:
[jira] [Commented] (TIKA-2400) Standardizing current Object Recognition REST parsers
[ https://issues.apache.org/jira/browse/TIKA-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16169526#comment-16169526 ] ASF GitHub Bot commented on TIKA-2400: -- chrismattmann commented on issue #208: Fix for TIKA-2400 Standardizing current Object Recognition REST parsers URL: https://github.com/apache/tika/pull/208#issuecomment-330115750 Yes, please use @uscdatascience, thanks dudes
[jira] [Commented] (TIKA-2400) Standardizing current Object Recognition REST parsers
[ https://issues.apache.org/jira/browse/TIKA-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16169501#comment-16169501 ] ASF GitHub Bot commented on TIKA-2400: -- thammegowda commented on issue #208: Fix for TIKA-2400 Standardizing current Object Recognition REST parsers URL: https://github.com/apache/tika/pull/208#issuecomment-330109421 @ThejanW Great work. Looks like you have done a lot of cleaning, so here is another: please publish these docker images under some organization. Since we cannot use the `apache` organization on Docker Hub, let's just use https://hub.docker.com/u/uscdatascience/. I gave you all the permissions.
[jira] [Commented] (TIKA-2400) Standardizing current Object Recognition REST parsers
[ https://issues.apache.org/jira/browse/TIKA-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16169479#comment-16169479 ] ASF GitHub Bot commented on TIKA-2400: -- smadha commented on a change in pull request #208: Fix for TIKA-2400 Standardizing current Object Recognition REST parsers URL: https://github.com/apache/tika/pull/208#discussion_r139325308 ## File path: tika-parsers/src/main/java/org/apache/tika/parser/recognition/ObjectRecognitionParser.java ## @@ -83,12 +83,6 @@ public int compare(RecognisedObject o1, RecognisedObject o2) { } }; -@Field Review comment: Any specific reason to remove `minConfidence` and `topN`?
[jira] [Commented] (TIKA-2400) Standardizing current Object Recognition REST parsers
[ https://issues.apache.org/jira/browse/TIKA-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16169285#comment-16169285 ] ASF GitHub Bot commented on TIKA-2400: -- ThejanW opened a new pull request #208: Fix for TIKA-2400 Standardizing current Object Recognition REST parsers URL: https://github.com/apache/tika/pull/208

**This PR consists of:**
1. Reformatting of inceptionapi.py, im2txtapi.py and the related Java clients
2. Checking minimum confidence on the server side instead of the client side in the Object Recognition REST parsers
3. Refactoring of the Dockerfiles

**How to test?**
1. `docker run -it -p 8764:8764 thejanw/inception-rest-tika` - then run the tests in **ObjectRecognitionParserTest** class
2. `docker run -it -p 8764:8764 thejanw/im2txt-rest-tika` - then run the tests in **ObjectRecognitionParserTest** class
3. `docker run -it -p 8764:8764 thejanw/inception-video-rest-tika` - then run the tests in **TensorflowVideoRecParserTest** class
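To illustrate what checking minimum confidence on the server side could look like, here is a minimal Python sketch. It is not the code from this PR: the function name `filter_predictions`, the `(label, confidence)` tuple shape, and the default values for `min_confidence` and `top_n` are all assumptions made for the example. The idea is only that the server drops low-confidence results and truncates the list before responding, so the Java client no longer needs its own `minConfidence` and `topN` fields.

```python
from typing import List, Tuple

# Hypothetical shape of one prediction: (label, confidence in [0, 1]).
Prediction = Tuple[str, float]


def filter_predictions(predictions: List[Prediction],
                       min_confidence: float = 0.05,
                       top_n: int = 5) -> List[Prediction]:
    """Keep at most top_n predictions whose confidence is >= min_confidence.

    The defaults here are illustrative, not values taken from Tika.
    """
    # Drop everything below the threshold first.
    kept = [p for p in predictions if p[1] >= min_confidence]
    # Highest confidence first; sort is stable, so ties keep input order.
    kept.sort(key=lambda p: p[1], reverse=True)
    return kept[:top_n]


if __name__ == "__main__":
    raw = [("tabby cat", 0.62), ("sock", 0.01), ("tiger cat", 0.21), ("lynx", 0.04)]
    for label, confidence in filter_predictions(raw, min_confidence=0.05, top_n=2):
        print(f"{label}: {confidence:.2f}")
```

In a REST server the two parameters would typically be read from the request (for example as query parameters) and passed to a helper like this before the response is serialized; how the actual servers expose them is not shown in this thread.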